Hi Todd,

many thanks for the lead! Actually, I may have been fooled by /var/log being a 'local' ext4 with sufficient space/inodes (disk I/O OK-ish as well), while the node is actually a VM :-/ Maybe the scheduler node got 'thwarted' by some other guest or by the underlying attached storage!? Going to check...

Cheers and thanks,
  Thomas

On 2016-10-17 19:29, Todd Tannenbaum wrote:
> On 10/17/2016 4:44 AM, Thomas Hartmann wrote:
>> Job submission failed around 2:00 last night, after which the SchedLog [1]
>> consisted only of lines such as
>>     WriteUserLog checking for event log rotation, but no lock
>> which had occurred before as well, but not exclusively.
>>
>> A bit later, at ~2:16, the MasterLog [2] started to log schedd daemons
>> being reaped/dying(?) with exit code 44. Restarts of the schedd went on
>> for ~20 min, after which the MasterLog went silent until the service was
>> restarted.
>>
>> So far I have found no information on schedd error code 44, only on
>> the shadow's [3].
>
>
> Hi Thomas,
>
> For any/all HTCondor daemons, exit code 44 means that the HTCondor
> daemon in question failed to write or rotate a log file, i.e.
> the write() system call failed. Look at the filesystems holding the
> paths for LOG, LOCK, and/or EVENT_LOG via
>   condor_config_val LOG LOCK EVENT_LOG
>
> The most likely cause is that the filesystem(s) in question were
> full, or fell offline (e.g. an NFS mount failed, if they are not local).
> BTW, I would encourage you to have these directories on local disk if
> they are not already.
>
> Hope the above helps
> Todd
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
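A minimal Python sketch of the check Todd describes, assuming Python 3.7+ on a Linux node and that `condor_config_val` prints one value per line for the names it is given. The helper names `fs_report`, `condor_paths` and `main` are hypothetical, not part of HTCondor. Besides free space, free inodes and the read-only mount flag (an ext4 mounted with `errors=remount-ro` flips to read-only after an error), it attempts a real write plus `fsync`, because, as Thomas suspects for his VM, `statvfs` numbers can look healthy while the backing storage still makes `write()` fail.

```python
#!/usr/bin/env python3
"""Sketch: check the filesystems behind HTCondor's LOG, LOCK and EVENT_LOG."""
import os
import shutil
import subprocess
import tempfile


def fs_report(path):
    """Report free space, free inodes, read-only flag and a write probe."""
    # EVENT_LOG names a file (which may not exist yet); check its directory.
    directory = path if os.path.isdir(path) else os.path.dirname(path) or "."
    st = os.statvfs(directory)
    report = {
        "path": directory,
        "free_bytes": st.f_bavail * st.f_frsize,  # space usable by non-root
        "free_inodes": st.f_favail,                # inodes usable by non-root
        "read_only": bool(st.f_flag & os.ST_RDONLY),
    }
    # statvfs can look fine while writes still fail (e.g. I/O errors from
    # the underlying storage), so try a real write that reaches the disk.
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"probe\n")
            probe.flush()
            os.fsync(probe.fileno())
        report["write_ok"] = True
    except OSError as exc:
        report["write_ok"] = False
        report["error"] = str(exc)
    return report


def condor_paths():
    """Ask condor_config_val for LOG, LOCK and EVENT_LOG.

    Assumes one value per line on stdout; anything that is not an absolute
    path (e.g. a message about an undefined knob) is ignored.
    """
    out = subprocess.run(
        ["condor_config_val", "LOG", "LOCK", "EVENT_LOG"],
        capture_output=True, text=True, check=False,
    ).stdout
    return [line.strip() for line in out.splitlines()
            if os.path.isabs(line.strip())]


def main():
    if shutil.which("condor_config_val") is None:
        print("condor_config_val not found in PATH")
        return
    for p in condor_paths():
        try:
            print(fs_report(p))
        except OSError as exc:  # e.g. the directory does not exist
            print(f"{p}: {exc}")


if __name__ == "__main__":
    main()
```

Run it as the user the daemons write as (typically `condor`), ideally while the problem is happening; a `write_ok` of `False` with an `Input/output error` would point at the storage rather than at a full disk.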