ah, sorry - I forgot to mention the ShadowLog, which also stopped at 2:35 [1] with function doWriteEvent() failing to write logs. AFAIS disk space/inodes have been sufficient and memory consumption stayed also normal [1] 10/17/16 02:35:16 (262870.0) (1731674): WriteUserLog checking for event log rotation, but no lock 10/17/16 02:35:16 (262870.0) (1731674): WriteUserLog checking for event log rotation, but no lock 10/17/16 02:35:18 (262865.0) (1729344): WriteUserLog checking for event log rotation, but no lock 10/17/16 02:35:18 (262865.0) (1729344): WARNING: WriteUserLog::writeEvent global doWriteEvent() failed on global log! The global event log will be missing an event. 10/17/16 02:35:18 (262865.0) (1729344): WriteUserLog checking for event log rotation, but no lock 10/17/16 02:35:18 (262813.0) (1731484): WriteUserLog checking for event log rotation, but no lock 10/17/16 02:35:18 (262813.0) (1731484): WARNING: WriteUserLog::writeEvent global doWriteEvent() failed on global log! The global event log will be missing an event. 10/17/16 02:35:18 (262813.0) (1731484): WriteUserLog checking for event log rotation, but no lock 10/17/16 02:35:20 (255809.0) (1730692): WriteUserLog checking for event log rotation, but no lock 10/17/16 02:35:20 (255809.0) (1730692): WARNING: WriteUserLog::writeEvent global doWriteEvent() failed on global log! The global event log will be missing an event. 10/17/16 02:35:20 (255809.0) (1730692): WriteUserLog checking for event log rotation, but no lock 10/17/16 02:35:25 (255722.0) (1731636): WriteUserLog checking for event log rotation, but no lock 10/17/16 02:35:25 (255722.0) (1731636): WARNING: WriteUserLog::writeEvent global doWriteEvent() failed on global log! The global event log will be missing an event. 10/17/16 02:35:25 (255722.0) (1731636): WriteUserLog checking for event log rotation, but no lock 10/17/16 02:35:26 (255721.0) (1731158): WriteUserLog checking for event log rotation, but no lock 10/17/16 02:35:26 (255721.0) (1731158): WARNING: WriteUserLog::writeEvent global doWriteEvent() failed on global log! The global event log will be missing an event. 10/17/16 02:35:26 (255721.0) (1731158): WriteUserLog checking for event log rotation, but no lock 10/17/16 02:35:26 (255737.0) (1731084): Write10/17/16 10:21:41 ****************************************************** 10/17/16 10:21:41 ** condor_shadow (CONDOR_SHADOW) STARTING UP On 2016-10-17 11:44, Thomas Hartmann wrote: > Hi all, > > tonight one of our schedulers (on a SL6.7 grid ARC CE) stopped working. > While I cannot rule out the ARC itself, I suspect a problem with the > Condor Schedd. > > Job submission failed around 2:00 tonight after which the SchedLog [1] > contained only of lines as > > WriteUserLog checking for event log rotation, but no lock > which occured before as well but not solely. > > A bit later at ~2:16 the MasterLog [2] started to log sched daemons to > be reaped/to die(?) exiting with code 44. Restarts of the schedd went on > for ~20m after which the MasterLog went silent until the service got > restarted. > > I found so far no information on the schedd error code 44 but only for > the shadow [3]. > There was a mail exchange >10 year ago, when a shadow exiting with error > 44 took down the schedd as well > > https://www-auth.cs.wisc.edu/lists/htcondor-users/2004-November/msg00229.shtml > but I don't know, if this could still be of relevance (I found no > "*dprintf*" cores)? > > The system logs contained no messages with obvious correlations. > > Maybe somebody has an idea how to debug further what went wrong? > > Cheers and thanks, > Thomas > > > [1] >> condor/SchedLog > 1856 10/17/16 01:59:54 (pid:4168039) WriteUserLog checking for event > log rotation, but no lock > 1857 10/17/16 01:59:54 (pid:4168039) WriteUserLog checking for event > log rotation, but no lock > 1858 10/17/16 01:59:54 (pid:4168039) WriteUserLog checking for event > log rotation, but no lock > 1859 10/17/16 02:01:02 (pid:4168039) WriteUserLog checking for event > log rotation, but no lock > 1860 10/17/16 02:01:02 (pid:4168039) WriteUserLog checking for event > log rotation, but no lock > > ... > > 10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log > rotation, but no lock > 10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log > rotation, but no lock > 10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log > rotation, but no lock > 10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log > rotation, but no lock > 10/17/16 10:42:19 (pid:1968938) Starting add_shadow_birthdate(253638.0) > 10/17/16 10:42:19 (pid:1968938) Started shadow for job 253638.0 on > <131.169.161.82:9620?addrs=131.169.161.82-9620&noUDP&sock=5184_1b9d_3> > for group_CMS.cms_multicore.cmsplt036, (shadow pid = 1969011) > 10/17/16 10:42:19 (pid:1968938) TransferQueueManager stats: active > up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s > 10/17/16 10:42:19 (pid:1968938) TransferQueueManager upload 1m I/O load: > 0 bytes/s 0.000 disk load 0.000 net load > 10/17/16 10:42:19 (pid:1968938) TransferQueueManager download 1m I/O > load: 0 bytes/s 0.000 disk load 0.000 net load > 10/17/16 10:42:19 (pid:1968938) Sent ad to central manager for > group_BELLE2.other.belleprd001@xxxxxxx > 10/17/16 10:42:19 (pid:1968938) Sent ad to 1 collectors for > group_BELLE2.other.belleprd001@xxxxxxx > > > [2] > condor/MasterLog > 10/15/16 12:38:39 DefaultReaper unexpectedly called on pid 1197280, > status 0. > 10/16/16 12:38:39 Preen pid is 3854283 > 10/16/16 12:38:39 DefaultReaper unexpectedly called on pid 3854283, > status 0. > 10/17/16 02:16:36 DefaultReaper unexpectedly called on pid 4168039, > status 11264. > 10/17/16 02:16:36 The SCHEDD (pid 4168039) exited with status 44 > 10/17/16 02:16:37 Sending obituary for "/usr/sbin/condor_schedd" > 10/17/16 02:16:37 restarting /usr/sbin/condor_schedd in 10 seconds > 10/17/16 02:16:47 Started DaemonCore process "/usr/sbin/condor_schedd", > pid and pgroup = 1712607 > 10/17/16 02:28:34 DefaultReaper unexpectedly called on pid 1712607, > status 1024. > 10/17/16 02:28:34 The SCHEDD (pid 1712607) exited with status 4 > 10/17/16 02:28:34 Sending obituary for "/usr/sbin/condor_schedd" > 10/17/16 02:28:34 restarting /usr/sbin/condor_schedd in 10 seconds > 10/17/16 02:28:44 Started DaemonCore process "/usr/sbin/condor_schedd", > pid and pgroup = 1728616 > 10/17/16 02:35:26 DefaultReaper unexpectedly called on pid 1728616, > status 11264. > 10/17/16 02:35:26 The SCHEDD (pid 1728616) exited with status 44 > ... > 10/17/16 02:35:48 SharedPortEndpoint: failed to open > /var/lock/condor/shared_port_ad: No such file or directory > 10/17/16 02:35:48 Started DaemonCore process "/usr/sbin/condor_schedd", > pid and pgroup = 1739389 > 10/17/16 02:35:48 Collector port not defined, will use default: 9618 > 10/17/16 02:35:48 Started DaemonCore process > "/usr/libexec/condor/condor_shared_port", pid and pgroup = 1739390 > 10/17/16 02:35:48 DefaultReaper unexpectedly called on pid 1739389, > status 11264. > 10/17/16 02:35:48 The SCHEDD (pid 1739389) exited with status 44 > 10/17/16 02:35:48 Sending obituary for "/usr/sbin/condor_schedd" > 10/17/16 02:35:49 restarting /usr/sbin/condor_schedd in 13 seconds > 10/17/16 02:35:49 DefaultReaper unexpectedly called on pid 1739390, > status 11264. > 10/17/16 02:35:49 The SHARED_PORT (pid 1739390) exited with status 44 > 10/17/16 02:35:49 Sending obituary for > "/usr/libexec/condor/condor_shared_port" > 10/17/16 02:35:49 restarting /usr/libexec/condor/condor_shared_port in > 13 seconds > 10/17/16 02:36:02 SharedPortEndpoint: failed to open > /var/lock/condor/shared_port_ad: No such file or directory > 10/17/16 02:36:02 Started DaemonCore process "/usr/sbin/condor_schedd", > pid and pgroup = 1739409 > 10/17/16 02:36:02 Collector port not defined, will use default: 9618 > 10/17/16 02:36:02 Started DaemonCore process > "/usr/libexec/condor/condor_shared_port", pid and pgroup = 1739410 > 10/17/16 02:36:02 DefaultReaper unexpectedly called on pid 1739409, > status 11264. > 10/17/16 02:36:02 The SCHEDD (pid 1739409) exited with status 44 > 10/17/16 02:36:02 Sending obituary for "/usr/sbin/condor_schedd" > 10/17/16 02:36:02 restarting /usr/sbin/condor_schedd in 17 seconds > 10/17/16 02:36:02 DefaultReaper unexpectedly called on pid 1739410, > status 11264. > 10/17/16 02:36:02 The SHARED_PORT (pid 1739410) exited with status 44 > 10/17/16 02:36:02 Sending obituary for > "/usr/libexec/condor/condor_shared_port" > 10/17/16 02:36:02 restarting /usr/libexec/condor/condor_shared_port in > 17 seconds > 10/17/16 02:36:19 SharedPortEndpoint: failed to open > /var/lock/condor/shared_port_ad: No such file or directory > 10/17/16 02:36:19 Started DaemonCore process "/usr/sbin/condor_schedd", > pid 10/17/16 10:21:02 ****************************************************** > 10/17/16 10:21:02 ** condor_master (CONDOR_MASTER) STARTING UP > > > [3] > http://pages.cs.wisc.edu/~adesmet/status.html >> shadow > 44 DPRINTF_ERROR There is a fatal error with dprintf() >> GRAM Error Codes > 44 STAGING_STDIN the job manager failed to stage the stdin file > > > > _______________________________________________ > HTCondor-users mailing list > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a > subject: Unsubscribe > You can also unsubscribe by visiting > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users > > The archives can be found at: > https://lists.cs.wisc.edu/archive/htcondor-users/ >
Attachment:
smime.p7s
Description: S/MIME Cryptographic Signature