Hi all, tonight one of our schedulers (on a SL6.7 grid ARC CE) stopped working. While I cannot rule out the ARC itself, I suspect a problem with the Condor Schedd. Job submission failed around 2:00 tonight after which the SchedLog [1] contained only of lines as > WriteUserLog checking for event log rotation, but no lock which occured before as well but not solely. A bit later at ~2:16 the MasterLog [2] started to log sched daemons to be reaped/to die(?) exiting with code 44. Restarts of the schedd went on for ~20m after which the MasterLog went silent until the service got restarted. I found so far no information on the schedd error code 44 but only for the shadow [3]. There was a mail exchange >10 year ago, when a shadow exiting with error 44 took down the schedd as well https://www-auth.cs.wisc.edu/lists/htcondor-users/2004-November/msg00229.shtml but I don't know, if this could still be of relevance (I found no "*dprintf*" cores)? The system logs contained no messages with obvious correlations. Maybe somebody has an idea how to debug further what went wrong? Cheers and thanks, Thomas [1] > condor/SchedLog 1856 10/17/16 01:59:54 (pid:4168039) WriteUserLog checking for event log rotation, but no lock 1857 10/17/16 01:59:54 (pid:4168039) WriteUserLog checking for event log rotation, but no lock 1858 10/17/16 01:59:54 (pid:4168039) WriteUserLog checking for event log rotation, but no lock 1859 10/17/16 02:01:02 (pid:4168039) WriteUserLog checking for event log rotation, but no lock 1860 10/17/16 02:01:02 (pid:4168039) WriteUserLog checking for event log rotation, but no lock ... 10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log rotation, but no lock 10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log rotation, but no lock 10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log rotation, but no lock 10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log rotation, but no lock 10/17/16 10:42:19 (pid:1968938) Starting add_shadow_birthdate(253638.0) 10/17/16 10:42:19 (pid:1968938) Started shadow for job 253638.0 on <131.169.161.82:9620?addrs=131.169.161.82-9620&noUDP&sock=5184_1b9d_3> for group_CMS.cms_multicore.cmsplt036, (shadow pid = 1969011) 10/17/16 10:42:19 (pid:1968938) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s 10/17/16 10:42:19 (pid:1968938) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load 10/17/16 10:42:19 (pid:1968938) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load 10/17/16 10:42:19 (pid:1968938) Sent ad to central manager for group_BELLE2.other.belleprd001@xxxxxxx 10/17/16 10:42:19 (pid:1968938) Sent ad to 1 collectors for group_BELLE2.other.belleprd001@xxxxxxx [2] condor/MasterLog 10/15/16 12:38:39 DefaultReaper unexpectedly called on pid 1197280, status 0. 10/16/16 12:38:39 Preen pid is 3854283 10/16/16 12:38:39 DefaultReaper unexpectedly called on pid 3854283, status 0. 10/17/16 02:16:36 DefaultReaper unexpectedly called on pid 4168039, status 11264. 10/17/16 02:16:36 The SCHEDD (pid 4168039) exited with status 44 10/17/16 02:16:37 Sending obituary for "/usr/sbin/condor_schedd" 10/17/16 02:16:37 restarting /usr/sbin/condor_schedd in 10 seconds 10/17/16 02:16:47 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 1712607 10/17/16 02:28:34 DefaultReaper unexpectedly called on pid 1712607, status 1024. 10/17/16 02:28:34 The SCHEDD (pid 1712607) exited with status 4 10/17/16 02:28:34 Sending obituary for "/usr/sbin/condor_schedd" 10/17/16 02:28:34 restarting /usr/sbin/condor_schedd in 10 seconds 10/17/16 02:28:44 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 1728616 10/17/16 02:35:26 DefaultReaper unexpectedly called on pid 1728616, status 11264. 10/17/16 02:35:26 The SCHEDD (pid 1728616) exited with status 44 ... 10/17/16 02:35:48 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory 10/17/16 02:35:48 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 1739389 10/17/16 02:35:48 Collector port not defined, will use default: 9618 10/17/16 02:35:48 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 1739390 10/17/16 02:35:48 DefaultReaper unexpectedly called on pid 1739389, status 11264. 10/17/16 02:35:48 The SCHEDD (pid 1739389) exited with status 44 10/17/16 02:35:48 Sending obituary for "/usr/sbin/condor_schedd" 10/17/16 02:35:49 restarting /usr/sbin/condor_schedd in 13 seconds 10/17/16 02:35:49 DefaultReaper unexpectedly called on pid 1739390, status 11264. 10/17/16 02:35:49 The SHARED_PORT (pid 1739390) exited with status 44 10/17/16 02:35:49 Sending obituary for "/usr/libexec/condor/condor_shared_port" 10/17/16 02:35:49 restarting /usr/libexec/condor/condor_shared_port in 13 seconds 10/17/16 02:36:02 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory 10/17/16 02:36:02 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 1739409 10/17/16 02:36:02 Collector port not defined, will use default: 9618 10/17/16 02:36:02 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 1739410 10/17/16 02:36:02 DefaultReaper unexpectedly called on pid 1739409, status 11264. 10/17/16 02:36:02 The SCHEDD (pid 1739409) exited with status 44 10/17/16 02:36:02 Sending obituary for "/usr/sbin/condor_schedd" 10/17/16 02:36:02 restarting /usr/sbin/condor_schedd in 17 seconds 10/17/16 02:36:02 DefaultReaper unexpectedly called on pid 1739410, status 11264. 10/17/16 02:36:02 The SHARED_PORT (pid 1739410) exited with status 44 10/17/16 02:36:02 Sending obituary for "/usr/libexec/condor/condor_shared_port" 10/17/16 02:36:02 restarting /usr/libexec/condor/condor_shared_port in 17 seconds 10/17/16 02:36:19 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory 10/17/16 02:36:19 Started DaemonCore process "/usr/sbin/condor_schedd", pid 10/17/16 10:21:02 ****************************************************** 10/17/16 10:21:02 ** condor_master (CONDOR_MASTER) STARTING UP [3] http://pages.cs.wisc.edu/~adesmet/status.html > shadow 44 DPRINTF_ERROR There is a fatal error with dprintf() > GRAM Error Codes 44 STAGING_STDIN the job manager failed to stage the stdin file
Attachment:
smime.p7s
Description: S/MIME Cryptographic Signature