Hi,

the schedd is showing non-standard behaviour. After a schedd restart, or even a reboot of the server, the shadows for the jobs that were already running are not respawned, and the condor_q command does not report any running job. The jobs keep running on the execution machines until the lease expires.

I have not been able to reproduce this behaviour on a test schedd instance with the same configuration.

Thanks in advance for any hint you would like to share with me.

Ale

The following messages come from the production schedd with the non-standard behaviour:

****
06/28/17 03:52:09 (pid:217464) Shadow pid 823969 for job 50113.0 exited with status 112
06/28/17 03:52:09 (pid:217464) Putting job 50113.0 on hold
06/28/17 05:11:02 (pid:1093541) Setting maximum file descriptors to 4096.
06/28/17 05:11:02 (pid:1093541) ******************************************************
06/28/17 05:11:02 (pid:1093541) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
06/28/17 05:11:02 (pid:1093541) ** /usr/sbin/condor_schedd
06/28/17 05:11:02 (pid:1093541) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
06/28/17 05:11:02 (pid:1093541) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
06/28/17 05:11:02 (pid:1093541) ** $CondorVersion: 8.4.6 Apr 20 2016 BuildID: 364106 $
06/28/17 05:11:02 (pid:1093541) ** $CondorPlatform: x86_64_RedHat6 $
06/28/17 05:11:02 (pid:1093541) ** PID = 1093541
06/28/17 05:11:02 (pid:1093541) ** Log last touched 6/28 03:52:09
06/28/17 05:11:02 (pid:1093541) ******************************************************
06/28/17 05:11:02 (pid:1093541) Using config source: /etc/condor/condor_config
06/28/17 05:11:02 (pid:1093541) Using local config sources:
06/28/17 05:11:02 (pid:1093541) /etc/condor/config.d/condor_config_base
06/28/17 05:11:02 (pid:1093541) /etc/condor/config.d/condor_config_history
06/28/17 05:11:02 (pid:1093541) /etc/condor/config.d/condor_config_jobs
06/28/17 05:11:02 (pid:1093541) /etc/condor/config.d/condor_config_scheduler
06/28/17 05:11:02 (pid:1093541) /etc/condor/config.d/condor_config_security
06/28/17 05:11:02 (pid:1093541) config Macros = 88, Sorted = 88, StringBytes = 3048, TablesBytes = 3248
06/28/17 05:11:02 (pid:1093541) CLASSAD_CACHING is ENABLED
06/28/17 05:11:02 (pid:1093541) Daemon Log is logging: D_ALWAYS D_ERROR
06/28/17 05:11:02 (pid:1093541) SharedPortEndpoint: waiting for connections to named socket 217453_9047_12
06/28/17 05:11:02 (pid:1093541) DaemonCore: command socket at <90.147.169.224:9618?addrs=90.147.169.224-9618&noUDP&sock=217453_9047_12>
06/28/17 05:11:02 (pid:1093541) DaemonCore: private command socket at <90.147.169.224:9618?addrs=90.147.169.224-9618&noUDP&sock=217453_9047_12>
06/28/17 05:11:02 (pid:1093541) History file rotation is enabled.
06/28/17 05:11:02 (pid:1093541) Maximum history file size is: 1073741824 bytes
06/28/17 05:11:02 (pid:1093541) Number of rotated history files is: 365
06/28/17 05:11:02 (pid:1093541) Failed to execute /usr/sbin/condor_shadow.std, ignoring
06/28/17 05:11:37 (pid:1093541) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
06/28/17 05:11:39 (pid:1093541) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
06/28/17 05:11:39 (pid:1093541) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
06/28/17 05:11:39 (pid:1093541) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
****

The following messages come from the test schedd with the standard behaviour (shadows are respawned; in this example the schedd received a "kill -9"):

****
06/28/17 11:38:10 (pid:2206) Number of Active Workers 0
06/28/17 11:38:21 (pid:11089) Setting maximum file descriptors to 4096.
06/28/17 11:38:21 (pid:11089) ******************************************************
06/28/17 11:38:21 (pid:11089) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
06/28/17 11:38:21 (pid:11089) ** /usr/sbin/condor_schedd
06/28/17 11:38:21 (pid:11089) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
06/28/17 11:38:21 (pid:11089) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
06/28/17 11:38:21 (pid:11089) ** $CondorVersion: 8.4.9 Sep 29 2016 BuildID: 382747 $
06/28/17 11:38:21 (pid:11089) ** $CondorPlatform: x86_64_RedHat6 $
06/28/17 11:38:21 (pid:11089) ** PID = 11089
06/28/17 11:38:21 (pid:11089) ** Log last touched 6/28 11:38:10
06/28/17 11:38:21 (pid:11089) ******************************************************
06/28/17 11:38:21 (pid:11089) Using config source: /etc/condor/condor_config
06/28/17 11:38:21 (pid:11089) Using local config sources:
06/28/17 11:38:21 (pid:11089) /etc/condor/config.d/condor_config_base
06/28/17 11:38:21 (pid:11089) /etc/condor/config.d/condor_config_history
06/28/17 11:38:21 (pid:11089) /etc/condor/config.d/condor_config_jobs
06/28/17 11:38:21 (pid:11089) /etc/condor/config.d/condor_config_scheduler
06/28/17 11:38:21 (pid:11089) /etc/condor/config.d/condor_config_security
06/28/17 11:38:21 (pid:11089) /etc/condor/config.d/condor_config_sub_expr
06/28/17 11:38:21 (pid:11089) config Macros = 80, Sorted = 80, StringBytes = 2537, TablesBytes = 2968
06/28/17 11:38:21 (pid:11089) CLASSAD_CACHING is ENABLED
06/28/17 11:38:21 (pid:11089) Daemon Log is logging: D_ALWAYS D_ERROR
06/28/17 11:38:22 (pid:11089) SharedPortEndpoint: waiting for connections to named socket 2156_baae_5
06/28/17 11:38:22 (pid:11089) DaemonCore: command socket at <90.147.168.55:9618?addrs=90.147.168.55-9618&noUDP&sock=2156_baae_5>
06/28/17 11:38:22 (pid:11089) DaemonCore: private command socket at <90.147.168.55:9618?addrs=90.147.168.55-9618&noUDP&sock=2156_baae_5>
06/28/17 11:38:22 (pid:11089) History file rotation is enabled.
06/28/17 11:38:22 (pid:11089) Maximum history file size is: 1073741824 bytes
06/28/17 11:38:22 (pid:11089) Number of rotated history files is: 365
06/28/17 11:38:22 (pid:11089) Failed to execute /usr/sbin/condor_shadow.std, ignoring
06/28/17 11:38:22 (pid:11089) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
06/28/17 11:38:22 (pid:11089) Starting add_shadow_birthdate(289.0)
06/28/17 11:38:22 (pid:11089) Started shadow for job 289.0 on <90.147.168.249:60611> for group_cms.local.italiano, (shadow pid = 11092)
06/28/17 11:38:22 (pid:11089) Starting add_shadow_birthdate(291.0)
06/28/17 11:38:22 (pid:11089) Started shadow for job 291.0 on <90.147.169.78:44253> for group_cms.local.italiano, (shadow pid = 11095)
06/28/17 11:38:22 (pid:11089) Starting add_shadow_birthdate(290.0)
06/28/17 11:38:22 (pid:11089) Started shadow for job 290.0 on <90.147.169.168:49712> for group_cms.local.italiano, (shadow pid = 11098)
06/28/17 11:38:22 (pid:11089) Starting add_shadow_birthdate(292.0)
06/28/17 11:38:22 (pid:11089) Started shadow for job 292.0 on <90.147.168.147:41189> for group_cms.local.italiano, (shadow pid = 11101)
06/28/17 11:38:27 (pid:11089) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
06/28/17 11:38:27 (pid:11089) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
06/28/17 11:38:27 (pid:11089) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
****
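
To compare the two excerpts side by side, here is a minimal Python sketch that summarizes the most recent schedd startup found in a log text: the reported version, the local config files listed, and the jobs for which a "Started shadow for job" line appears. The function name `summarize_startup` and the regular expressions are my own assumptions, based only on the log lines shown in this message; they are not part of HTCondor and may need adjusting for other log formats.

```python
import re

# Assumed line layout, taken from the excerpts in this message:
#   MM/DD/YY HH:MM:SS (pid:N) message
LINE_RE = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \(pid:(\d+)\)\s*(.*)$")
VERSION_RE = re.compile(r"\$CondorVersion: (\S+)")
SHADOW_RE = re.compile(r"Started shadow for job (\d+\.\d+)")


def summarize_startup(log_text):
    """Summarize the last schedd startup found in log_text (hypothetical helper)."""
    summary = {"pid": None, "version": None, "config_files": [], "shadows_started": []}
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue
        pid, msg = match.group(1), match.group(2).strip()
        if "STARTING UP" in msg:
            # A new startup banner: discard whatever was collected before it.
            summary = {"pid": pid, "version": None, "config_files": [], "shadows_started": []}
            continue
        if pid != summary["pid"]:
            # Line belongs to an earlier schedd process; skip it.
            continue
        version = VERSION_RE.search(msg)
        if version:
            summary["version"] = version.group(1)
        elif msg.startswith("/"):
            # Assumption: bare absolute paths in the startup section are config sources.
            summary["config_files"].append(msg)
        else:
            shadow = SHADOW_RE.search(msg)
            if shadow:
                summary["shadows_started"].append(shadow.group(1))
    return summary
```

Running it on the SchedLog of each machine and comparing the two dictionaries shows whether the version, the list of config files, and the set of respawned shadows match between the production and the test schedd.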