Hi all,

I have spawned a fresh scheduler [1] whose daemons always seem to get killed shortly after they are created during a reboot.

As far as I can see, the daemons [2,3] get a SIGQUIT from the daemon core [4]; however, I do not understand what triggered the actual shutdown [5].

After manually (re)starting Condor's service, the daemons run stably, so I wonder why they are reproducibly killed after their first start following a reboot.

Cheers and thanks for ideas,
  Thomas

[1]
condor-external-libs-8.6.8-2.el7.x86_64
condor-python-8.6.8-2.el7.x86_64
condor-8.6.8-2.el7.x86_64
condor-classads-8.6.8-2.el7.x86_64
condor-procd-8.6.8-2.el7.x86_64

[2]
> Master aka PID:2358 >> MasterLog
...
03/26/18 11:08:06 Started DaemonCore process "/usr/libexec/condor/condor_defrag", pid and pgroup = 2406
03/26/18 11:08:47 Got SIGQUIT. Performing fast shutdown.
03/26/18 11:08:47 Sent SIGQUIT to DEFRAG (pid 2406)
03/26/18 11:08:47 Sent SIGQUIT to SCHEDD (pid 2405)
03/26/18 11:08:47 AllReaper unexpectedly called on pid 2405, status 0.
03/26/18 11:08:47 The SCHEDD (pid 2405) exited with status 0
03/26/18 11:08:47 AllReaper unexpectedly called on pid 2406, status 0.
03/26/18 11:08:47 The DEFRAG (pid 2406) exited with status 0
03/26/18 11:08:47 Sent SIGTERM to SHARED_PORT (pid 2398)
03/26/18 11:08:47 AllReaper unexpectedly called on pid 2398, status 0.
03/26/18 11:08:47 The SHARED_PORT (pid 2398) exited with status 0
03/26/18 11:08:47 All daemons are gone. Exiting.
03/26/18 11:08:47 **** condor_master (condor_MASTER) pid 2358 EXITING WITH STATUS 0

[3]
> Sched aka PID:2405 >> SchedLog
...
03/26/18 11:08:12 (pid:2405) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
03/26/18 11:08:47 (pid:2405) Got SIGQUIT. Performing fast shutdown.
03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
03/26/18 11:08:47 (pid:2405) All shadows have been killed, exiting.
03/26/18 11:08:47 (pid:2405) **** condor_schedd (condor_SCHEDD) pid 2405 EXITING WITH STATUS 0
03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs
03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs

[4]
> ProcLog
...
03/26/18 11:08:05 : no methods have determined process 2131 to be in a monitored family
03/26/18 11:08:05 : ...snapshot complete
03/26/18 11:08:05 : PROC_FAMILY_REGISTER_SUBFAMILY
03/26/18 11:08:05 : taking a snapshot...
03/26/18 11:08:05 : method PARENT: found family 2358 for process 2398
03/26/18 11:08:05 : method PARENT: found family 2358 for process 2398 (already determined)
03/26/18 11:08:05 : ...snapshot complete
03/26/18 11:08:05 : moving process 2398 into new subfamily 2398
03/26/18 11:08:05 : new subfamily registered: root = 2398, watcher = 2358
03/26/18 11:08:05 : PROC_FAMILY_TRACK_FAMILY_VIA_ENVIRONMENT
03/26/18 11:08:06 : PROC_FAMILY_REGISTER_SUBFAMILY
03/26/18 11:08:06 : taking a snapshot...
03/26/18 11:08:06 : method PARENT: found family 2358 for process 2405
03/26/18 11:08:06 : method PARENT: found family 2358 for process 2405 (already determined)
...
03/26/18 11:08:06 : PROC_FAMILY_TRACK_FAMILY_VIA_ENVIRONMENT
03/26/18 11:08:47 : PROC_FAMILY_KILL_FAMILY
03/26/18 11:08:47 : taking a snapshot...
03/26/18 11:08:47 : process 2406 (of family 2406) has exited
03/26/18 11:08:47 : process 2405 (of family 2405) has exited
03/26/18 11:08:47 : process 1982 (not in monitored family) has exited
03/26/18 11:08:47 : process 1763 (not in monitored family) has exited
03/26/18 11:08:47 : process 1738 (not in monitored family) has exited
03/26/18 11:08:47 : process 1310 (not in monitored family) has exited
03/26/18 11:08:47 : process 542 (not in monitored family) has exited
03/26/18 11:08:47 : no methods have determined process 2413 to be in a monitored family
03/26/18 11:08:47 : no methods have determined process 2416 to be in a monitored family
03/26/18 11:08:47 : no methods have determined process 2417 to be in a monitored family
03/26/18 11:08:47 : no methods have determined process 2697 to be in a monitored family
03/26/18 11:08:47 : no methods have determined process 2745 to be in a monitored family
03/26/18 11:08:47 : no methods have determined process 2964 to be in a monitored family
03/26/18 11:08:47 : no methods have determined process 3004 to be in a monitored family
03/26/18 11:08:47 : ...snapshot complete
03/26/18 11:08:47 : sending signal 9 to family with root 2405
03/26/18 11:08:47 : PROC_FAMILY_UNREGISTER_FAMILY
03/26/18 11:08:47 : unregistering family with root pid 2405
03/26/18 11:08:47 : PROC_FAMILY_KILL_FAMILY
03/26/18 11:08:47 : taking a snapshot...
03/26/18 11:08:47 : ...snapshot complete
03/26/18 11:08:47 : sending signal 9 to family with root 2406
03/26/18 11:08:47 : PROC_FAMILY_UNREGISTER_FAMILY
03/26/18 11:08:47 : unregistering family with root pid 2406
03/26/18 11:08:47 : PROC_FAMILY_KILL_FAMILY
03/26/18 11:08:47 : taking a snapshot...
03/26/18 11:08:47 : process 2398 (of family 2398) has exited
03/26/18 11:08:47 : ...snapshot complete
03/26/18 11:08:47 : sending signal 9 to family with root 2398
03/26/18 11:08:47 : PROC_FAMILY_QUIT

[5]
> Sched aka PID:2405 >> grep "2405" ./*
./MasterLog:03/26/18 11:08:06 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 2405
./MasterLog:03/26/18 11:08:47 Sent SIGQUIT to SCHEDD (pid 2405)
./MasterLog:03/26/18 11:08:47 AllReaper unexpectedly called on pid 2405, status 0.
./MasterLog:03/26/18 11:08:47 The SCHEDD (pid 2405) exited with status 0
./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for process 2405
./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for process 2405 (already determined)
./ProcLog:03/26/18 11:08:06 : moving process 2405 into new subfamily 2405
./ProcLog:03/26/18 11:08:06 : new subfamily registered: root = 2405, watcher = 2358
./ProcLog:03/26/18 11:08:47 : process 2405 (of family 2405) has exited
./ProcLog:03/26/18 11:08:47 : sending signal 9 to family with root 2405
./ProcLog:03/26/18 11:08:47 : unregistering family with root pid 2405
./SchedLog:03/26/18 11:08:06 (pid:2405) Setting maximum file descriptors to 4096.
./SchedLog:03/26/18 11:08:06 (pid:2405) ******************************************************
./SchedLog:03/26/18 11:08:06 (pid:2405) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
./SchedLog:03/26/18 11:08:06 (pid:2405) ** /usr/sbin/condor_schedd
./SchedLog:03/26/18 11:08:06 (pid:2405) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
./SchedLog:03/26/18 11:08:06 (pid:2405) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
./SchedLog:03/26/18 11:08:06 (pid:2405) ** $CondorVersion: 8.6.8 Nov 13 2017 BuildID: 424045 $
./SchedLog:03/26/18 11:08:06 (pid:2405) ** $CondorPlatform: x86_64_RedHat7 $
./SchedLog:03/26/18 11:08:06 (pid:2405) ** PID = 2405
./SchedLog:03/26/18 11:08:06 (pid:2405) ** Log last touched 3/26 11:06:38
./SchedLog:03/26/18 11:08:06 (pid:2405) ******************************************************
./SchedLog:03/26/18 11:08:06 (pid:2405) Using config source: /etc/condor/condor_config
./SchedLog:03/26/18 11:08:06 (pid:2405) Using local config sources:
./SchedLog:03/26/18 11:08:06 (pid:2405) /etc/condor/config.d/00arc_ce.conf
./SchedLog:03/26/18 11:08:06 (pid:2405) /etc/condor/config.d/02submitd.conf
./SchedLog:03/26/18 11:08:06 (pid:2405) /etc/condor/config.d/04defragd.conf
./SchedLog:03/26/18 11:08:06 (pid:2405) /etc/condor/condor_config.local
./SchedLog:03/26/18 11:08:06 (pid:2405) config Macros = 98, Sorted = 98, StringBytes = 4110, TablesBytes = 3600
./SchedLog:03/26/18 11:08:06 (pid:2405) CLASSAD_CACHING is ENABLED
./SchedLog:03/26/18 11:08:06 (pid:2405) Daemon Log is logging: D_ALWAYS D_ERROR
./SchedLog:03/26/18 11:08:06 (pid:2405) SharedPortEndpoint: waiting for connections to named socket 2358_f868_3
./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: command socket at <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: private command socket at <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
./SchedLog:03/26/18 11:08:06 (pid:2405) History file rotation is enabled.
./SchedLog:03/26/18 11:08:06 (pid:2405) Maximum history file size is: 50000000 bytes
./SchedLog:03/26/18 11:08:06 (pid:2405) Number of rotated history files is: 5
./SchedLog:03/26/18 11:08:06 (pid:2405) my_popenv: Failed to exec in child, errno=2 (No such file or directory)
./SchedLog:03/26/18 11:08:06 (pid:2405) Failed to execute /usr/sbin/condor_shadow.std, ignoring
./SchedLog:03/26/18 11:08:12 (pid:2405) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
./SchedLog:03/26/18 11:08:12 (pid:2405) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
./SchedLog:03/26/18 11:08:12 (pid:2405) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
./SchedLog:03/26/18 11:08:47 (pid:2405) Got SIGQUIT. Performing fast shutdown.
./SchedLog:03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
./SchedLog:03/26/18 11:08:47 (pid:2405) All shadows have been killed, exiting.
./SchedLog:03/26/18 11:08:47 (pid:2405) **** condor_schedd (condor_SCHEDD) pid 2405 EXITING WITH STATUS 0
./SchedLog:03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
./SchedLog:03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs
./SchedLog:03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
./SchedLog:03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs

[6]
> Master aka PID:2358 >> grep 2358 ./*
./MasterLog:03/26/18 11:08:04 ** PID = 2358
./MasterLog:03/26/18 11:08:05 SharedPortEndpoint: waiting for connections to named socket 2358_f868
./MasterLog:03/26/18 11:08:05 DaemonCore: private command socket at <131.169.223.234:0?sock=2358_f868>
./MasterLog:03/26/18 11:08:47 **** condor_master (condor_MASTER) pid 2358 EXITING WITH STATUS 0
./ProcLog:03/26/18 11:08:05 : Procd has a watcher pid and will die if pid 2358 dies.
./ProcLog:03/26/18 11:08:05 : method PID: found family 2358 for process 2358
./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for process 2397
./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for process 2397 (already determined)
./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for process 2398
./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for process 2398 (already determined)
./ProcLog:03/26/18 11:08:05 : new subfamily registered: root = 2398, watcher = 2358
./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for process 2405
./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for process 2405 (already determined)
./ProcLog:03/26/18 11:08:06 : new subfamily registered: root = 2405, watcher = 2358
./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for process 2406
./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for process 2406 (already determined)
./ProcLog:03/26/18 11:08:06 : new subfamily registered: root = 2406, watcher = 2358
./SchedLog:03/26/18 11:08:06 (pid:2405) SharedPortEndpoint: waiting for connections to named socket 2358_f868_3
./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: command socket at <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: private command socket at <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
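
To see who signalled whom first, the three logs can be interleaved into a single timeline. The following is only a minimal sketch, not part of HTCondor: the function names (parse, merge_timeline, signal_events) are hypothetical, and it assumes raw log lines that start with the "MM/DD/YY HH:MM:SS" stamp seen in the excerpts above (so not the grep output, which carries a "./MasterLog:" prefix). Since the stamps only have one-second resolution, lines from different files within the same second cannot be ordered against each other; the sketch merely keeps the per-file order.

```python
import re
from datetime import datetime

# Timestamp layout as seen in the excerpts: "MM/DD/YY HH:MM:SS".
# ProcLog puts " : " after the stamp, so an optional colon is skipped.
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s*(?::\s*)?(.*)$")

# Only the signal wordings that appear in the excerpts are matched here.
SIGNAL = re.compile(r"\b(SIGQUIT|SIGTERM|signal \d+)\b")


def parse(source, lines):
    """Yield (timestamp, source, message) for every line that starts with a stamp."""
    for line in lines:
        match = STAMP.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
            yield when, source, match.group(2)


def merge_timeline(logs):
    """Merge {source name: list of lines} into one list ordered by timestamp.

    sorted() is stable, so lines sharing a second keep their per-file order.
    """
    events = [event for source, lines in logs.items() for event in parse(source, lines)]
    return sorted(events, key=lambda event: event[0])


def signal_events(timeline):
    """Keep only the entries whose message mentions a signal."""
    return [event for event in timeline if SIGNAL.search(event[2])]


if __name__ == "__main__":
    # A few lines copied from the excerpts above; real use would read the files.
    sample = {
        "MasterLog": [
            "03/26/18 11:08:47 Got SIGQUIT. Performing fast shutdown.",
            "03/26/18 11:08:47 Sent SIGQUIT to SCHEDD (pid 2405)",
        ],
        "SchedLog": [
            "03/26/18 11:08:47 (pid:2405) Got SIGQUIT. Performing fast shutdown.",
        ],
        "ProcLog": [
            "03/26/18 11:08:47 : sending signal 9 to family with root 2405",
        ],
    }
    for when, source, message in signal_events(merge_timeline(sample)):
        print(when.strftime("%H:%M:%S"), source, message)
```

Such a merge cannot show who sent the initial SIGQUIT to the master; it only makes it easier to compare the order of events across MasterLog, SchedLog and ProcLog.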