Hi Todd,

thanks for the info!

Judging from the periodicity in the journal, it looks very much like it
could be the bug [1] ;)

Unlike on the recently spawned sched with condor-8.6.8-2, I do not see
the issue on a sibling with condor-8.6.8-1. However, the working sched
has recently had its systemd packages updated to
systemd-*-219-42.el7_4.10 as well - but I have not rebooted it since to
pick them up... So there might also be some dependency on the systemd
version involved??

Anyway, I updated the machine to 8.6.10-1 and will keep an eye on it ;)

Cheers and thanks,
  Thomas

[1]
Mar 26 11:11:12 grid-vm08.desy.de systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 26 11:11:12 grid-vm08.desy.de systemd[1]: Starting Condor Distributed High-Throughput-Computing...
Mar 26 11:38:54 grid-vm08.desy.de systemd[1]: Stopping Condor Distributed High-Throughput-Computing...
Mar 26 11:38:54 grid-vm08.desy.de systemd[1]: Stopped Condor Distributed High-Throughput-Computing.
Mar 26 12:11:22 grid-vm08.desy.de systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 26 12:11:22 grid-vm08.desy.de systemd[1]: Starting Condor Distributed High-Throughput-Computing...
Mar 26 12:38:56 grid-vm08.desy.de systemd[1]: Stopping Condor Distributed High-Throughput-Computing...
Mar 26 12:38:56 grid-vm08.desy.de systemd[1]: Stopped Condor Distributed High-Throughput-Computing.

On 2018-03-26 15:26, Todd Tannenbaum wrote:
>
>
> On Mar 26, 2018, at 6:57 AM, Thomas Hartmann <thomas.hartmann@xxxxxxx
> <mailto:thomas.hartmann@xxxxxxx>> wrote:
>
>> Hi all,
>>
>> I have spawned a fresh scheduler [1] whose daemons seem to always
>> getting killed shortly after they got created during a reboot. AFAIS,
>> the daemons [2,3] get a SIGQUIT from the daemon core [4] - however, I do
>> not get, why it triggered the actual shutdown [5].
>>
>> After manually (re)starting condor's service, the daemons are running
>> stable, so I wonder, why they got killed reproducible after their first
>> start following reboots?
>>
>> Cheers and thanks for ideas,
>>  Thomas
>
> Hi Thomas,
>
> Something is sending the condor_master a SIGQUIT signal, which results
> in the master shutting down everything.
>
> I wonder if you are being hit by this bug which was fixed in HTCondor
> v8.6.9:
>
>    https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6476
>
> In v8.6.8 and earlier, systemd would send a sigquit to the master 20
> minutes (by default) after either a condor_restart or after the
> condor_master binary was touched/changed.  To confirm It would be
> useful to see more of your MasterLog, esp for 25 minutes before it
> receives the SIGQUIT. And/or check your systemd logs. Or just upgrade
> and see if it goes away :)
>
> Best regards,
> Todd
>
>
>>
>> [1]
>> condor-external-libs-8.6.8-2.el7.x86_64
>> condor-python-8.6.8-2.el7.x86_64
>> condor-8.6.8-2.el7.x86_64
>> condor-classads-8.6.8-2.el7.x86_64
>> condor-procd-8.6.8-2.el7.x86_64
>>
>> [2]
>>> Master aka PID:2358
>>>> MasterLog
>> ...
>> 03/26/18 11:08:06 Started DaemonCore process
>> "/usr/libexec/condor/condor_defrag", pid and pgroup = 2406
>> 03/26/18 11:08:47 Got SIGQUIT.  Performing fast shutdown.
>> 03/26/18 11:08:47 Sent SIGQUIT to DEFRAG (pid 2406)
>> 03/26/18 11:08:47 Sent SIGQUIT to SCHEDD (pid 2405)
>> 03/26/18 11:08:47 AllReaper unexpectedly called on pid 2405, status 0.
>> 03/26/18 11:08:47 The SCHEDD (pid 2405) exited with status 0
>> 03/26/18 11:08:47 AllReaper unexpectedly called on pid 2406, status 0.
>> 03/26/18 11:08:47 The DEFRAG (pid 2406) exited with status 0
>> 03/26/18 11:08:47 Sent SIGTERM to SHARED_PORT (pid 2398)
>> 03/26/18 11:08:47 AllReaper unexpectedly called on pid 2398, status 0.
>> 03/26/18 11:08:47 The SHARED_PORT (pid 2398) exited with status 0
>> 03/26/18 11:08:47 All daemons are gone.  Exiting.
>> 03/26/18 11:08:47 **** condor_master (condor_MASTER) pid 2358 EXITING
>> WITH STATUS 0
>>
>> [3]
>> Sched aka PID:2405
>>>> SchedLog
>> ...
>> 03/26/18 11:08:12 (pid:2405) TransferQueueManager download 1m I/O load:
>> 0 bytes/s  0.000 disk load  0.000 net load
>> 03/26/18 11:08:47 (pid:2405) Got SIGQUIT.  Performing fast shutdown.
>> 03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
>> 03/26/18 11:08:47 (pid:2405) All shadows have been killed, exiting.
>> 03/26/18 11:08:47 (pid:2405) **** condor_schedd (condor_SCHEDD) pid 2405
>> EXITING WITH STATUS 0
>> 03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
>> 03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs
>> 03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
>> 03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs
>>
>> [4]
>>> ProcLog
>> ...
>> 03/26/18 11:08:05 : no methods have determined process 2131 to be in a
>> monitored family
>> 03/26/18 11:08:05 : ...snapshot complete
>> 03/26/18 11:08:05 : PROC_FAMILY_REGISTER_SUBFAMILY
>> 03/26/18 11:08:05 : taking a snapshot...
>> 03/26/18 11:08:05 : method PARENT: found family 2358 for process 2398
>> 03/26/18 11:08:05 : method PARENT: found family 2358 for process 2398
>> (already determined)
>> 03/26/18 11:08:05 : ...snapshot complete
>> 03/26/18 11:08:05 : moving process 2398 into new subfamily 2398
>> 03/26/18 11:08:05 : new subfamily registered: root = 2398, watcher = 2358
>> 03/26/18 11:08:05 : PROC_FAMILY_TRACK_FAMILY_VIA_ENVIRONMENT
>> 03/26/18 11:08:06 : PROC_FAMILY_REGISTER_SUBFAMILY
>> 03/26/18 11:08:06 : taking a snapshot...
>> 03/26/18 11:08:06 : method PARENT: found family 2358 for process 2405
>> 03/26/18 11:08:06 : method PARENT: found family 2358 for process 2405
>> (already determined)
>> ...
>> 03/26/18 11:08:06 : PROC_FAMILY_TRACK_FAMILY_VIA_ENVIRONMENT
>> 03/26/18 11:08:47 : PROC_FAMILY_KILL_FAMILY
>> 03/26/18 11:08:47 : taking a snapshot...
>> 03/26/18 11:08:47 : process 2406 (of family 2406) has exited
>> 03/26/18 11:08:47 : process 2405 (of family 2405) has exited
>> 03/26/18 11:08:47 : process 1982 (not in monitored family) has exited
>> 03/26/18 11:08:47 : process 1763 (not in monitored family) has exited
>> 03/26/18 11:08:47 : process 1738 (not in monitored family) has exited
>> 03/26/18 11:08:47 : process 1310 (not in monitored family) has exited
>> 03/26/18 11:08:47 : process 542 (not in monitored family) has exited
>> 03/26/18 11:08:47 : no methods have determined process 2413 to be in a
>> monitored family
>> 03/26/18 11:08:47 : no methods have determined process 2416 to be in a
>> monitored family
>> 03/26/18 11:08:47 : no methods have determined process 2417 to be in a
>> monitored family
>> 03/26/18 11:08:47 : no methods have determined process 2697 to be in a
>> monitored family
>> 03/26/18 11:08:47 : no methods have determined process 2745 to be in a
>> monitored family
>> 03/26/18 11:08:47 : no methods have determined process 2964 to be in a
>> monitored family
>> 03/26/18 11:08:47 : no methods have determined process 3004 to be in a
>> monitored family
>> 03/26/18 11:08:47 : ...snapshot complete
>> 03/26/18 11:08:47 : sending signal 9 to family with root 2405
>> 03/26/18 11:08:47 : PROC_FAMILY_UNREGISTER_FAMILY
>> 03/26/18 11:08:47 : unregistering family with root pid 2405
>> 03/26/18 11:08:47 : PROC_FAMILY_KILL_FAMILY
>> 03/26/18 11:08:47 : taking a snapshot...
>> 03/26/18 11:08:47 : ...snapshot complete
>> 03/26/18 11:08:47 : sending signal 9 to family with root 2406
>> 03/26/18 11:08:47 : PROC_FAMILY_UNREGISTER_FAMILY
>> 03/26/18 11:08:47 : unregistering family with root pid 2406
>> 03/26/18 11:08:47 : PROC_FAMILY_KILL_FAMILY
>> 03/26/18 11:08:47 : taking a snapshot...
>> 03/26/18 11:08:47 : process 2398 (of family 2398) has exited
>> 03/26/18 11:08:47 : ...snapshot complete
>> 03/26/18 11:08:47 : sending signal 9 to family with root 2398
>> 03/26/18 11:08:47 : PROC_FAMILY_QUIT
>>
>> [5]
>>> Sched aka PID:2405
>>>> grep "2405" ./*
>> ./MasterLog:03/26/18 11:08:06 Started DaemonCore process
>> "/usr/sbin/condor_schedd", pid and pgroup = 2405
>> ./MasterLog:03/26/18 11:08:47 Sent SIGQUIT to SCHEDD (pid 2405)
>> ./MasterLog:03/26/18 11:08:47 AllReaper unexpectedly called on pid 2405,
>> status 0.
>> ./MasterLog:03/26/18 11:08:47 The SCHEDD (pid 2405) exited with status 0
>> ./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
>> process 2405
>> ./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
>> process 2405 (already determined)
>> ./ProcLog:03/26/18 11:08:06 : moving process 2405 into new subfamily 2405
>> ./ProcLog:03/26/18 11:08:06 : new subfamily registered: root = 2405,
>> watcher = 2358
>> ./ProcLog:03/26/18 11:08:47 : process 2405 (of family 2405) has exited
>> ./ProcLog:03/26/18 11:08:47 : sending signal 9 to family with root 2405
>> ./ProcLog:03/26/18 11:08:47 : unregistering family with root pid 2405
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) Setting maximum file descriptors
>> to 4096.
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)
>> ******************************************************
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** condor_schedd (CONDOR_SCHEDD)
>> STARTING UP
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** /usr/sbin/condor_schedd
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** SubsystemInfo: name=SCHEDD
>> type=SCHEDD(5) class=DAEMON(1)
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** Configuration:
>> subsystem:SCHEDD local:<NONE> class:DAEMON
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** $CondorVersion: 8.6.8 Nov 13
>> 2017 BuildID: 424045 $
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** $CondorPlatform:
>> x86_64_RedHat7 $
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** PID = 2405
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) ** Log last touched 3/26 11:06:38
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)
>> ******************************************************
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) Using config source:
>> /etc/condor/condor_config
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) Using local config sources:
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)
>> /etc/condor/config.d/00arc_ce.conf
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)
>> /etc/condor/config.d/02submitd.conf
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)
>> /etc/condor/config.d/04defragd.conf
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)    /etc/condor/condor_config.local
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) config Macros = 98, Sorted = 98,
>> StringBytes = 4110, TablesBytes = 3600
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) CLASSAD_CACHING is ENABLED
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) Daemon Log is logging: D_ALWAYS
>> D_ERROR
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) SharedPortEndpoint: waiting for
>> connections to named socket 2358_f868_3
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: command socket at
>> <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: private command
>> socket at
>> <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) History file rotation is enabled.
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)   Maximum history file size is:
>> 50000000 bytes
>> ./SchedLog:03/26/18 11:08:06 (pid:2405)   Number of rotated history
>> files is: 5
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) my_popenv: Failed to exec in
>> child, errno=2 (No such file or directory)
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) Failed to execute
>> /usr/sbin/condor_shadow.std, ignoring
>> ./SchedLog:03/26/18 11:08:12 (pid:2405) TransferQueueManager stats:
>> active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
>> ./SchedLog:03/26/18 11:08:12 (pid:2405) TransferQueueManager upload 1m
>> I/O load: 0 bytes/s  0.000 disk load  0.000 net load
>> ./SchedLog:03/26/18 11:08:12 (pid:2405) TransferQueueManager download 1m
>> I/O load: 0 bytes/s  0.000 disk load  0.000 net load
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) Got SIGQUIT.  Performing fast
>> shutdown.
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) All shadows have been killed,
>> exiting.
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) **** condor_schedd
>> (condor_SCHEDD) pid 2405 EXITING WITH STATUS 0
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) Cron: Killing all jobs
>> ./SchedLog:03/26/18 11:08:47 (pid:2405) CronJobList: Deleting all jobs
>>
>> [6]
>>> Master aka PID:2358
>>>> grep 2358 ./*
>> ./MasterLog:03/26/18 11:08:04 ** PID = 2358
>> ./MasterLog:03/26/18 11:08:05 SharedPortEndpoint: waiting for
>> connections to named socket 2358_f868
>> ./MasterLog:03/26/18 11:08:05 DaemonCore: private command socket at
>> <131.169.223.234:0?sock=2358_f868>
>> ./MasterLog:03/26/18 11:08:47 **** condor_master (condor_MASTER) pid
>> 2358 EXITING WITH STATUS 0
>> ./ProcLog:03/26/18 11:08:05 : Procd has a watcher pid and will die if
>> pid 2358 dies.
>> ./ProcLog:03/26/18 11:08:05 : method PID: found family 2358 for
>> process 2358
>> ./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for
>> process 2397
>> ./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for
>> process 2397 (already determined)
>> ./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for
>> process 2398
>> ./ProcLog:03/26/18 11:08:05 : method PARENT: found family 2358 for
>> process 2398 (already determined)
>> ./ProcLog:03/26/18 11:08:05 : new subfamily registered: root = 2398,
>> watcher = 2358
>> ./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
>> process 2405
>> ./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
>> process 2405 (already determined)
>> ./ProcLog:03/26/18 11:08:06 : new subfamily registered: root = 2405,
>> watcher = 2358
>> ./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
>> process 2406
>> ./ProcLog:03/26/18 11:08:06 : method PARENT: found family 2358 for
>> process 2406 (already determined)
>> ./ProcLog:03/26/18 11:08:06 : new subfamily registered: root = 2406,
>> watcher = 2358
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) SharedPortEndpoint: waiting for
>> connections to named socket 2358_f868_3
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: command socket at
>> <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
>> ./SchedLog:03/26/18 11:08:06 (pid:2405) DaemonCore: private command
>> socket at
>> <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=2358_f868_3>
>>
>>
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> <mailto:htcondor-users-request@xxxxxxxxxxx> with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
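
PS: The periodicity in the journal excerpt [1] can be checked with a
small script. This is only a minimal sketch, assuming journal lines in
exactly the "Mon DD HH:MM:SS host systemd[1]: message" layout shown in
[1]; the function name uptimes, the MONTHS table and the hard-coded year
are made up for illustration and are not part of HTCondor or systemd.

```python
from datetime import datetime

# Journal lines copied from [1]; only the Started/Stopping pairs are needed.
JOURNAL = """\
Mar 26 11:11:12 grid-vm08.desy.de systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 26 11:38:54 grid-vm08.desy.de systemd[1]: Stopping Condor Distributed High-Throughput-Computing...
Mar 26 12:11:22 grid-vm08.desy.de systemd[1]: Started Condor Distributed High-Throughput-Computing.
Mar 26 12:38:56 grid-vm08.desy.de systemd[1]: Stopping Condor Distributed High-Throughput-Computing...
"""

# Explicit month table so parsing does not depend on the system locale.
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def uptimes(journal_text, year=2018):
    """Return the seconds between each 'Started' and the next 'Stopping'.

    Assumption: the journal prints no year, so it is passed in by hand.
    """
    result = []
    started = None
    for line in journal_text.splitlines():
        parts = line.split(None, 3)  # month, day, time, rest of the line
        if len(parts) < 4 or parts[0] not in MONTHS:
            continue
        hour, minute, second = (int(x) for x in parts[2].split(":"))
        stamp = datetime(year, MONTHS[parts[0]], int(parts[1]),
                         hour, minute, second)
        message = parts[3]
        if "]: Started " in message:
            started = stamp
        elif "]: Stopping " in message and started is not None:
            result.append(int((stamp - started).total_seconds()))
            started = None
    return result


if __name__ == "__main__":
    for seconds in uptimes(JOURNAL):
        print(f"service was up for {seconds // 60} min {seconds % 60} s")
```

For the two cycles in [1] this gives an uptime of roughly 27.5 minutes
each before systemd stops the service again.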