Hi,

Condor is failing to restart cleanly after node reboots on our dual-stack (IPv4 and IPv6) nodes. The issue appears to be the communication from the Shared Port Daemon back to the Master that started it. If I run `sudo systemctl restart condor` once I can log into the node, everything comes up cleanly, so I’m wondering if the Master is coming up before something that it needs.

Extracts of the MasterLog and SharedPortLog are below. This is with condor 8.8.13 on CentOS7.

Has anyone seen anything like this and/or know of a fix? I’m wondering if the first line of the MasterLog extract is significant.

Thanks,
Chris.

## MasterLog

06/08/21 10:37:49 init_local_hostname_impl: ipv6_getaddrinfo() returned EAI_AGAIN for 'heplnc001.pp.rl.ac.uk'. Will try again after sleeping 3 seconds (try 2 of 20).
06/08/21 10:37:49 ******************************************************
06/08/21 10:37:49 ** condor_master (CONDOR_MASTER) STARTING UP
06/08/21 10:37:49 ** /usr/sbin/condor_master
06/08/21 10:37:49 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
06/08/21 10:37:49 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
06/08/21 10:37:49 ** $CondorVersion: 8.8.13 Mar 23 2021 BuildID: 534541 PackageID: 8.8.13-1 $
06/08/21 10:37:49 ** $CondorPlatform: x86_64_CentOS7 $
06/08/21 10:37:49 ** PID = 1677
06/08/21 10:37:49 ** Log last touched 6/8 10:35:24
06/08/21 10:37:49 ******************************************************
06/08/21 10:37:49 Using config source: /etc/condor/condor_config
06/08/21 10:37:49 Using local config sources:
06/08/21 10:37:49 /etc/condor/config.d/00init.config
06/08/21 10:37:49 /etc/condor/config.d/01puppet_ssl.config
06/08/21 10:37:49 /etc/condor/config.d/02machines.config
06/08/21 10:37:49 /etc/condor/config.d/05security.config
06/08/21 10:37:49 /etc/condor/config.d/20wn_centos7.config
06/08/21 10:37:49 /etc/condor/config.d/25scaling.config
06/08/21 10:37:49 /etc/condor/config.d/27healthcheck.config
06/08/21 10:37:49 /etc/condor/config.d/28rebooter.config
06/08/21 10:37:49 /etc/condor/config.d/29start.config
06/08/21 10:37:49 /etc/condor/config.d/30start_jobtypes.config
06/08/21 10:37:49 /etc/condor/config.d/30start_multicore.config
06/08/21 10:37:49 /etc/condor/config.d/41shared_port.config
06/08/21 10:37:49 /etc/condor/condor_config.local
06/08/21 10:37:49 config Macros = 159, Sorted = 159, StringBytes = 7323, TablesBytes = 5868
06/08/21 10:37:49 CLASSAD_CACHING is OFF
06/08/21 10:37:49 Daemon Log is logging: D_ALWAYS D_ERROR
06/08/21 10:37:50 SharedPortEndpoint: waiting for connections to named socket 1677_dc69
06/08/21 10:37:50 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
06/08/21 10:37:50 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
06/08/21 10:37:50 DaemonCore: private command socket at <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>
06/08/21 10:37:50 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1616514849)
06/08/21 10:37:50 Starting shared port with port: 9618
06/08/21 10:37:50 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 1852
06/08/21 10:37:50 Waiting for /var/lock/condor/shared_port_ad to appear.
06/08/21 10:37:50 DefaultReaper unexpectedly called on pid 1852, status 1024.
06/08/21 10:37:50 The SHARED_PORT (pid 1852) exited with status 4
06/08/21 10:37:50 Sending obituary for "/usr/libexec/condor/condor_shared_port"
06/08/21 10:37:50 restarting /usr/libexec/condor/condor_shared_port in 10 seconds

## SharedPortLog

06/08/21 10:40:09 Setting maximum file descriptors to 4096.
06/08/21 10:40:09 ******************************************************
06/08/21 10:40:09 ** condor_shared_port (CONDOR_SHARED_PORT) STARTING UP
06/08/21 10:40:09 ** /usr/libexec/condor/condor_shared_port
06/08/21 10:40:09 ** SubsystemInfo: name=SHARED_PORT type=SHARED_PORT(11) class=DAEMON(1)
06/08/21 10:40:09 ** Configuration: subsystem:SHARED_PORT local:<NONE> class:DAEMON
06/08/21 10:40:09 ** $CondorVersion: 8.8.13 Mar 23 2021 BuildID: 534541 PackageID: 8.8.13-1 $
06/08/21 10:40:09 ** $CondorPlatform: x86_64_CentOS7 $
06/08/21 10:40:09 ** PID = 2997
06/08/21 10:40:09 ** Log last touched 6/8 10:40:08
06/08/21 10:40:09 ******************************************************
06/08/21 10:40:09 Using config source: /etc/condor/condor_config
06/08/21 10:40:09 Using local config sources:
06/08/21 10:40:09 /etc/condor/config.d/00init.config
06/08/21 10:40:09 /etc/condor/config.d/01puppet_ssl.config
06/08/21 10:40:09 /etc/condor/config.d/02machines.config
06/08/21 10:40:09 /etc/condor/config.d/05security.config
06/08/21 10:40:09 /etc/condor/config.d/20wn_centos7.config
06/08/21 10:40:09 /etc/condor/config.d/25scaling.config
06/08/21 10:40:09 /etc/condor/config.d/27healthcheck.config
06/08/21 10:40:09 /etc/condor/config.d/28rebooter.config
06/08/21 10:40:09 /etc/condor/config.d/29start.config
06/08/21 10:40:09 /etc/condor/config.d/30start_jobtypes.config
06/08/21 10:40:09 /etc/condor/config.d/30start_multicore.config
06/08/21 10:40:09 /etc/condor/config.d/41shared_port.config
06/08/21 10:40:09 /etc/condor/condor_config.local
06/08/21 10:40:09 config Macros = 161, Sorted = 161, StringBytes = 7389, TablesBytes = 5940
06/08/21 10:40:09 CLASSAD_CACHING is ENABLED
06/08/21 10:40:09 Daemon Log is logging: D_ALWAYS D_ERROR
06/08/21 10:40:09 Daemoncore: Listening at <[::]:9618> on TCP (ReliSock).
06/08/21 10:40:09 DaemonCore: command socket at <[2001:630:58:1c20::82f6:2d01]:9618?addrs=[2001-630-58-1c20--82f6-2d01]-9618&noUDP>
06/08/21 10:40:09 DaemonCore: private command socket at <[2001:630:58:1c20::82f6:2d01]:9618?addrs=[2001-630-58-1c20--82f6-2d01]-9618>
06/08/21 10:40:09 main_init() called
06/08/21 10:40:09 About to update statistics in shared_port daemon ad file at /var/lock/condor/shared_port_ad :
ForkedChildrenPeak = 0
RequestsBlocked = 0
ForkedChildrenCurrent = 0
RequestsSucceeded = 0
RequestsPendingPeak = 0
RequestsPendingCurrent = 0
RequestsFailed = 0
SharedPortCommandSinfuls = "<[2001:630:58:1c20::82f6:2d01]:9618>"
MyAddress = "<[2001:630:58:1c20::82f6:2d01]:9618?addrs=[2001-630-58-1c20--82f6-2d01]-9618&noUDP>"
06/08/21 10:40:09 attempt to connect to <[2001:630:58:1c20::82f6:2d01]:0> failed: Connection refused (connect errno = 111).
06/08/21 10:40:09 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <[2001:630:58:1c20::82f6:2d01]:0> (try 1 of 3): CEDAR:6001:Failed to connect to <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>
06/08/21 10:40:09 attempt to connect to <[2001:630:58:1c20::82f6:2d01]:0> failed: Connection refused (connect errno = 111).
06/08/21 10:40:09 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <[2001:630:58:1c20::82f6:2d01]:0> (try 2 of 3): CEDAR:6001:Failed to connect to <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>|CEDAR:6001:Failed to connect to <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>
06/08/21 10:40:09 attempt to connect to <[2001:630:58:1c20::82f6:2d01]:0> failed: Connection refused (connect errno = 111).
06/08/21 10:40:09 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <[2001:630:58:1c20::82f6:2d01]:0> (try 3 of 3): CEDAR:6001:Failed to connect to <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>|CEDAR:6001:Failed to connect to <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>|CEDAR:6001:Failed to connect to <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>
06/08/21 10:40:09 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <[2001:630:58:1c20::82f6:2d01]:0?sock=1677_dc69>" at line 241 in file /var/lib/condor/execute/slot1/dir_12537/userdir/.tmpfVvlO6/BUILD/condor-8.8.13/src/condor_daemon_core.V6/daemon_keep_alive.cpp
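
One way to test the "Master is coming up before something it needs" theory is to check, early in boot, how long the node's own hostname takes to resolve over IPv6. The following is only a rough diagnostic sketch in Python, not anything taken from HTCondor: the function name `wait_for_ipv6_resolution` and its parameters are made up for illustration, and the defaults of 20 tries and a 3 second sleep simply mirror the "try 2 of 20" and "sleeping 3 seconds" wording in the first MasterLog line.

```python
import socket
import time


def wait_for_ipv6_resolution(hostname, tries=20, delay=3.0,
                             resolver=socket.getaddrinfo, sleep=time.sleep):
    """Retry an IPv6 lookup of `hostname` while the resolver reports EAI_AGAIN.

    Returns the sorted list of IPv6 addresses once the lookup succeeds,
    or an empty list if every attempt ended in EAI_AGAIN.
    `resolver` and `sleep` are injectable so the logic can be tested offline.
    """
    for attempt in range(1, tries + 1):
        try:
            infos = resolver(hostname, None, socket.AF_INET6, socket.SOCK_STREAM)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # for AF_INET6 the address string is sockaddr[0].
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as err:
            if err.errno != socket.EAI_AGAIN:
                raise  # a permanent failure, not the temporary one seen in the log
            print(f"try {attempt} of {tries}: EAI_AGAIN for {hostname!r}")
            if attempt < tries:
                sleep(delay)
    return []
```

Calling `wait_for_ipv6_resolution("heplnc001.pp.rl.ac.uk")` from a unit that runs just before condor and logging the number of retries would show whether name resolution is still unavailable at the point the Master starts. It would not by itself explain the connection refused errors in the SharedPortLog.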