Hi all,

while testing condor 23.9.6+dfsg-2.1 on fresh Debian trixie installs I noticed two problems:

(a) On a freshly installed machine the daemon starts up before our config management has fully configured it. That's something I am still trying to fix, but I don't think this is the root of the problem, unless this initial start-up saves its state really deep.
On first start-up it seems to use the first IP address it finds, e.g. from SharedPortLog:

MyAddress = "<172.23.33.3:9618?addrs=172.23.33.3-9618&alias=a3303&noUDP>"

Shortly after, when our config management finally sets NETWORK_INTERFACE = 10.10.33.3 and reloads the service, it kind of changes correctly:

08/12/25 12:31:38 Got SIGHUP. Re-reading config files.
08/12/25 12:31:38 main_config() called
08/12/25 12:31:38 About to update statistics in shared_port daemon ad file at /var/lock//shared_port_ad :
ForkedChildrenCurrent = 0
ForkedChildrenPeak = 0
MyAddress = "<172.23.33.3:9618?addrs=10.10.33.3-9618&alias=a3303&noUDP>"
[...]

Although I think I would rather see 10.10.33.3 there. Sometimes, errors like

08/12/25 13:05:10 attempt to connect to <172.23.33.3:0> failed: Connection refused (connect errno = 111).
08/12/25 13:05:10 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <172.23.33.3:0> (try 1 of 3): SECMAN:2003:TCP connection to daemon at <172.23.33.3:0> failed.

come up, but in this state the EP mostly seems to work, unless (b) happens.

A hard service restart would always yield the expected

MyAddress = "<10.10.33.3:9618?addrs=10.10.33.3-9618&alias=a3303.atlas.local&noUDP>"

but as this would kill running jobs at later times, I would like to continue defaulting to service reloads if possible.
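
To spot nodes that are in this half-updated state after a reload, a minimal Python sketch along the following lines could compare the primary address in a MyAddress value with its addrs= entries. The helper names primary_and_addrs and is_consistent are hypothetical, and the parsing is an assumption based only on the strings shown in this mail (IPv4 only, addrs entries of the form host-port, several entries assumed to be joined by "+"); it is not HTCondor's own parser.

```python
import re

# Assumed shape, taken from the log excerpts in this mail:
#   <host:port?addrs=host-port&alias=name&noUDP>
_SINFUL = re.compile(r"<([^:>?]+):(\d+)(?:\?([^>]*))?>")


def primary_and_addrs(my_address):
    """Return (primary_host, [hosts listed in addrs=]) for a MyAddress value."""
    match = _SINFUL.match(my_address)
    if match is None:
        raise ValueError("unrecognised address: %r" % (my_address,))
    primary = match.group(1)
    params = match.group(3) or ""
    hosts = []
    for part in params.split("&"):
        if part.startswith("addrs="):
            # Assumption: multiple entries are separated by "+".
            for entry in part[len("addrs="):].split("+"):
                # Each entry looks like 10.10.33.3-9618; drop the port.
                hosts.append(entry.rsplit("-", 1)[0])
    return primary, hosts


def is_consistent(my_address):
    """True if the primary host also appears in addrs= (or addrs= is absent)."""
    primary, hosts = primary_and_addrs(my_address)
    return not hosts or primary in hosts


if __name__ == "__main__":
    after_reload = "<172.23.33.3:9618?addrs=10.10.33.3-9618&alias=a3303&noUDP>"
    after_restart = "<10.10.33.3:9618?addrs=10.10.33.3-9618&alias=a3303.atlas.local&noUDP>"
    print(is_consistent(after_reload))
    print(is_consistent(after_restart))
```

Fed with the two MyAddress values quoted above, this flags the value seen after the reload as inconsistent and the value seen after the hard restart as consistent.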

(b) Sometimes too many errors seem to accumulate and the SharedPort daemon is restarted (logs are from a different node named a4611, with 172.23.46.11 being the first and unwanted interface address and 10.10.46.11 the wanted one):

08/12/25 10:36:40 attempt to connect to <172.23.46.11:0> failed: Connection refused (connect errno = 111).
08/12/25 10:36:40 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <172.23.46.11:0> (try 2 of 3): SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.|SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.
08/12/25 10:36:45 attempt to connect to <172.23.46.11:0> failed: Connection refused (connect errno = 111).
08/12/25 10:36:45 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <172.23.46.11:0> (try 3 of 3): SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.|SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.|SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.
08/12/25 10:38:29 Setting maximum file descriptors to 20000.
08/12/25 10:38:29 ******************************************************
08/12/25 10:38:29 ** condor_shared_port (CONDOR_SHARED_PORT) STARTING UP
08/12/25 10:38:29 ** /usr/libexec/condor/condor_shared_port
08/12/25 10:38:29 ** SubsystemInfo: name=SHARED_PORT type=SHARED_PORT(10) class=DAEMON(1)
08/12/25 10:38:29 ** Configuration: subsystem:SHARED_PORT local:<NONE> class:DAEMON
08/12/25 10:38:29 ** $CondorVersion: 23.9.6 2025-06-20 PackageID: 23.9.6+dfsg-2.1 $
08/12/25 10:38:29 ** $CondorPlatform: X86_64-Debian_13 $
08/12/25 10:38:29 ** PID = 51567 RealUID = 0
08/12/25 10:38:29 ** Log last touched 8/12 10:37:41
08/12/25 10:38:29 ******************************************************
08/12/25 10:38:29 Using config source: /etc/condor/condor_config
08/12/25 10:38:29 Using local config sources:
08/12/25 10:38:29 /etc/condor/config.d/00-htcondor-9.0.config
08/12/25 10:38:29 /etc/condor/config.d/01_generic
08/12/25 10:38:29 /etc/condor/config.d/10_EXECUTE
08/12/25 10:38:29 /etc/condor/config.d/20_CONTAINER
08/12/25 10:38:29 /etc/condor/condor_config.local
08/12/25 10:38:29 config Macros = 110, Sorted = 110, StringBytes = 3191, TablesBytes = 4032
08/12/25 10:38:29 CLASSAD_CACHING is ENABLED
08/12/25 10:38:29 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
08/12/25 10:38:29 Daemoncore: Listening at <0.0.0.0:9618> on TCP (ReliSock).
08/12/25 10:38:29 DaemonCore: command socket at <10.10.46.11:9618?addrs=10.10.46.11-9618&alias=a4611.atlas.local&noUDP>
08/12/25 10:38:29 DaemonCore: private command socket at <10.10.46.11:9618?addrs=10.10.46.11-9618&alias=a4611.atlas.local>
08/12/25 10:38:29 main_init() called
08/12/25 10:38:29 About to update statistics in shared_port daemon ad file at /var/lock//shared_port_ad :
ForkedChildrenCurrent = 0
ForkedChildrenPeak = 0
MyAddress = "<10.10.46.11:9618?addrs=10.10.46.11-9618&alias=a4611.atlas.local&noUDP>"
[...]

sometimes ending in

08/12/25 12:48:24 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <172.23.46.11:0?alias=a4611&sock=master_8648_631d>" at line 226 in file ./src/condor_daemon_core.V6/daemon_keep_alive.cpp

and the master then restarting the daemon after a one-hour timeout :(

In principle, I've got a couple of questions:

(1) Why is this happening and how do I fix it?
(2) Why does it always seem to fall back to the 172 address?
(3) Why does it get a connection refused? There is no packet filter installed and the daemon is listening on 0.0.0.0:9618:

tcp LISTEN 0 4096 0.0.0.0:9618 0.0.0.0:* users:(("condor_shared_p",pid=26034,fd=10)) uid:666 ino:134484 sk:1001 cgroup:/system.slice/condor.service/leaf

Thanks a lot in advance for any pointers to what I'm not understanding correctly.

Cheers

Carsten

PS: Praises to Tim, who seems to have already started on trixie packaging efforts! The pelican client seems to work already!

-- Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics, Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185