
[HTCondor-users] multi-homed execute point sometimes drops from pool



Hi all,

while testing condor 23.9.6+dfsg-2.1 on fresh Debian trixie installs I noticed two problems:

(a)

On a freshly installed machine the daemon starts up before our config management has fully configured it. That's something I am still trying to fix, but I don't think this is the root of the problem, unless this initial start-up saves its state really deep.

On first start-up it seems to use the first IP address it finds, e.g. from SharedPortLog:

MyAddress = "<172.23.33.3:9618?addrs=172.23.33.3-9618&alias=a3303&noUDP>"

Shortly after, when our config management finally sets

NETWORK_INTERFACE = 10.10.33.3

and reloads the service it kind of changes correctly:

08/12/25 12:31:38 Got SIGHUP.  Re-reading config files.
08/12/25 12:31:38 main_config() called
08/12/25 12:31:38 About to update statistics in shared_port daemon ad file at /var/lock//shared_port_ad :
ForkedChildrenCurrent = 0
ForkedChildrenPeak = 0
MyAddress = "<172.23.33.3:9618?addrs=10.10.33.3-9618&alias=a3303&noUDP>"
[...]

However, I would expect to see 10.10.33.3 as the primary address there as well, not just in addrs.
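
To make the mismatch easier to spot across many nodes, the MyAddress string can be checked mechanically. The following is only a minimal sketch, assuming IPv4 addresses and assuming that multiple addrs entries are separated by '+'; the function names (parse_sinful, addrs_of, is_consistent) are made up for illustration and are not part of HTCondor:

```python
import re

# Matches '<host:port?key=val&key=val...>' for IPv4 hosts only (assumption).
SINFUL = re.compile(r"<([^:?>]+):(\d+)(?:\?([^>]*))?>")

def parse_sinful(sinful):
    """Split an address string into (host, port, params dict)."""
    m = SINFUL.fullmatch(sinful.strip().strip('"'))
    if m is None:
        raise ValueError(f"not an IPv4 address string: {sinful!r}")
    host, port, query = m.group(1), int(m.group(2)), m.group(3) or ""
    params = {}
    for item in query.split("&"):
        if item:
            key, _, value = item.partition("=")
            params[key] = value  # flags such as noUDP get an empty value
    return host, port, params

def addrs_of(params):
    """Return the (ip, port) pairs listed in the 'addrs' parameter."""
    result = []
    # Assumption: entries look like 'ip-port' and are joined with '+'.
    for entry in params.get("addrs", "").split("+"):
        if entry:
            ip, _, port = entry.rpartition("-")
            result.append((ip, int(port)))
    return result

def is_consistent(sinful, wanted_ip):
    """True if the primary host and every addrs entry use wanted_ip."""
    host, _, params = parse_sinful(sinful)
    return host == wanted_ip and all(ip == wanted_ip for ip, _ in addrs_of(params))

after_reload = '<172.23.33.3:9618?addrs=10.10.33.3-9618&alias=a3303&noUDP>'
after_restart = '<10.10.33.3:9618?addrs=10.10.33.3-9618&alias=a3303.atlas.local&noUDP>'
print(is_consistent(after_reload, "10.10.33.3"))   # False: primary host is still 172.23.33.3
print(is_consistent(after_restart, "10.10.33.3"))  # True
```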

Sometimes, errors like

08/12/25 13:05:10 attempt to connect to <172.23.33.3:0> failed: Connection refused (connect errno = 111).
08/12/25 13:05:10 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <172.23.33.3:0> (try 1 of 3): SECMAN:2003:TCP connection to daemon at <172.23.33.3:0> failed.

come up, but in this state the EP mostly seems to work, unless (b) happens.

A hard service restart would always yield the expected

MyAddress = "<10.10.33.3:9618?addrs=10.10.33.3-9618&alias=a3303.atlas.local&noUDP>"

but as this would kill running jobs later on, I would like to keep defaulting to service reloads if possible.

(b)

Sometimes too many errors seem to accumulate and the SharedPort daemon is restarted (logs are from a different node named a4611 with 172.23.46.11 being the first and unwanted interface address and 10.10.46.11 the wanted):

08/12/25 10:36:40 attempt to connect to <172.23.46.11:0> failed: Connection refused (connect errno = 111).
08/12/25 10:36:40 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <172.23.46.11:0> (try 2 of 3): SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.|SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.
08/12/25 10:36:45 attempt to connect to <172.23.46.11:0> failed: Connection refused (connect errno = 111).
08/12/25 10:36:45 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <172.23.46.11:0> (try 3 of 3): SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.|SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.|SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.
08/12/25 10:38:29 Setting maximum file descriptors to 20000.
08/12/25 10:38:29 ******************************************************
08/12/25 10:38:29 ** condor_shared_port (CONDOR_SHARED_PORT) STARTING UP
08/12/25 10:38:29 ** /usr/libexec/condor/condor_shared_port
08/12/25 10:38:29 ** SubsystemInfo: name=SHARED_PORT type=SHARED_PORT(10) class=DAEMON(1)
08/12/25 10:38:29 ** Configuration: subsystem:SHARED_PORT local:<NONE> class:DAEMON
08/12/25 10:38:29 ** $CondorVersion: 23.9.6 2025-06-20 PackageID: 23.9.6+dfsg-2.1 $
08/12/25 10:38:29 ** $CondorPlatform: X86_64-Debian_13 $
08/12/25 10:38:29 ** PID = 51567 RealUID = 0
08/12/25 10:38:29 ** Log last touched 8/12 10:37:41
08/12/25 10:38:29 ******************************************************
08/12/25 10:38:29 Using config source: /etc/condor/condor_config
08/12/25 10:38:29 Using local config sources:
08/12/25 10:38:29    /etc/condor/config.d/00-htcondor-9.0.config
08/12/25 10:38:29    /etc/condor/config.d/01_generic
08/12/25 10:38:29    /etc/condor/config.d/10_EXECUTE
08/12/25 10:38:29    /etc/condor/config.d/20_CONTAINER
08/12/25 10:38:29    /etc/condor/condor_config.local
08/12/25 10:38:29 config Macros = 110, Sorted = 110, StringBytes = 3191, TablesBytes = 4032
08/12/25 10:38:29 CLASSAD_CACHING is ENABLED
08/12/25 10:38:29 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
08/12/25 10:38:29 Daemoncore: Listening at <0.0.0.0:9618> on TCP (ReliSock).
08/12/25 10:38:29 DaemonCore: command socket at <10.10.46.11:9618?addrs=10.10.46.11-9618&alias=a4611.atlas.local&noUDP>
08/12/25 10:38:29 DaemonCore: private command socket at <10.10.46.11:9618?addrs=10.10.46.11-9618&alias=a4611.atlas.local>
08/12/25 10:38:29 main_init() called
08/12/25 10:38:29 About to update statistics in shared_port daemon ad file at /var/lock//shared_port_ad :
ForkedChildrenCurrent = 0
ForkedChildrenPeak = 0
MyAddress = "<10.10.46.11:9618?addrs=10.10.46.11-9618&alias=a4611.atlas.local&noUDP>"

[...]

sometimes ending in

08/12/25 12:48:24 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <172.23.46.11:0?alias=a4611&sock=master_8648_631d>" at line 226 in file ./src/condor_daemon_core.V6/daemon_keep_alive.cpp

and the master then restarting the daemon after a one hour timeout :(
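
To see how often these connection failures occur and which address they target, the log can be tallied. This is only a rough sketch, assuming the "attempt to connect to <...> failed" wording shown in the excerpts above; the helper names are hypothetical. Note that every refused target in the excerpts carries port 0:

```python
import re
from collections import Counter

# Pattern taken from the log excerpts above; other log formats are not covered.
CONNECT_FAIL = re.compile(r"attempt to connect to <([^:>]+):(\d+)[^>]*> failed")

def failed_targets(log_text):
    """Count failed connection attempts per (host, port) target."""
    return Counter((host, int(port)) for host, port in CONNECT_FAIL.findall(log_text))

def port_zero_targets(log_text):
    """Hosts that were contacted on port 0, which nothing can listen on."""
    return sorted(host for (host, port) in failed_targets(log_text) if port == 0)

sample = (
    "08/12/25 10:36:40 attempt to connect to <172.23.46.11:0> failed: "
    "Connection refused (connect errno = 111).\n"
    "08/12/25 10:36:45 attempt to connect to <172.23.46.11:0> failed: "
    "Connection refused (connect errno = 111).\n"
)
print(failed_targets(sample))
print(port_zero_targets(sample))
```

In practice the text would be read from the SharedPortLog file instead of the inline sample.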

Basically, I have a couple of questions:

(1) why is this happening and how do I fix this?

(2) why does it always seem to fall back to the 172 address?

(3) why does it get a connection refused? There is no packet filter installed and the daemon is listening on 0.0.0.0:9618 :

tcp LISTEN 0 4096 0.0.0.0:9618 0.0.0.0:* users:(("condor_shared_p",pid=26034,fd=10)) uid:666 ino:134484 sk:1001 cgroup:/system.slice/condor.service/leaf

Thanks a lot in advance for any pointers to what I am not understanding correctly.

Cheers

Carsten

PS: Praises to Tim who seems to have already started on trixie packaging efforts! pelican client seems to work already!

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185
