Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] multi-homes execute point drops sometimes from pool

Date: Tue, 12 Aug 2025 15:26:45 +0200
From: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
Subject: [HTCondor-users] multi-homes execute point drops sometimes from pool

Hi all,

while testing condor 23.9.6+dfsg-2.1 on fresh Debian trixie installs I noticed two problems:

(a)

On a freshly installed machine the daemon starts up before our config management has fully configured it. That's something I am still trying to fix, but I don't think this is the root of the problem, unless this initial start-up saves its state really deep.

On first start-up it seems to use the first IP address it finds, e.g. from SharedPortLog:


MyAddress = "<172.23.33.3:9618?addrs=172.23.33.3-9618&alias=a3303&noUDP>"

Shortly after, when our config management finally sets

NETWORK_INTERFACE = 10.10.33.3

and reloads the service it kind of changes correctly:

08/12/25 12:31:38 Got SIGHUP.  Re-reading config files.
08/12/25 12:31:38 main_config() called

08/12/25 12:31:38 About to update statistics in shared_port daemon ad file at /var/lock//shared_port_ad :

ForkedChildrenCurrent = 0
ForkedChildrenPeak = 0
MyAddress = "<172.23.33.3:9618?addrs=10.10.33.3-9618&alias=a3303&noUDP>"
[...]

Although I think I would rather see 10.10.33.3 there.
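
To spot this state on other nodes, the primary address of a MyAddress value can be compared with its addrs= list. The following is a minimal Python sketch, not an HTCondor API: the function names are hypothetical, it only handles the IPv4 form shown in the logs above, and it assumes that multiple addrs entries are separated by "+".

```python
import re

# Matches the address form seen in the logs above:
#   <primary_ip:port?key=value&key=value&flag>
# Assumption: IPv4 only; bracketed IPv6 addresses are not handled.
SINFUL_RE = re.compile(
    r"<(?P<host>[^:>]+):(?P<port>\d+)(?:\?(?P<params>[^>]*))?>"
)


def parse_sinful(text):
    """Return (primary_host, port, params) for the first address in text."""
    m = SINFUL_RE.search(text)
    if m is None:
        raise ValueError(f"no address found in: {text!r}")
    params = {}
    for item in (m.group("params") or "").split("&"):
        if not item:
            continue
        key, _, value = item.partition("=")
        params[key] = value  # flags such as noUDP get an empty value
    return m.group("host"), int(m.group("port")), params


def addrs_hosts(params):
    """Extract the IPs from addrs=, whose entries look like ip-port.

    Assumption: multiple entries are separated by '+'.
    """
    hosts = []
    for entry in params.get("addrs", "").split("+"):
        if entry:
            hosts.append(entry.rsplit("-", 1)[0])
    return hosts


def primary_matches_addrs(text):
    """True if the primary host also appears in the addrs= list."""
    host, _, params = parse_sinful(text)
    return host in addrs_hosts(params)


if __name__ == "__main__":
    after_reload = 'MyAddress = "<172.23.33.3:9618?addrs=10.10.33.3-9618&alias=a3303&noUDP>"'
    print(primary_matches_addrs(after_reload))
```

Run against the value logged after the reload, the check reports a mismatch; the value logged after a hard restart passes it.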

Sometimes, errors like

08/12/25 13:05:10 attempt to connect to <172.23.33.3:0> failed: Connection refused (connect errno = 111).
08/12/25 13:05:10 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <172.23.33.3:0> (try 1 of 3): SECMAN:2003:TCP connection to daemon at <172.23.33.3:0> failed.

come up, but in this state the EP mostly seems to work, unless (b) happens.


A hard service restart would always yield the expected

MyAddress = "<10.10.33.3:9618?addrs=10.10.33.3-9618&alias=a3303.atlas.local&noUDP>"

but as this would kill running jobs at later times, I would like to continue defaulting to service reloads if possible.

(b)

Sometimes too many errors seem to accumulate and the SharedPort daemon is restarted (logs are from a different node named a4611, with 172.23.46.11 being the first and unwanted interface address and 10.10.46.11 the wanted one):

08/12/25 10:36:40 attempt to connect to <172.23.46.11:0> failed: Connection refused (connect errno = 111).
08/12/25 10:36:40 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <172.23.46.11:0> (try 2 of 3): SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.|SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.
08/12/25 10:36:45 attempt to connect to <172.23.46.11:0> failed: Connection refused (connect errno = 111).
08/12/25 10:36:45 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <172.23.46.11:0> (try 3 of 3): SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.|SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.|SECMAN:2003:TCP connection to daemon at <172.23.46.11:0> failed.

08/12/25 10:38:29 Setting maximum file descriptors to 20000.
08/12/25 10:38:29 ******************************************************
08/12/25 10:38:29 ** condor_shared_port (CONDOR_SHARED_PORT) STARTING UP
08/12/25 10:38:29 ** /usr/libexec/condor/condor_shared_port

08/12/25 10:38:29 ** SubsystemInfo: name=SHARED_PORT type=SHARED_PORT(10) class=DAEMON(1)
08/12/25 10:38:29 ** Configuration: subsystem:SHARED_PORT local:<NONE> class:DAEMON
08/12/25 10:38:29 ** $CondorVersion: 23.9.6 2025-06-20 PackageID: 23.9.6+dfsg-2.1 $
08/12/25 10:38:29 ** $CondorPlatform: X86_64-Debian_13 $
08/12/25 10:38:29 ** PID = 51567 RealUID = 0
08/12/25 10:38:29 ** Log last touched 8/12 10:37:41
08/12/25 10:38:29 ******************************************************
08/12/25 10:38:29 Using config source: /etc/condor/condor_config
08/12/25 10:38:29 Using local config sources:
08/12/25 10:38:29    /etc/condor/config.d/00-htcondor-9.0.config
08/12/25 10:38:29    /etc/condor/config.d/01_generic
08/12/25 10:38:29    /etc/condor/config.d/10_EXECUTE
08/12/25 10:38:29    /etc/condor/config.d/20_CONTAINER
08/12/25 10:38:29    /etc/condor/condor_config.local

08/12/25 10:38:29 config Macros = 110, Sorted = 110, StringBytes = 3191, TablesBytes = 4032

08/12/25 10:38:29 CLASSAD_CACHING is ENABLED
08/12/25 10:38:29 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
08/12/25 10:38:29 Daemoncore: Listening at <0.0.0.0:9618> on TCP (ReliSock).

08/12/25 10:38:29 DaemonCore: command socket at <10.10.46.11:9618?addrs=10.10.46.11-9618&alias=a4611.atlas.local&noUDP>
08/12/25 10:38:29 DaemonCore: private command socket at <10.10.46.11:9618?addrs=10.10.46.11-9618&alias=a4611.atlas.local>

08/12/25 10:38:29 main_init() called

08/12/25 10:38:29 About to update statistics in shared_port daemon ad file at /var/lock//shared_port_ad :

ForkedChildrenCurrent = 0
ForkedChildrenPeak = 0

MyAddress = "<10.10.46.11:9618?addrs=10.10.46.11-9618&alias=a4611.atlas.local&noUDP>"


[...]

sometimes ending in

08/12/25 12:48:24 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <172.23.46.11:0?alias=a4611&sock=master_8648_631d>" at line 226 in file ./src/condor_daemon_core.V6/daemon_keep_alive.cpp


and the master then restarting the daemon after a one hour timeout :(
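
To gauge how often this happens on a node, the failed keep-alive attempts can be tallied per target address. The following is a minimal Python sketch that assumes the ChildAliveMsg line format quoted above; the function name is hypothetical and the log path in the usage note is an assumption, not taken from the configuration shown here.

```python
import re
from collections import Counter

# Matches the ChildAliveMsg failure lines quoted above and captures the
# parent address. \s* tolerates a missing space before "daemon", which
# can happen when wrapped log lines are joined.
FAIL_RE = re.compile(
    r"ChildAliveMsg: failed to send DC_CHILDALIVE to parent\s*daemon at <([^>]+)>"
)


def count_childalive_failures(lines):
    """Count failed DC_CHILDALIVE attempts per parent address."""
    counts = Counter()
    for line in lines:
        for m in FAIL_RE.finditer(line):
            counts[m.group(1)] += 1
    return counts
```

It could be used like this, with the path adjusted to wherever the SharedPortLog lives (the path below is only a guess):

```python
with open("/var/log/condor/SharedPortLog") as log:
    for address, n in count_childalive_failures(log).most_common():
        print(address, n)
```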

In principle, I've got a couple questions:

(1) why is this happening and how do I fix this?

(2) why does it always seem to fall back to the 172 address?

(3) why does it get a connection refused? There is no packet filter installed and the daemon is listening on 0.0.0.0:9618 :

tcp LISTEN 0 4096 0.0.0.0:9618 0.0.0.0:* users:(("condor_shared_p",pid=26034,fd=10)) uid:666 ino:134484 sk:1001 cgroup:/system.slice/condor.service/leaf

Thanks a lot in advance for any pointers to what I'm not understanding correctly.


Cheers

Carsten

PS: Praises to Tim who seems to have already started on trixie packaging efforts! pelican client seems to work already!


--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature

Follow-Ups:
- Re: [HTCondor-users] multi-homes execute point drops sometimes from pool
  - From: Todd L Miller
