[HTCondor-users] HAD replication and link-local addresses in 24.0

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Hi,

The third issue to report with our 24.0 CMs is that HAD replication appears to be broken in our setup, which is a real problem for us.

The problem seems to be the one reported at https://opensciencegrid.atlassian.net/browse/HTCONDOR-2453, i.e. a link-local address being returned when the local hostname is resolved. At least, we do see the link-local address (fe80::...) in the logs, as described in the bug report. We don't see this with 23.0 CMs.

E.g. in HADLog of sleepybird03:

12/13/24 10:29:23 HADStateMachine::initializeHADList my address '<188.184.103.96:51450?addrs=188.184.103.96-51450+[2001-1458-d00-3b--100-30d]-51450&alias=sleepybird03.cern.ch>' vs. address in the list '<[fe80::f816:3eff:fe7f:e18d]:51450>'

And in MasterLog:

12/13/24 11:21:20 Started DaemonCore process "/usr/sbin/condor_replication", pid and pgroup = 409373

12/13/24 11:21:20 attempt to connect to <[fe80::f816:3eff:fe7f:e18d]:9618> failed: Invalid argument (connect errno = 22). Will keep trying for 20 total seconds (20 to go).

12/13/24 11:21:40 attempt to connect to <[fe80::f816:3eff:fe7f:e18d]:9618> failed: Invalid argument (connect errno = 22).

12/13/24 11:21:40 ERROR: SECMAN:2003:TCP connection to collector sleepybird03.cern.ch:9618 failed.

Finally, the terminating error shown in HADLog is:

12/13/24 10:29:23 HAD CONFIGURATION ERROR: my address '<188.184.103.96:51450?addrs=188.184.103.96-51450+[2001-1458-d00-3b--100-30d]-51450&alias=sleepybird03.cern.ch>'is not present in HAD_LIST 'sleepybird01.cern.ch:51450, sleepybird02.cern.ch:51450, sleepybird03.cern.ch:51450'

12/13/24 10:29:23 main_shutdown_graceful

Shortly after, the condor_had process exits, so it cannot receive messages from the other CMs.
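
To find every log line that mentions a link-local endpoint, something like the sketch below could be piped over HADLog or MasterLog. It assumes endpoints appear in the bracketed `<[address]:port>` form seen in the excerpts above. The function name and the regular expression are illustrative only and are not part of HTCondor.

```python
import ipaddress
import re
import sys

# Assumption: IPv6 endpoints are logged as <[address]:port>, as in the excerpts above.
ENDPOINT_RE = re.compile(r"<\[([0-9A-Fa-f:.]+)\]:(\d+)")


def link_local_endpoints(line: str) -> list[tuple[str, int]]:
    """Return (address, port) pairs in a log line whose address is link-local."""
    found = []
    for addr, port in ENDPOINT_RE.findall(line):
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            # Not a parseable IP literal; ignore it.
            continue
        if ip.is_link_local:
            found.append((addr, int(port)))
    return found


if __name__ == "__main__":
    # Print only the lines that contain at least one link-local endpoint.
    for line in sys.stdin:
        if link_local_endpoints(line):
            print(line, end="")
```

If the block is saved under a hypothetical name such as `find_link_local.py`, it can be run as `python3 find_link_local.py < HADLog`, using the actual path to the log file.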

I see the bug was fixed in 23.9.6; was the fix not carried over to 24.0? Is there perhaps a workaround (e.g. playing with nsswitch.conf)?
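
For testing any resolver change, a minimal sketch like the one below lists what the system resolver returns for the local hostname and flags link-local entries. It assumes that Python's `socket.getaddrinfo` goes through the same system resolver path as the daemons, which may not hold exactly for HTCondor. The helper names are made up for illustration.

```python
import ipaddress
import socket


def is_link_local(addr: str) -> bool:
    """Return True if addr is an IPv4 or IPv6 link-local address."""
    # getaddrinfo may append a scope ID such as "%eth0"; drop it before parsing.
    return ipaddress.ip_address(addr.split("%", 1)[0]).is_link_local


def resolve(hostname: str) -> list[str]:
    """Return the unique addresses hostname resolves to, in resolver order."""
    infos = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
    seen = []
    for info in infos:
        addr = info[4][0]  # sockaddr tuple; first element is the address string
        if addr not in seen:
            seen.append(addr)
    return seen


if __name__ == "__main__":
    # Check the local fully qualified hostname, e.g. sleepybird03.cern.ch.
    for addr in resolve(socket.getfqdn()):
        flag = "link-local" if is_link_local(addr) else "ok"
        print(f"{addr}\t{flag}")
```

Running this before and after a change to nsswitch.conf (or /etc/hosts) would show whether the fe80:: entry is still being returned for the hostname.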

Thanks a lot.

Cheers,

Antonio
