Re: [HTCondor-users] HAD replication and link-local addresses in 24.0

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

hosts: files dns myhostname

- Jaime

On Dec 13, 2024, at 4:52 AM, Antonio Delgado Peris <antonio.delgado.peris@xxxxxxx> wrote:

Hi,

The third issue to report with our 24.0 CMs is that HAD replication seems to be broken in our setup, which is a real problem for us.

The problem seems to be the one reported at https://opensciencegrid.atlassian.net/browse/HTCONDOR-2453, i.e. a link-local address being returned when the local hostname is resolved. At least, we do see the link-local address (fe80::...) in the logs, as described in the bug report. We don't see this with 23.0 CMs.
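For anyone wanting to check the same thing on their own host, here is a minimal sketch that resolves the local hostname through the system resolver and flags any link-local results. It is independent of HTCondor and does not reproduce its internal lookup logic; the function names are made up for illustration.

```python
import ipaddress
import socket


def link_local_addresses(addresses):
    """Return the entries of `addresses` that are link-local
    (fe80::/10 for IPv6, 169.254.0.0/16 for IPv4)."""
    found = []
    for text in addresses:
        # Resolver results may carry a zone id such as "%eth0"; drop it before parsing.
        bare = text.split("%", 1)[0]
        if ipaddress.ip_address(bare).is_link_local:
            found.append(text)
    return found


def resolve_local_hostname():
    """Resolve this machine's own hostname via the system resolver."""
    host = socket.gethostname()
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    addrs = resolve_local_hostname()
    print("resolved:  ", addrs)
    print("link-local:", link_local_addresses(addrs))
```

If the second line printed is non-empty, the resolver is handing back a link-local address for the host's own name, which would match the symptom described in the bug report.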

E.g. in HADLog of sleepybird03:

12/13/24 10:29:23 HADStateMachine::initializeHADList my address '<188.184.103.96:51450?addrs=188.184.103.96-51450+[2001-1458-d00-3b--100-30d]-51450&alias=sleepybird03.cern.ch>' vs. address in the list '<[fe80::f816:3eff:fe7f:e18d]:51450>'

And in MasterLog:

12/13/24 11:21:20 Started DaemonCore process "/usr/sbin/condor_replication", pid and pgroup = 409373

12/13/24 11:21:20 attempt to connect to <[fe80::f816:3eff:fe7f:e18d]:9618> failed: Invalid argument (connect errno = 22). Will keep trying for 20 total seconds (20 to go).

12/13/24 11:21:40 attempt to connect to <[fe80::f816:3eff:fe7f:e18d]:9618> failed: Invalid argument (connect errno = 22).

12/13/24 11:21:40 ERROR: SECMAN:2003:TCP connection to collector sleepybird03.cern.ch:9618failed.

Finally, the terminating error shown in HADLog is:

12/13/24 10:29:23 HAD CONFIGURATION ERROR: my address '<188.184.103.96:51450?addrs=188.184.103.96-51450+[2001-1458-d00-3b--100-30d]-51450&alias=sleepybird03.cern.ch>'is not present in HAD_LIST 'sleepybird01.cern.ch:51450, sleepybird02.cern.ch:51450, sleepybird03.cern.ch:51450'

12/13/24 10:29:23 main_shutdown_graceful

Shortly after, the condor_had process exits, so it cannot receive messages from the other CMs.

I see the bug was fixed in 23.9.6; was it not fixed in 24.0? Is there perhaps a workaround (e.g. playing with nsswitch.conf)?
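As a starting point for looking at nsswitch.conf, the sketch below lists the sources on its hosts line in the order the resolver would consult them. It assumes the usual /etc/nsswitch.conf location and the standard "database: source [action] ..." layout; the helper name is hypothetical, and the script only reads the file.

```python
import re


def hosts_sources(nsswitch_text):
    """Return the lookup sources on the `hosts:` line, in order.
    Returns an empty list if there is no such line."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("hosts:"):
            rest = line[len("hosts:"):]
            # Remove action items such as [NOTFOUND=return]; they are not sources.
            rest = re.sub(r"\[[^\]]*\]", " ", rest)
            return rest.split()
    return []


if __name__ == "__main__":
    # Assumed default path on Linux systems.
    with open("/etc/nsswitch.conf") as fh:
        print(hosts_sources(fh.read()))
```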

Thanks a lot.

Cheers,

Antonio

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/
