
Re: [HTCondor-users] HAD replication and link-local addresses in 24.0



The fix attempted in 23.9.6 was defective. Version 24.0.3 will have a proper fix (https://opensciencegrid.atlassian.net/browse/HTCONDOR-2746).
As a workaround, you can modify the "hosts" line in /etc/nsswitch.conf so that "myhostname" appears last:

hosts:      files dns myhostname
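
If the existing line lists its sources in a different order, the reordering can be sketched as below. This is only an illustration, assuming Python 3 is available; the helper name move_myhostname_last is hypothetical, it works on the file's text rather than editing /etc/nsswitch.conf in place, and it deliberately leaves alone any hosts line that contains bracketed action items or a trailing comment.

```python
def move_myhostname_last(nsswitch_text: str) -> str:
    """Return nsswitch.conf text with 'myhostname' moved to the end of the hosts line.

    Hypothetical helper for illustration only. Lines containing bracketed
    action items (e.g. [NOTFOUND=return]) or a '#' comment are left
    unchanged, because reordering them safely needs more care.
    """
    out = []
    for line in nsswitch_text.splitlines():
        key, sep, rest = line.partition(":")
        sources = rest.split()
        simple = "[" not in rest and "#" not in rest
        if key.strip() == "hosts" and sep and simple and "myhostname" in sources:
            # Keep the other sources in their original order, then append myhostname.
            reordered = [s for s in sources if s != "myhostname"] + ["myhostname"]
            line = "hosts:      " + " ".join(reordered)
        out.append(line)
    return "\n".join(out) + "\n"
```

Review the result by hand before writing it back, since distributions ship different default hosts lines.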

 - Jaime

On Dec 13, 2024, at 4:52 AM, Antonio Delgado Peris <antonio.delgado.peris@xxxxxxx> wrote:

Hi,
 
The third issue to report with the 24.0 CMs is that HAD replication seems to be broken in our setup. This would be a real problem for us.
 
The problem seems to be the one reported at https://opensciencegrid.atlassian.net/browse/HTCONDOR-2453, i.e. a link-local address being returned when the local hostname is resolved. At least we do see the link-local address (fe80::...) in the logs, as described in the bug report. We don't see this with the 23.0 CMs.
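
A quick way to check what the system resolver returns for the local hostname can be sketched as follows. It assumes Python 3 on the CM; the helper names resolve_hostname and link_local_only are illustrative and not part of HTCondor, and the check only reflects the system's getaddrinfo path, which may not match HTCondor's own address selection exactly.

```python
import ipaddress
import socket


def link_local_only(addresses):
    """Return the address strings that are link-local (fe80::/10 or 169.254.0.0/16)."""
    found = []
    for addr in addresses:
        # Drop an IPv6 zone suffix such as "%eth0" before parsing.
        ip = ipaddress.ip_address(addr.split("%", 1)[0])
        if ip.is_link_local:
            found.append(addr)
    return found


def resolve_hostname(name=None):
    """Return the distinct addresses getaddrinfo reports for name, in resolver order.

    With no argument, the local hostname from socket.gethostname() is used.
    """
    name = name or socket.gethostname()
    infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    addrs = []
    for info in infos:
        addr = info[4][0]  # sockaddr tuple starts with the address string
        if addr not in addrs:
            addrs.append(addr)
    return addrs
```

Calling link_local_only(resolve_hostname()) on an affected CM should return a non-empty list if the resolver is handing back an fe80:: address for the local hostname.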
 
E.g. in HADLog of sleepybird03:
 
12/13/24 10:29:23 HADStateMachine::initializeHADList my address '<188.184.103.96:51450?addrs=188.184.103.96-51450+[2001-1458-d00-3b--100-30d]-51450&alias=sleepybird03.cern.ch>' vs. address in the list '<[fe80::f816:3eff:fe7f:e18d]:51450>'
 
And in MasterLog:
 
12/13/24 11:21:20 Started DaemonCore process "/usr/sbin/condor_replication", pid and pgroup = 409373
12/13/24 11:21:20 attempt to connect to <[fe80::f816:3eff:fe7f:e18d]:9618> failed: Invalid argument (connect errno = 22).  Will keep trying for 20 total seconds (20 to go).
12/13/24 11:21:40 attempt to connect to <[fe80::f816:3eff:fe7f:e18d]:9618> failed: Invalid argument (connect errno = 22).
12/13/24 11:21:40 ERROR: SECMAN:2003:TCP connection to collector sleepybird03.cern.ch:9618failed.
 
 
Finally, the terminating error shown in HADLog is:
 
12/13/24 10:29:23 HAD CONFIGURATION ERROR:  my address '<188.184.103.96:51450?addrs=188.184.103.96-51450+[2001-1458-d00-3b--100-30d]-51450&alias=sleepybird03.cern.ch>'is not present in HAD_LIST 'sleepybird01.cern.ch:51450, sleepybird02.cern.ch:51450, sleepybird03.cern.ch:51450'
12/13/24 10:29:23 main_shutdown_graceful
 
Shortly after, the condor_had process exits, so it cannot receive messages from the other CMs.
 
I see the bug was fixed in 23.9.6, but apparently not in 24.0? Is there perhaps a workaround (e.g. adjusting nsswitch.conf)?
 
Thanks a lot.
 
Cheers,
   Antonio
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/