Hi,

The third issue to report for the 24.0 CMs is that HAD replication seems
broken in our setup. This would indeed represent a problem. The problem seems to be the one reported at
https://opensciencegrid.atlassian.net/browse/HTCONDOR-2453, i.e. the link-local address being returned when resolving the local hostname. At least we do see the link-local address (fe80::...) in the logs, as described in the bug report. We don't
see this for the 23.0 CMs. E.g. in the HADLog of sleepybird03:

12/13/24 10:29:23 HADStateMachine::initializeHADList my address '<188.184.103.96:51450?addrs=188.184.103.96-51450+[2001-1458-d00-3b--100-30d]-51450&alias=sleepybird03.cern.ch>' vs. address in the list '<[fe80::f816:3eff:fe7f:e18d]:51450>'

And in the MasterLog:

12/13/24 11:21:20 Started DaemonCore process "/usr/sbin/condor_replication", pid and pgroup = 409373
12/13/24 11:21:20 attempt to connect to <[fe80::f816:3eff:fe7f:e18d]:9618> failed: Invalid argument (connect errno = 22). Will keep trying for 20 total seconds (20 to go).
12/13/24 11:21:40 attempt to connect to <[fe80::f816:3eff:fe7f:e18d]:9618> failed: Invalid argument (connect errno = 22).
12/13/24 11:21:40 ERROR: SECMAN:2003:TCP connection to collector sleepybird03.cern.ch:9618 failed.

Finally, the terminating error shown in the HADLog is:

12/13/24 10:29:23 HAD CONFIGURATION ERROR: my address '<188.184.103.96:51450?addrs=188.184.103.96-51450+[2001-1458-d00-3b--100-30d]-51450&alias=sleepybird03.cern.ch>'is not present in HAD_LIST 'sleepybird01.cern.ch:51450, sleepybird02.cern.ch:51450, sleepybird03.cern.ch:51450'
12/13/24 10:29:23 main_shutdown_graceful

Shortly after, the condor_had process exits, so it cannot get messages from the other CMs.

I see the bug was fixed for 23.9.6, but it was not fixed for 24.0? Might there be a workaround for that (e.g. playing with nsswitch.conf)?

Thanks a lot.

Cheers,
Antonio
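
P.S. In case it helps to reproduce this outside of HTCondor, below is a minimal sketch of how one could check whether the system resolver returns a link-local address for the local hostname. It uses only the Python standard library; the helper names is_link_local and resolved_addresses are made up for this example. It is an assumption that the resolver output seen here matches what the HTCondor daemons obtain internally, so treat it as a rough diagnostic only.

```python
import ipaddress
import socket


def is_link_local(addr):
    """Return True if addr is a link-local IPv4 or IPv6 address."""
    # Drop an IPv6 zone suffix such as "%eth0" before parsing.
    bare = addr.split("%", 1)[0]
    return ipaddress.ip_address(bare).is_link_local


def resolved_addresses(hostname):
    """Return the distinct addresses the system resolver gives for hostname."""
    seen = []
    for info in socket.getaddrinfo(hostname, None):
        addr = info[4][0]  # sockaddr tuple: the address is its first element
        if addr not in seen:
            seen.append(addr)
    return seen


if __name__ == "__main__":
    host = socket.getfqdn()
    try:
        for addr in resolved_addresses(host):
            flag = "link-local" if is_link_local(addr) else "ok"
            print(host, addr, flag)
    except socket.gaierror as exc:
        print("could not resolve", host, exc)
```

If a fe80:: address shows up in the output on the 24.0 CMs but not on the 23.0 ones, that would at least be consistent with the behaviour described in the ticket.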