Hi all,

I have a somewhat flaky HA setup, which I am trying to understand. I have two CMs configured to act as HA CMs for each other (communicating via shared port):

  grid-htc-preprod-master01
  grid-htc-preprod-master02

The odd thing is that, after rebooting both machines into fresh sessions, grid-htc-preprod-master01 came up first and elevated itself to the cluster negotiator, which worked fine and was queryable from its sibling [1.a]. Then I stopped the condor unit on grid-htc-preprod-master01, and grid-htc-preprod-master02 took over [1.b]. So far, so good. However, when I started the condor unit on grid-htc-preprod-master01 again, the negotiators failed on both machines (while the HAD is still around [1.c]).

Judging from the log [2], I seem to be missing some kind of authorization. But in principle both machines have the same keys/tokens [3], so I would have assumed that both should be able to elevate themselves to the negotiator role back and forth.
My CM HA config looks like [4]. Condor is condor-25.0.3-1.el9.x86_64, running on RHEL 9.7.
Maybe somebody has an idea what might be amiss with my HA CMs and what could be broken with their capabilities?

Cheers,
  Thomas

[1.a]
[root@grid-htc-preprod-master02 ~]# condor_status -any -constraint 'MyType == "HAD"'
MyType  TargetType  Name
HAD     None        condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

[root@grid-htc-preprod-master02 ~]# condor_status -any -constraint 'MyType == "Negotiator"'
MyType      TargetType  Name
Negotiator  None        grid-htc-preprod-master01.desy.de

[1.b]
[root@grid-htc-preprod-master02 ~]# condor_status -any -constraint 'MyType == "HAD"'
MyType  TargetType  Name
HAD     None        condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

[root@grid-htc-preprod-master02 ~]# condor_status -any -constraint 'MyType == "Negotiator"'
MyType      TargetType  Name
Negotiator  None        grid-htc-preprod-master02.desy.de

[1.c]
[root@grid-htc-preprod-master02 ~]# condor_status -any -constraint 'MyType == "HAD"'
MyType  TargetType  Name
HAD     None        condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

[root@grid-htc-preprod-master02 ~]# condor_status -any -constraint 'MyType == "Negotiator"'
[root@grid-htc-preprod-master02 ~]# echo $?
0

[2]
<131.169.223.129:9618>.
12/17/25 15:39:22 ReplicatorStateMachine::inLeaderStateHandler started with state = 3
12/17/25 15:40:04 ReplicatorStateMachine::inLeaderStateHandler started with state = 3
12/17/25 15:40:36 DC_AUTHENTICATE: authentication of <131.169.223.129:14235> was successful but resulted in a limited authorization which did not include this command (802 REPLICATION_NEWLY_JOINED_VERSION), so aborting.
12/17/25 15:40:36 DC_AUTHENTICATE: Command not authorized, done!
12/17/25 15:40:36 DC_AUTHENTICATE: authentication of <131.169.223.129:4743> was successful but resulted in a limited authorization which did not include this command (804 REPLICATION_SOLICIT_VERSION), so aborting.
12/17/25 15:40:36 DC_AUTHENTICATE: Command not authorized, done!
12/17/25 15:40:47 ReplicatorStateMachine::inLeaderStateHandler started with state = 3
12/17/25 15:41:29 ReplicatorStateMachine::inLeaderStateHandler started with state = 3
12/17/25 15:42:11 ReplicatorStateMachine::inLeaderStateHandler started with state = 3

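As an aside, the rejected commands in a log like [2] can be tallied per command with a small shell helper. This is only a sketch: `denied_commands` is a made-up name, it relies on `grep -o` (available in GNU and BSD grep), and it assumes the log wording matches the DC_AUTHENTICATE lines quoted in [2].

```shell
#!/bin/sh
# Hypothetical helper (not part of HTCondor): read a daemon log on stdin and
# print "<count> <command number> <command name>" for every command that was
# rejected with a "limited authorization" message.
denied_commands() {
    # Keep only the "(NNN COMMAND_NAME)" part of matching lines ...
    grep -o 'did not include this command ([0-9][0-9]* [A-Z_]*)' \
        | sed 's/.*(\([0-9][0-9]*\) \([A-Z_]*\))/\1 \2/' \
        | sort \
        | uniq -c \
        | sed 's/^ *//'   # ... and strip the padding uniq -c adds.
}
```

It would be fed the replication daemon's log on stdin, for example `denied_commands < /path/to/ReplicationLog`, where the path is a placeholder that depends on the local log directory.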
[3]
[root@grid-htc-preprod-master01 ~]# sha256sum /etc/condor/passwords.d/*
cb92945df0cfb1...  /etc/condor/passwords.d/POOL
0ad5b76cf48f0a...  /etc/condor/passwords.d/POOL_COLLECTOR_SIGNING_KEY

[root@grid-htc-preprod-master02 ~]# sha256sum /etc/condor/passwords.d/*
cb92945df0cfb1...  /etc/condor/passwords.d/POOL
0ad5b76cf48f0a...  /etc/condor/passwords.d/POOL_COLLECTOR_SIGNING_KEY

[4]
#
# puppet: modules/htcondor/templates/etc/condor/config.d/01_CM_HA.conf.erb
#
# Hiera: values in hieradata/batch_ms.yaml
#
# (Attention: changes will be overwritten by puppet)
#
# https://htcondor.readthedocs.io/en/latest/admin-manual/cm-configuration.html#high-availability-of-the-central-manager
#
# Since we're using shared port, we set the port number to the shared
# port daemon's port number. NOTE: this assumes that each machine in
# the list is using the same port number for shared port. While this
# will be true by default, if you've changed it in configuration
# anywhere, you need to reflect that change here.

HAD_USE_SHARED_PORT = TRUE
HAD_LIST = \
  $(CENTRAL_MANAGER1):$(SHARED_PORT_PORT), \
  $(CENTRAL_MANAGER2):$(SHARED_PORT_PORT)

REPLICATION_USE_SHARED_PORT = TRUE
REPLICATION_LIST = \
  $(CENTRAL_MANAGER1):$(SHARED_PORT_PORT), \
  $(CENTRAL_MANAGER2):$(SHARED_PORT_PORT)

## If true, the first central manager in HAD_LIST is a primary.
HAD_USE_PRIMARY = TRUE

# If you change which daemon(s) you're making highly-available, you must
# change both of these values.
HAD_CONTROLLEE = NEGOTIATOR
MASTER_NEGOTIATOR_CONTROLLER = HAD

## THE FOLLOWING MAY DIFFER BETWEEN CENTRAL MANAGERS

# The daemon list may contain additional entries.
DAEMON_LIST = $(DAEMON_LIST), HAD, REPLICATION

# Using replication is optional.
HAD_USE_REPLICATION = TRUE

## HAD connection time.
## Recommended value is 2 if the central managers are on the same subnet.
## Recommended value is 5 if Condor security is enabled.
## Recommended value is 10 if the network is very slow, or
## to reduce the sensitivity of HA daemons to network failures.
HAD_CONNECTION_TIMEOUT = 5

# This is the default location for the state file.
STATE_FILE = $(SPOOL)/Accountantnew.log

## Period of time between two successive awakenings of the replication daemon
## Default: 300
REPLICATION_INTERVAL = 300

## Period of time, in which transferer daemons have to accomplish the
## downloading/uploading process
## Default: 300
MAX_TRANSFER_LIFETIME = 300

## Period of time between two successive sends of classads to the collector by HAD
## Default: 300
HAD_UPDATE_INTERVAL = 300

## The HAD controls the negotiator, and should have a larger
## backoff constant
MASTER_NEGOTIATOR_CONTROLLER = HAD
MASTER_HAD_BACKOFF_CONSTANT = 360