
[HTCondor-users] CM HA failover working only in one direction



Hi all,

I have an HA setup that behaves like a ratchet: failover works in one direction only, and I am trying to understand why.

So, I have two CMs configured to act as HA CMs for each other:
  grid-htc-preprod-master01
  grid-htc-preprod-master02
(communicating via shared port)

The odd thing is that, after rebooting both machines into fresh sessions, grid-htc-preprod-master01 came up first and elevated itself to cluster negotiator, which worked fine and could be queried from its sibling [1.a]. Then I stopped the condor unit on grid-htc-preprod-master01, and grid-htc-preprod-master02 took over [1.b]. So far, so good. However, when I started the condor unit on grid-htc-preprod-master01 again, the negotiator failed on both machines, while the HAD was still around [1.c].

Judging from the log [2], some kind of authorization seems to be missing. In principle, though, both machines have the same keys/tokens [3], so I would have assumed that both should be able to elevate themselves to the negotiator role back and forth.

My CM HA config looks like [4]; Condor is condor-25.0.3-1.el9.x86_64, running on RHEL 9.7.

Maybe somebody has an idea what might be amiss with my HA CMs and what could be broken with their capabilities?

Cheers,
  Thomas


[1.a]
[root@grid-htc-preprod-master02 ~]# condor_status -any -constraint 'MyType == "HAD"'
MyType TargetType Name

HAD None condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[root@grid-htc-preprod-master02 ~]# condor_status -any -constraint 'MyType == "Negotiator"'
MyType TargetType Name

Negotiator         None               grid-htc-preprod-master01.desy.de

[1.b]
[root@grid-htc-preprod-master02 ~]# condor_status -any -constraint 'MyType == "HAD"'
MyType TargetType Name

HAD None condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[root@grid-htc-preprod-master02 ~]# condor_status -any -constraint 'MyType == "Negotiator"'
MyType TargetType Name

Negotiator         None               grid-htc-preprod-master02.desy.de

[1.c]
[root@grid-htc-preprod-master02 ~]# condor_status -any -constraint 'MyType == "HAD"'
MyType TargetType Name

HAD None condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[root@grid-htc-preprod-master02 ~]# condor_status -any -constraint 'MyType == "Negotiator"'
[root@grid-htc-preprod-master02 ~]# echo $?
0
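
As [1.c] shows, condor_status exits with 0 even when no ad matches the constraint, so a scripted check has to inspect the output rather than the exit code. A minimal sketch, assuming the default table output shown above; has_negotiator_ad is a hypothetical helper name, not part of HTCondor:

```python
def has_negotiator_ad(status_output):
    """Return True if any output line starts with the MyType 'Negotiator'.

    Assumes the default condor_status table layout seen in [1.a]-[1.c],
    where the first column of every ad line is MyType.
    """
    for line in status_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Negotiator":
            return True
    return False


# Output as in [1.b]: header line, blank line, one Negotiator ad.
with_ad = (
    "MyType TargetType Name\n"
    "\n"
    "Negotiator         None               grid-htc-preprod-master02.desy.de\n"
)
# Output as in [1.c]: nothing printed at all.
without_ad = ""

print(has_negotiator_ad(with_ad))
print(has_negotiator_ad(without_ad))
```

In practice the text passed in would be the captured output of the condor_status command from [1.c].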


[2]
<131.169.223.129:9618>.
12/17/25 15:39:22 ReplicatorStateMachine::inLeaderStateHandler started with state = 3
12/17/25 15:40:04 ReplicatorStateMachine::inLeaderStateHandler started with state = 3
12/17/25 15:40:36 DC_AUTHENTICATE: authentication of <131.169.223.129:14235> was successful but resulted in a limited authorization which did not include this command (802 REPLICATION_NEWLY_JOINED_VERSION), so aborting.
12/17/25 15:40:36 DC_AUTHENTICATE: Command not authorized, done!
12/17/25 15:40:36 DC_AUTHENTICATE: authentication of <131.169.223.129:4743> was successful but resulted in a limited authorization which did not include this command (804 REPLICATION_SOLICIT_VERSION), so aborting.
12/17/25 15:40:36 DC_AUTHENTICATE: Command not authorized, done!
12/17/25 15:40:47 ReplicatorStateMachine::inLeaderStateHandler started with state = 3
12/17/25 15:41:29 ReplicatorStateMachine::inLeaderStateHandler started with state = 3
12/17/25 15:42:11 ReplicatorStateMachine::inLeaderStateHandler started with state = 3
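
To see at a glance which commands were refused, the DC_AUTHENTICATE lines in [2] can be tallied with a small script. This is only a sketch: the regular expression is modeled solely on the two denial messages quoted above, other HTCondor versions may word the message differently, and denied_commands is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern derived only from the denial lines quoted in [2].
DENIED = re.compile(
    r"authentication of <(?P<peer>[^>]+)> was successful but resulted in "
    r"a limited authorization which did not include this command "
    r"\((?P<num>\d+) (?P<name>\w+)\)"
)


def denied_commands(log_text):
    """Count refused commands as (command number, command name) pairs."""
    return Counter(
        (int(m.group("num")), m.group("name"))
        for m in DENIED.finditer(log_text)
    )


# The three relevant lines from [2], copied verbatim.
sample = (
    "12/17/25 15:40:36 DC_AUTHENTICATE: authentication of <131.169.223.129:14235> "
    "was successful but resulted in a limited authorization which did not include "
    "this command (802 REPLICATION_NEWLY_JOINED_VERSION), so aborting.\n"
    "12/17/25 15:40:36 DC_AUTHENTICATE: Command not authorized, done!\n"
    "12/17/25 15:40:36 DC_AUTHENTICATE: authentication of <131.169.223.129:4743> "
    "was successful but resulted in a limited authorization which did not include "
    "this command (804 REPLICATION_SOLICIT_VERSION), so aborting.\n"
)

for (num, name), count in sorted(denied_commands(sample).items()):
    print(f"{num} {name}: refused {count}x")
```

For the excerpt in [2], both refused commands are replication commands (802 and 804), each refused once.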


[3]
[root@grid-htc-preprod-master01 ~]# sha256sum /etc/condor/passwords.d/*
cb92945df0cfb1...  /etc/condor/passwords.d/POOL
0ad5b76cf48f0a...  /etc/condor/passwords.d/POOL_COLLECTOR_SIGNING_KEY

[root@grid-htc-preprod-master02 ~]# sha256sum /etc/condor/passwords.d/*
cb92945df0cfb1...  /etc/condor/passwords.d/POOL
0ad5b76cf48f0a...  /etc/condor/passwords.d/POOL_COLLECTOR_SIGNING_KEY



[4]
#
# puppet: modules/htcondor/templates/etc/condor/config.d/01_CM_HA.conf.erb
#
# Hiera: values in hieradata/batch_ms.yaml
#
#         (Attention: changes will be overwritten by puppet)
#
# https://htcondor.readthedocs.io/en/latest/admin-manual/cm-configuration.html#high-availability-of-the-central-manager
#
# Since we're using shared port, we set the port number to the shared
# port daemon's port number.  NOTE: this assumes that each machine in
# the list is using the same port number for shared port.  While this
# will be true by default, if you've changed it in configuration any-
# where, you need to reflect that change here.

HAD_USE_SHARED_PORT = TRUE
HAD_LIST = \
$(CENTRAL_MANAGER1):$(SHARED_PORT_PORT), \
$(CENTRAL_MANAGER2):$(SHARED_PORT_PORT)

REPLICATION_USE_SHARED_PORT = TRUE
REPLICATION_LIST = \
$(CENTRAL_MANAGER1):$(SHARED_PORT_PORT), \
$(CENTRAL_MANAGER2):$(SHARED_PORT_PORT)


##If true, the first central manager in HAD_LIST is a primary.
HAD_USE_PRIMARY = TRUE

# If you change which daemon(s) you're making highly-available, you must
# change both of these values.
HAD_CONTROLLEE = NEGOTIATOR
MASTER_NEGOTIATOR_CONTROLLER = HAD

## THE FOLLOWING MAY DIFFER BETWEEN CENTRAL MANAGERS

# The daemon list may contain additional entries.
DAEMON_LIST = $(DAEMON_LIST), HAD, REPLICATION

# Using replication is optional.
HAD_USE_REPLICATION = TRUE


## HAD connection time.
## Recommended value is 2 if the central managers are on the same subnet.
## Recommended value is 5 if Condor security is enabled.
## Recommended value is 10 if the network is very slow, or
## to reduce the sensitivity of HA daemons to network failures.
HAD_CONNECTION_TIMEOUT = 5

# This is the default location for the state file.
STATE_FILE = $(SPOOL)/Accountantnew.log

## Period of time between two successive awakenings of the replication daemon
## Default: 300
REPLICATION_INTERVAL = 300

## Period of time, in which transferer daemons have to accomplish the
## downloading/uploading process
## Default: 300
MAX_TRANSFER_LIFETIME = 300


## Period of time between two successive sends of classads to the collector by HAD
## Default: 300
HAD_UPDATE_INTERVAL = 300


## The HAD controls the negotiator, and should have a larger
## backoff constant
MASTER_NEGOTIATOR_CONTROLLER = HAD
MASTER_HAD_BACKOFF_CONSTANT = 360

