Dear HTCondor experts,

it seems our HA setup of two CMs is broken. I regularly get the following in TransferLog (with ADDRCM1 being the address of CM1 and ADDRCM2 being the address of CM2):

------------------------
04/12/18 12:27:41 SharedPortEndpoint: waiting for connections to named socket 2142_fb90
04/12/18 12:27:41 DaemonCore: command socket at <ADDRCM1:9618?addrs=ADDRCM1-9618+[--1]-9618&noUDP&sock=2142_fb90>
04/12/18 12:27:41 DaemonCore: private command socket at <ADDRCM1:9618?addrs=ADDRCM1-9618+[--1]-9618&noUDP&sock=2142_fb90>
04/12/18 12:27:41 BaseReplicaTransferer::reinitialize started
04/12/18 12:29:48 attempt to connect to <ADDRCM2:43586> failed: Connection timed out (connect errno = 110). Will keep trying for 2147483647 total seconds (2147483520 to go).
------------------------

The port number "43586" changes with each error message.

The configuration is as follows (with FQDNCM1 and FQDNCM2 being the real FQDNs, of course):

------------------------
SHARED_PORT_PORT = 9618
SHARED_PORT_ARGS = -p $(SHARED_PORT_PORT)
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = condor-cm1.physik.uni-bonn.de?sock=collector, condor-cm2.physik.uni-bonn.de?sock=collector
USE_SHARED_PORT = true
HAD_PORT = $(SHARED_PORT_PORT)
HAD_USE_SHARED_PORT = TRUE
REPLICATION_PORT = $(SHARED_PORT_PORT)
REPLICATION_USE_SHARED_PORT = TRUE
REPLICATION_LIST = FQDNCM1:$(REPLICATION_PORT), FQDNCM2:$(REPLICATION_PORT)
HAD_LIST = FQDNCM1:$(HAD_PORT), FQDNCM2:$(HAD_PORT)
------------------------

Can somebody tell me why the transferer tries to connect to an arbitrary port? This is naturally blocked by our firewall, since we are using shared_port mode.

Any help is appreciated.

Cheers,
Oliver