Hi Gaetan,
I would agree that this seems like some sort of configuration issue. A good quick starting point, if you haven't done so already, is to run
condor_config_val -summary. This will show all of the configuration macros that are set or changed.
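Given the CCB warning in the ScheddLog below, it may also help to compare the network-related settings between the Kubernetes submit node and one of the execute nodes, along these lines (just a sketch; adjust to your setup):

  condor_config_val -summary
  condor_config_val PRIVATE_NETWORK_NAME CCB_ADDRESS TCP_FORWARDING_HOST

If PRIVATE_NETWORK_NAME comes back different on the two sides, that is the situation the schedd's warning is describing.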
-Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Gaetan Geffroy <gage@xxxxxxxxx>
Sent: Friday, May 26, 2023 8:05 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Giovanni Scotti <gis@xxxxxxxxx>; Ruediger Gad <ruga@xxxxxxxxx>
Subject: [HTCondor-users] CCB failing, causing jobs to stay IDLE forever

Hi,
I have an HTCondor cluster (CM + Submit + 2 Execute nodes) running version 10.0.3 on the company network. When I try to submit jobs to it from an execute node on that same network, everything looks fine. I also have some submit nodes started inside a Kubernetes cluster. When trying to submit jobs from there, they stay IDLE forever. Before going further, I should mention that we had it working before, but we accidentally installed 10.4.0 instead of 10.0.3, and it started failing only after switching to the correct version. I don't believe the configuration files changed after the version change, but I can't guarantee it either.
In the NegotiatorLog I can see this at every cycle:

05/26/23 14:50:08 ---------- Started Negotiation Cycle ----------
05/26/23 14:50:08 Phase 1: Obtaining ads from collector ...
05/26/23 14:50:08 Getting startd private ads ...
05/26/23 14:50:08 Getting Scheduler, Submitter and Machine ads ...
05/26/23 14:50:08 Sorting 9 ads ...
05/26/23 14:50:08 Got ads: 9 public and 8 private
05/26/23 14:50:08 Public ads include 1 submitter, 8 startd
05/26/23 14:50:08 Phase 2: Performing accounting ...
05/26/23 14:50:08 Phase 3: Sorting submitter ads by priority ...
05/26/23 14:50:08 Starting prefetch round; 1 potential prefetches to do.
05/26/23 14:50:08 Starting prefetch negotiation for my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
05/26/23 14:50:08 Got NO_MORE_JOBS; schedd has no more requests
05/26/23 14:50:08 Prefetch summary: 1 attempted, 1 successful.
05/26/23 14:50:08 Phase 4.1: Negotiating with schedds ...
05/26/23 14:50:08 Negotiating with my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx at <172.17.0.15:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4403&PrivNet=condor-submit.kubernetes.cluster.local&addrs=172.17.0.15-9618&alias=condor-submit.kubernetes.cluster.local&noUDP&sock=schedd_63_1af6>
05/26/23 14:50:08 0 seconds so far for this submitter
05/26/23 14:50:08 0 seconds so far for this schedd
05/26/23 14:50:08 Request 00001.00000: autocluster 1 (request count 1 of 1)
05/26/23 14:50:08 Matched 1.0 my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <172.17.0.15:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4403&PrivNet=condor-submit.kubernetes.cluster.local&addrs=172.17.0.15-9618&alias=condor-submit.kubernetes.cluster.local&noUDP&sock=schedd_63_1af6> preempting none <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx
05/26/23 14:50:08 Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx
05/26/23 14:50:08 negotiateWithGroup resources used submitterAds length 0
05/26/23 14:50:08 ---------- Finished Negotiation Cycle ----------
In the ScheddLog:

05/26/23 12:53:28 (pid:102) Number of Active Workers 0
05/26/23 12:53:38 (pid:102) Number of Active Workers 0
05/26/23 12:53:49 (pid:102) Number of Active Workers 0
05/26/23 12:53:59 (pid:102) Number of Active Workers 0
05/26/23 12:54:08 (pid:102) Activity on stashed negotiator socket: <10.1.65.126:9618>
05/26/23 12:54:08 (pid:102) Using negotiation protocol: NEGOTIATE
05/26/23 12:54:08 (pid:102) Negotiating for owner: my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
05/26/23 12:54:08 (pid:102) Finished sending rrls to negotiator
05/26/23 12:54:08 (pid:102) Finished sending RRL for my-user
05/26/23 12:54:08 (pid:102) Activity on stashed negotiator socket: <10.1.65.126:9618>
05/26/23 12:54:08 (pid:102) Using negotiation protocol: NEGOTIATE
05/26/23 12:54:08 (pid:102) Negotiating for owner: my-user@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
05/26/23 12:54:08 (pid:102) SECMAN: removing lingering non-negotiated security session <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d>#1685101045#1 because it conflicts with new request
05/26/23 12:54:08 (pid:102) CCBClient: WARNING: trying to connect to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user via CCB, but this appears to be a connection from one private network to another, which is not supported by CCB. Either that, or you have not configured the private network name to be the same in these two networks when it really should be. Assuming the latter.
05/26/23 12:54:08 (pid:102) Negotiation ended - 1 jobs matched
05/26/23 12:54:08 (pid:102) Finished negotiating for my-user in local pool: 1 matched, 0 rejected
05/26/23 12:54:09 (pid:102) Number of Active Workers 0
05/26/23 12:54:11 (pid:102) CCBClient: received failure message from CCB server 10.1.65.126:9618?addrs=10.1.65.126-9618&alias=condor-cm.my-company.com&noUDP&sock=collector in response to (non-blocking) request for reversed connection to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user: failed to connect
05/26/23 12:54:11 (pid:102) CCBClient: no more CCB servers to try for requesting reversed connection to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user; giving up.
05/26/23 12:54:11 (pid:102) Failed to send REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user: SECMAN:2003:TCP connection to startd slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user failed.
05/26/23 12:54:11 (pid:102) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx <10.1.65.124:9618?CCBID=10.1.65.126:9618%3faddrs%3d10.1.65.126-9618%26alias%3dcondor-cm.my-company.com%26noUDP%26sock%3dcollector#4400&PrivNet=condor-node1.my-company.com&addrs=10.1.65.124-9618&alias=condor-node1.my-company.com&noUDP&sock=startd_5479_b77d> for my-user, 1.0) deleted
In the CollectorLog:

05/26/23 14:57:00 Got QUERY_STARTD_ADS
05/26/23 14:57:00 QueryWorker: forked new worker with id 63152 ( max 4 active 1 pending 0 )
05/26/23 14:57:00 WARNING: forward resolution of localhost6 doesn't match 127.0.0.1!
05/26/23 14:57:00 WARNING: forward resolution of localhost6.localdomain6 doesn't match 127.0.0.1!
05/26/23 14:57:00 (Sending 8 ads in response to query)
05/26/23 14:57:00 Query info: matched=8; skipped=0; query_time=0.002500; send_time=0.000840; type=Machine; requirements={true}; locate=0; limit=0; from=TOOL; peer=<127.0.0.1:38076>; projection={Activity Arch CondorLoadAvg EnteredCurrentActivity LastHeardFrom Machine Memory MyCurrentTime Name OpSys State}; filter_private_attrs=1
In the StartLog on the worker nodes: literally nothing, no error, no warning.
I guess this is caused by something that is misconfigured, but I can't find what it is. CCB seems to be the reason my jobs never start, because the schedd can't connect to the startd daemons outside of the Kubernetes cluster. What I don't get is that it connects to the CM just fine, and that one is outside of the Kubernetes cluster as well.
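In case it is useful, the settings I suspect are involved, based on that warning, are roughly of this shape (an illustrative sketch with placeholder values, not copied from our actual files):

  # From the CCBID in the addresses above, both the startds and the Kubernetes
  # schedd appear to point CCB at the CM's collector:
  CCB_ADDRESS = $(COLLECTOR_HOST)
  # The warning suggests the schedd and the startd advertise different private
  # network names; placeholder value below, not our real setting:
  # PRIVATE_NETWORK_NAME = my-company.com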
Thanks,
Gaëtan