Hi Greg,
I ran some jobs while having two shells showing the outputs of
watch -n 1 tail -n 100 /var/log/condor/NegotiatorLog and
watch -n 1 tail -n 100 /var/log/condor/SchedLog on the CM and the Submit node, respectively.
What I can tell you for sure is that the problem is not with the starter. The job goes in running state and finishes within seconds after the two logs mentioned above report a successful negotiation.
The SchedLog showed the following:
- Shortly after submitting the job, it showed this:
04/24/23 10:24:52 (pid:84) (cid:35) actOnJobs: didn't do any work, aborting - A few seconds later, it looks like there was some trouble communicating with the negotiator:
04/24/23 10:24:57 (pid:84) Can't find address for negotiator
04/24/23 10:24:57 (pid:84) Failed to send RESCHEDULE to unknown daemon:
I had just restarted my test pool, maybe it did not have time to fully boot back up yet. - 15 seconds later, it could talk with the negotiator again:
04/24/23 10:25:11 (pid:84) Using negotiation protocol: NEGOTIATE
04/24/23 10:25:11 (pid:84) Negotiating for owner:
default.mmepi@xxxxxxxxxxxxxxxxxx
04/24/23 10:25:11 (pid:84) Rebuilt prioritized runnable job list in 0.000s.
04/24/23 10:25:11 (pid:84) Finished sending rrls to negotiator
04/24/23 10:25:11 (pid:84) Finished sending RRL for default.mmepi - For the next 5 minutes, it only showed these two lines, again and again:
04/24/23 10:30:10 (pid:84) Number of Active Workers 0
04/24/23 10:30:11 (pid:84) Received a superuser command - Then, exactly 5 minutes after step 3, this happened:
04/24/23 10:30:12 (pid:84) Activity on stashed negotiator socket: <172.21.0.1:40539>
04/24/23 10:30:12 (pid:84) Using negotiation protocol: NEGOTIATE
04/24/23 10:30:12 (pid:84) Negotiating for owner:
default.mmepi@xxxxxxxxxxxxxxxxxx
04/24/23 10:30:12 (pid:84) AutoCluster:config() significant attributes changed to MachineLastMatchTime Offline RemoteOwner RequestCpus RequestDisk RequestMemory TotalJobRuntime ConcurrencyLimits FlockTo Rank Requirements
04/24/23 10:30:12 (pid:84) Rebuilt prioritized runnable job list in 0.000s.
04/24/23 10:30:12 (pid:84) Finished sending rrls to negotiator
04/24/23 10:30:12 (pid:84) Finished sending RRL for default.mmepi
04/24/23 10:30:12 (pid:84) Activity on stashed negotiator socket: <172.21.0.1:40539>
04/24/23 10:30:12 (pid:84) Using negotiation protocol: NEGOTIATE
04/24/23 10:30:12 (pid:84) Negotiating for owner:
default.mmepi@xxxxxxxxxxxxxxxxxx
04/24/23 10:30:12 (pid:84) Negotiation ended - 1 jobs matched
04/24/23 10:30:12 (pid:84) Finished negotiating for default.mmepi in local pool: 1 matched, 0 rejected
04/24/23 10:30:12 (pid:84) Starting add_shadow_birthdate(1.0)
04/24/23 10:30:12 (pid:84) Started shadow for job 1.0 on
slot1@xxxxxxxxxxxxxxxxxxxxx <172.21.0.4:9618?CCBID=172.21.0.2:9618%3faddrs%3d172.21.0.2-9618%26alias%3dcondor.test.cm%26noUDP%26sock%3dcollector#2&PrivNet=condor.test.execute.1&addrs=172.21.0.4-9618&alias=condor.test.execute.1&noUDP&sock=startd_28_c8bd>
for default.mmepi, (shadow pid = 4196) - After that, the job went from Idle to Running and completed shortly after
For the NegotiatorLog:
- At the exact time where the above step 3 happened, there is a corresponding entry in the Negotiator logs:
04/24/23 10:25:11 ---------- Started Negotiation Cycle ----------
04/24/23 10:25:11 Phase 1: Obtaining ads from collector ...
04/24/23 10:25:11 Getting startd private ads ...
04/24/23 10:25:11 Getting Scheduler, Submitter and Machine ads ...
04/24/23 10:25:11 Sorting 1 ads ...
04/24/23 10:25:11 Got ads: 1 public and 0 private
04/24/23 10:25:11 Public ads include 1 submitter, 0 startd
04/24/23 10:25:11 Phase 2: Performing accounting ...
04/24/23 10:25:11 Phase 3: Sorting submitter ads by priority ...
04/24/23 10:25:11 Starting prefetch round; 1 potential prefetches to do.
04/24/23 10:25:11 Starting prefetch negotiation for
default.mmepi@xxxxxxxxxxxxxxxxxx.
04/24/23 10:25:11 Got NO_MORE_JOBS; schedd has no more requests
04/24/23 10:25:11 Prefetch summary: 1 attempted, 1 successful.
04/24/23 10:25:11 Phase 4.1: Negotiating with schedds ...
04/24/23 10:25:11 Negotiating with
default.mmepi@xxxxxxxxxxxxxxxxxx at <172.21.0.3:9618?addrs=172.21.0.3-9618&alias=condor.test.submit&noUDP&sock=schedd_65_29be>
04/24/23 10:25:11 0 seconds so far for this submitter
04/24/23 10:25:11 0 seconds so far for this schedd
04/24/23 10:25:11 Negotiation with
default.mmepi@xxxxxxxxxxxxxxxxxx skipped because submitterCeiling remaining is 2147483647
04/24/23 10:25:11 negotiateWithGroup resources used submitterAds length 0
04/24/23 10:25:11 ---------- Finished Negotiation Cycle ----------
Looks like this submitterCeiling thing is causing the delay ? - Then, absolutely nothing for 5 minutes:
04/24/23 10:29:52 ---------- Started Negotiation Cycle ----------
04/24/23 10:29:52 Phase 1: Obtaining ads from collector ...
04/24/23 10:29:52 Getting startd private ads ...
04/24/23 10:29:52 Getting Scheduler, Submitter and Machine ads ...
04/24/23 10:29:52 Sorting 1 ads ...
04/24/23 10:29:52 Got ads: 1 public and 1 private
04/24/23 10:29:52 Public ads include 0 submitter, 1 startd
04/24/23 10:29:52 Phase 2: Performing accounting ...
04/24/23 10:29:52 Phase 3: Sorting submitter ads by priority ...
04/24/23 10:29:52 Starting prefetch round; 0 potential prefetches to do.
04/24/23 10:29:52 Prefetch summary: 0 attempted, 0 successful.
04/24/23 10:29:52 Phase 4.1: Negotiating with schedds ...
04/24/23 10:29:52 negotiateWithGroup resources used submitterAds length 0
04/24/23 10:29:52 ---------- Finished Negotiation Cycle ---------- - And again, after exactly five minutes:
04/24/23 10:30:12 ---------- Started Negotiation Cycle ----------
04/24/23 10:30:12 Phase 1: Obtaining ads from collector ...
04/24/23 10:30:12 Getting startd private ads ...
04/24/23 10:30:12 Getting Scheduler, Submitter and Machine ads ...
04/24/23 10:30:12 Sorting 2 ads ...
04/24/23 10:30:12 Got ads: 2 public and 1 private
04/24/23 10:30:12 Public ads include 1 submitter, 1 startd
04/24/23 10:30:12 Phase 2: Performing accounting ...
04/24/23 10:30:12 Phase 3: Sorting submitter ads by priority ...
04/24/23 10:30:12 Starting prefetch round; 1 potential prefetches to do.
04/24/23 10:30:12 Starting prefetch negotiation for
default.mmepi@xxxxxxxxxxxxxxxxxx.
04/24/23 10:30:12 Got NO_MORE_JOBS; schedd has no more requests
04/24/23 10:30:12 Prefetch summary: 1 attempted, 1 successful.
04/24/23 10:30:12 Phase 4.1: Negotiating with schedds ...
04/24/23 10:30:12 Negotiating with
default.mmepi@xxxxxxxxxxxxxxxxxx at <172.21.0.3:9618?addrs=172.21.0.3-9618&alias=condor.test.submit&noUDP&sock=schedd_65_29be>
04/24/23 10:30:12 0 seconds so far for this submitter
04/24/23 10:30:12 0 seconds so far for this schedd
04/24/23 10:30:12 Request 00001.00000: autocluster 1 (request count 1 of 1)
04/24/23 10:30:12 Matched 1.0
default.mmepi@xxxxxxxxxxxxxxxxxx <172.21.0.3:9618?addrs=172.21.0.3-9618&alias=condor.test.submit&noUDP&sock=schedd_65_29be> preempting none <172.21.0.4:9618?CCBID=172.21.0.2:9618%3faddrs%3d172.21.0.2-9618%26alias%3dcondor.test.cm%26noUDP%26sock%3dcollector#2&PrivNet=condor.test.execute.1&addrs=172.21.0.4-9618&alias=condor.test.execute.1&noUDP&sock=startd_28_c8bd>
slot1@xxxxxxxxxxxxxxxxxxxxx
04/24/23 10:30:12 Successfully matched with
slot1@xxxxxxxxxxxxxxxxxxxxx
04/24/23 10:30:12 negotiateWithGroup resources used submitterAds length 1
04/24/23 10:30:12 ---------- Finished Negotiation Cycle ----------
And again, from there the job goes in running state and completes.
So, it looks like this submitterCeiling thing is what causes my issue. But I donât get it, the user I am using to submit has no other job in the pool, there are no other jobs running anywhere.
Why does it consider it has reached that ceiling, why does it wait exactly 5 minutes and why is it fine after 5 minutes ?
Thanks,
GaÃtan
Gaetan Geffroy
Junior Software Engineer
Terma GmbH
T +49 6151 86005 43 (direct)
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Greg Thain via HTCondor-users
Sent: Friday, April 21, 2023 20:01
To: htcondor-users@xxxxxxxxxxx
Cc: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Jobs take ages to start
CAUTION:
This email originated from outside of Terma. Do not click links or open attachments unless you recognize the sender and know the content is safe.
On 4/21/23 07:48, Gaetan Geffroy wrote:
Hi,
Some of my jobs stay in IDLE for a really long time compared to what I have configured.
On the submit node, I set NEGOTIATOR_INTERVAL to 5 seconds, expecting jobs to start extremely quickly after being submitted.
This is a local pool for test and training purposes, so I actually want the jobs to start and execute almost immediately.
When looking at the NegotiatorLog file, it looks like the negotiation cycles are happening as often as intended, yet most of my jobs stay in queue for 5 minutes.
Those are docker universe jobs. Condor_q -better-analyze tells me there are machines capable of running my jobs.
Even weirder, I tried to download one of my docker images as a TAR archive and load manually in Docker, and then the jobs using that image start immediately.
The same jobs using the same image but from a Docker repo take 5 minutes to start.
Is there another knob I could turn to speed up the process ?
This depends on where in the process the delay is. Do know know if this is a delay in matchmaking? in the schedd? In the starter once the job has begun to start?
-greg
|