Mailing List Archives
Authenticated access
|
|
|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[HTCondor-users] job has all necessary slots claimed but still is idle
- Date: Thu, 10 Jun 2021 23:15:54 +0300 (MSK)
- From: "Stanislav V. Markevich" <stanislav.markevich@xxxxxxxxxxxxxx>
- Subject: [HTCondor-users] job has all necessary slots claimed but still is idle
Hi all,
My job stuck in IDLE state and I cannot understand why.
It's the only job:
-- Schedd: parallel_schedd@xxxxxxxxxxxxxxxxxx : <10.42.0.253:45173?... @ 06/10/21 18:26:37
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
user20002 ID: 49 6/10 17:06 _ _ 2 2 49.0-1
Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
There are suitable slots (dynamic slots created for this job):
Name OpSys State Activity Cpus Memory
slot1@xxxxxxxxxxxxxxxxxxxx LINUX Unclaimed Idle 75 1544
slot1_1@xxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 25 256
slot1@xxxxxxxxxxxxxxxxxxxx LINUX Unclaimed Idle 0 8
slot1_1@xxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 100 1792
But "condor_q -better-analyze" says there is no successful match (and last failed match was an hour ago):
-- Schedd: parallel_schedd@xxxxxxxxxxxxxxxxxx : <10.42.0.253:45173?...
The Requirements expression for job 49.000 is
...
Job 49.000 defines the following attributes:
DiskUsage = 0
FileSystemDomain = "p7-htcondor"
NumJobStarts = 0
RequestCpus = 25
RequestDisk = DiskUsage
RequestMemory = 256
The Requirements expression for job 49.000 reduces to these conditions:
...
No successful match recorded.
Last failed match: Thu Jun 10 17:26:33 2021
Reason for last match failure: PREEMPTION_REQUIREMENTS == False
049.000: Run analysis summary ignoring user priority. Of 2 machines,
0 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
2 are able to run your job
The Requirements expression for job 49.001 is
...
Job 49.001 defines the following attributes:
DiskUsage = 42
FileSystemDomain = "p7-htcondor"
NumJobStarts = 0
RequestCpus = 100
RequestDisk = DiskUsage
RequestMemory = 1792
The Requirements expression for job 49.001 reduces to these conditions:
...
No successful match recorded.
Last failed match: Thu Jun 10 17:26:33 2021
Reason for last match failure: no match found
049.001: Run analysis summary ignoring user priority. Of 2 machines,
0 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
2 are able to run your job
And there are successful matches in logs:
17:57:50.766 condor_negotiator[51]: ---------- Started Negotiation Cycle ----------
17:57:50.766 condor_negotiator[51]: Phase 1: Obtaining ads from collector ...
17:57:50.766 condor_negotiator[51]: Getting startd private ads ...
17:57:50.771 condor_negotiator[51]: Getting Scheduler, Submitter and Machine ads ...
17:57:50.777 condor_negotiator[51]: Sorting 4 ads ...
17:57:50.777 condor_negotiator[51]: Got ads: 4 public and 2 private
17:57:50.777 condor_negotiator[51]: Public ads include 1 submitter, 2 startd
17:57:50.777 condor_negotiator[51]: Phase 2: Performing accounting ...
17:57:50.789 condor_schedd[143]: Activity on stashed negotiator socket: <10.42.0.18:35399>
17:57:50.789 condor_schedd[143]: Using negotiation protocol: NEGOTIATE
17:57:50.789 condor_schedd[143]: Negotiating for owner: DedicatedScheduler@parallel_schedd@submit.p7-htcondor
17:57:50.789 condor_negotiator[51]: Phase 3: Sorting submitter ads by priority ...
17:57:50.789 condor_negotiator[51]: Starting prefetch round; 1 potential prefetches to do.
17:57:50.789 condor_negotiator[51]: Starting prefetch negotiation for DedicatedScheduler@parallel_schedd@submit.p7-htcondor.
17:57:50.791 condor_schedd[143]: Finished sending rrls to negotiator
17:57:50.791 condor_schedd[143]: Finished negotiating for DedicatedScheduler@parallel_schedd in local pool: 0 matched, 0 rejected
17:57:50.791 condor_schedd[143]: Activity on stashed negotiator socket: <10.42.0.18:35399>
17:57:50.791 condor_schedd[143]: Using negotiation protocol: NEGOTIATE
17:57:50.791 condor_schedd[143]: Negotiating for owner: DedicatedScheduler@parallel_schedd@submit.p7-htcondor
17:57:50.791 condor_negotiator[51]: Got NO_MORE_JOBS; schedd has no more requests
17:57:50.791 condor_negotiator[51]: Prefetch summary: 1 attempted, 1 successful.
17:57:50.791 condor_negotiator[51]: Phase 4.1: Negotiating with schedds ...
17:57:50.791 condor_negotiator[51]: Negotiating with DedicatedScheduler@parallel_schedd@submit.p7-htcondor at <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6>
17:57:50.791 condor_negotiator[51]: 0 seconds so far for this submitter
17:57:50.791 condor_negotiator[51]: 0 seconds so far for this schedd
17:57:50.791 condor_negotiator[51]: Request 00049.00000: autocluster -1 (request count 1 of 0)
17:57:50.791 condor_negotiator[51]: Matched 49.0 DedicatedScheduler@parallel_schedd@submit.p7-htcondor <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6> preempting none <10.42.0.17:33779?addrs=10.42.0.17-33779&alias=p7-htcondorexecute-gb8z4.p7-htcondor&noUDP&sock=startd_99_cc5f> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
17:57:50.800 condor_startd[135]: slot1_1: New machine resource of type -1 allocated
17:57:50.801 condor_startd[135]: slot1_1: Request accepted.
17:57:50.802 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
17:57:50.804 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
17:57:50.805 condor_schedd[143]: Found 2 potential dedicated resources in 0 seconds
17:57:50.805 condor_schedd[143]: Adding submitter DedicatedScheduler@parallel_schedd@submit.p7-htcondor to the submitter map for default pool.
17:57:50.807 condor_negotiator[51]: Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
17:57:50.807 condor_negotiator[51]: Request 00049.00001: autocluster -1 (request count 1 of 0)
17:57:50.807 condor_negotiator[51]: Matched 49.1 DedicatedScheduler@parallel_schedd@submit.p7-htcondor <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6> preempting none <10.42.0.19:37885?addrs=10.42.0.19-37885&alias=p7-htcondorexecute-vxbqp.p7-htcondor&noUDP&sock=startd_87_913c> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
17:57:50.811 condor_startd[123]: slot1_1: New machine resource of type -1 allocated
17:57:50.819 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
17:57:50.819 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
17:57:50.818 condor_startd[135]: slot1_1: Remote owner is user20002@p7-htcondor
17:57:50.819 condor_startd[135]: slot1_1: State change: claiming protocol successful
17:57:50.819 condor_startd[135]: slot1_1: Changing state: Owner -> Claimed
17:57:50.819 condor_startd[123]: slot1_1: Request accepted.
17:57:50.826 condor_negotiator[51]: Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
17:57:50.826 condor_negotiator[51]: negotiateWithGroup resources used submitterAds length 0
17:57:50.827 condor_negotiator[51]: ---------- Finished Negotiation Cycle ----------
Maybe this is important, in the configuration files I have:
CLAIM_PARTITIONABLE_LEFTOVERS = FALSE
CONSUMPTION_POLICY = TRUE
Am I missing something? Is this a bug?
PS: after two hours of waiting I submitted another job and surprisingly both jobs ran pretty soon.
Best regards,
Stanislav V. Markevich