Hi,
I am looking for ideas on why two nodes have trouble accepting/starting
jobs. Both nodes were spawned recently and are defined with
DEV_RESOURCE = true
- a setting that no other node in the cluster matches.
But none of my jobs with 'DEV_RESOURCE' as their only requirement start,
although the job's request [1.a] and the nodes' resources [1.b] match
'in principle': the analysis finds both a matching slot and a matching
job.
The nominal share of the group (aka OTHER) should be sufficient, and no
other user's or group's jobs match the nodes' resources, i.e., the nodes
are idling.
The negotiator rejects the jobs because it cannot find a match [2],
although I am convinced that it should find one (comparing the nodes'
ads with the request, it should(?) fit). The same result ends up at the
scheduler [3].
Maybe somebody has a hint for me as to why the matchmaking might be
failing here?
Cheers,
Thomas
[1.a]
> condor_q -better-analyze 55.0
-- Schedd: grid-vm08.desy.de : <131.169.223.234:9620?...
The Requirements expression for job 55.000 is
(TARGET.DEV_RESOURCE) && (TARGET.Arch == "X86_64") &&
(TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) &&
(TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)
Job 55.000 defines the following attributes:
DiskUsage = 3
RequestDisk = DiskUsage
RequestMemory = 2500
The Requirements expression for job 55.000 reduces to these conditions:
         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           2  TARGET.DEV_RESOURCE
No successful match recorded.
Last failed match: Wed Apr 11 11:39:52 2018
Reason for last match failure: no match found
055.000: Run analysis summary ignoring user priority. Of 353 machines,
351 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
2 are able to run your job
[1.b]
> condor_q -better-analyze 55.0 -reverse -machine wn12-test.desy.de
-- Schedd: grid-vm08.desy.de : <131.169.223.234:9620?...
-- Slot: slot1@xxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters
The Requirements expression for this slot is
(START) && (IsValidCheckpointPlatform) &&
(WithinResourceLimits)
START is
(NODE_IS_HEALTHY is true) &&
(StartJobs is true)
IsValidCheckpointPlatform is
(TARGET.JobUniverse isnt 1 ||
((MY.CheckpointPlatform isnt undefined) &&
((TARGET.LastCheckpointPlatform is MY.CheckpointPlatform) ||
(TARGET.NumCkpts == 0))))
WithinResourceLimits is
(ifThenElse(TARGET._condor_RequestCpus isnt undefined,
    MY.Cpus > 0 && TARGET._condor_RequestCpus <= MY.Cpus,
    ifThenElse(TARGET.RequestCpus isnt undefined,
        MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus,
        1 <= MY.Cpus)) &&
 ifThenElse(TARGET._condor_RequestMemory isnt undefined,
    MY.Memory > 0 && TARGET._condor_RequestMemory <= MY.Memory,
    ifThenElse(TARGET.RequestMemory isnt undefined,
        MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory,
        false)) &&
 ifThenElse(TARGET._condor_RequestDisk isnt undefined,
    MY.Disk > 0 && TARGET._condor_RequestDisk <= MY.Disk,
    ifThenElse(TARGET.RequestDisk isnt undefined,
        MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk,
        false)))
This slot defines the following attributes:
CheckpointPlatform = "LINUX X86_64 3.10.0-693.21.1.el7.x86_64 normal
N/A ssse3 sse4_1 sse4_2"
Cpus = 16
Disk = 68089928
Memory = 48124
NODE_IS_HEALTHY = true
StartJobs = true
Job 55.0 has the following attributes:
TARGET.JobUniverse = 5
TARGET.NumCkpts = 0
TARGET.RequestCpus = 1
TARGET.RequestDisk = 3
TARGET.RequestMemory = 2500
The Requirements expression for this slot reduces to these conditions:
        Clusters
Step    Matched  Condition
-----  --------  ---------
[3]           1  IsValidCheckpointPlatform
[5]           1  WithinResourceLimits
slot1@xxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
1 (100.00 %) match both slot and job requirements.
1 match the requirements of this slot.
1 have job requirements that match this slot.
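
The WithinResourceLimits check in [1.b] can be replayed by hand with the
numbers shown there. What follows is a minimal Python sketch of the
ifThenElse chain, not HTCondor code: the function names and the use of
None for "undefined" are illustrative assumptions, and the
_condor_Request* attributes are taken to be undefined because the
analysis output does not list them.

```python
# Hand evaluation of the slot's WithinResourceLimits expression from [1.b].
# None stands in for a ClassAd "undefined" value (an assumption of this sketch).

def fits(override, request, available, default):
    """Mirror one ifThenElse chain: prefer _condor_RequestX, then RequestX."""
    if override is not None:
        return available > 0 and override <= available
    if request is not None:
        return available > 0 and request <= available
    # Neither attribute is defined: fall back to the innermost branch.
    return default(available)


def within_resource_limits(slot, job):
    """Combine the Cpus, Memory and Disk chains with a logical AND."""
    cpus = fits(job.get("_condor_RequestCpus"), job.get("RequestCpus"),
                slot["Cpus"], lambda avail: 1 <= avail)
    memory = fits(job.get("_condor_RequestMemory"), job.get("RequestMemory"),
                  slot["Memory"], lambda avail: False)
    disk = fits(job.get("_condor_RequestDisk"), job.get("RequestDisk"),
                slot["Disk"], lambda avail: False)
    return cpus and memory and disk


# Values copied from the analysis output in [1.b].
slot = {"Cpus": 16, "Memory": 48124, "Disk": 68089928}
job = {"RequestCpus": 1, "RequestMemory": 2500, "RequestDisk": 3}

print(within_resource_limits(slot, job))  # True
```

With these values the slot-side resource check passes, which agrees with
the "1 match the requirements of this slot" line above.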
[2]
> NegotiatorLog
...
04/11/18 12:18:44 ---------- Started Negotiation Cycle ----------
04/11/18 12:18:44 Phase 1: Obtaining ads from collector ...
04/11/18 12:18:44 Getting startd private ads ...
04/11/18 12:18:45 Getting Scheduler, Submitter and Machine ads ...
04/11/18 12:18:50 Sorting 11071 ads ...
04/11/18 12:18:50 Got ads: 11071 public and 11021 private
04/11/18 12:18:50 Public ads include 25 submitter, 11021 startd
04/11/18 12:18:51 Phase 2: Performing accounting ...
...
04/11/18 12:18:53 group quotas: WARNING: dynamic quota for group group_OPS rescaled from 0.9 to 0.321429
04/11/18 12:18:53 group quotas: WARNING: dynamic quota for group group_OTHER rescaled from 0.1 to 0.0357143
04/11/18 12:18:53 group quotas: allocation round 1
04/11/18 12:18:53 group quotas: groups= 9 requesting= 5 served= 5 unserved= 0 slots= 10911 requested= 25736 allocated= 25736 surplus= 3422 maxdelta= 6842
04/11/18 12:18:53 group quotas: entering RR iteration n= 6842
...
04/11/18 12:18:53 Group group_OPS - skipping, zero slots allocated
04/11/18 12:18:53 Group group_OTHER - BEGIN NEGOTIATION
04/11/18 12:18:53 Phase 3: Sorting submitter ads by priority ...
04/11/18 12:18:53 Phase 4.1: Negotiating with schedds ...
04/11/18 12:18:53 Negotiating with group_OTHER.other.grid@xxxxxxx at
<131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>
04/11/18 12:18:53 0 seconds so far for this submitter
04/11/18 12:18:53 0 seconds so far for this schedd
04/11/18 12:18:53 Got NO_MORE_JOBS; schedd has no more requests
04/11/18 12:18:53 Request 00055.00000: autocluster 3 (request count 1 of 1)
04/11/18 12:18:53 Rejected 55.0 group_OTHER.other.grid@xxxxxxx <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>: no match found
04/11/18 12:18:53 Request 00056.00000: autocluster 8 (request count 1 of 11)
04/11/18 12:18:53 Rejected 56.0 group_OTHER.other.grid@xxxxxxx <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>: no match found
04/11/18 12:18:53 Request 00059.00000: autocluster 9 (request count 1 of 1)
04/11/18 12:18:53 Rejected 59.0 group_OTHER.other.grid@xxxxxxx <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>: no match found
04/11/18 12:18:53 Got NO_MORE_JOBS; schedd has no more requests
04/11/18 12:18:53 Negotiating with group_OTHER.other.chbeyer@xxxxxxx at
<131.169.56.33:9620?addrs=131.169.56.33-9620+[--1]-9620&noUDP&sock=2496730_b07b_6>
04/11/18 12:18:53 0 seconds so far for this submitter
04/11/18 12:18:53 0 seconds so far for this schedd
04/11/18 12:18:53 Got NO_MORE_JOBS; schedd has no more requests
04/11/18 12:18:53 Request 00118.00000: autocluster 1 (request count 1 of 1)
04/11/18 12:18:53 Rejected 118.0 group_OTHER.other.chbeyer@xxxxxxx <131.169.56.33:9620?addrs=131.169.56.33-9620+[--1]-9620&noUDP&sock=2496730_b07b_6>: no match found
04/11/18 12:18:53 Got NO_MORE_JOBS; schedd has no more requests
04/11/18 12:18:53 negotiateWithGroup resources used scheddAds length 2
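
A note on the "rescaled" warnings in [2]: both logged pairs are
consistent with dividing the configured dynamic quota by the same factor
of about 2.8. One plausible reading, stated here as an assumption because
the log does not show it, is that the dynamic quotas of all sibling
groups add up to 2.8 and are normalized so that they sum to 1. The
arithmetic can be checked with a short Python sketch (the constant and
the function name are made up for illustration):

```python
# Check whether the "rescaled" values in the NegotiatorLog are consistent
# with a plain normalization. ASSUMED_QUOTA_SUM is inferred from the two
# logged pairs; it is not printed anywhere in the log.
ASSUMED_QUOTA_SUM = 2.8


def rescale(quota, quota_sum):
    """Normalize one dynamic quota by the sum over all sibling groups."""
    return quota / quota_sum


# (configured quota, rescaled quota) exactly as they appear in [2].
logged = {
    "group_OPS": (0.9, 0.321429),
    "group_OTHER": (0.1, 0.0357143),
}

for group, (configured, rescaled) in logged.items():
    computed = rescale(configured, ASSUMED_QUOTA_SUM)
    print(f"{group}: computed {computed:.6g}, logged {rescaled}")
```

If that reading is right, group_OTHER is entitled to roughly 3.6 % of
the pool rather than the nominal 10 %.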
[3]
> SchedLog
04/11/18 12:18:01 (pid:22384) Number of Active Workers 0
04/11/18 12:18:03 (pid:22384) Number of Active Workers 0
04/11/18 12:18:04 (pid:22384) Number of Active Workers 0
04/11/18 12:18:04 (pid:22384) Number of Active Workers 0
04/11/18 12:18:07 (pid:22384) Number of Active Workers 0
04/11/18 12:18:14 (pid:22384) Number of Active Workers 0
04/11/18 12:18:23 (pid:22384) Activity on stashed negotiator socket: <131.169.56.33:28841>
04/11/18 12:18:23 (pid:22384) Using negotiation protocol: NEGOTIATE
04/11/18 12:18:23 (pid:22384) Negotiating for owner: group_OTHER.other.grid@xxxxxxx
04/11/18 12:18:23 (pid:22384) Finished negotiating for group_OTHER.other.grid in local pool: 0 matched, 3 rejected
04/11/18 12:18:53 (pid:22384) Activity on stashed negotiator socket: <131.169.56.33:28841>
04/11/18 12:18:53 (pid:22384) Using negotiation protocol: NEGOTIATE
04/11/18 12:18:53 (pid:22384) Negotiating for owner: group_OTHER.other.grid@xxxxxxx
04/11/18 12:18:53 (pid:22384) Finished negotiating for group_OTHER.other.grid in local pool: 0 matched, 3 rejected
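
To see whether anything changes from one negotiation cycle to the next,
the per-cycle summary lines in [3] can be pulled out of the SchedLog with
a few lines of Python. This is a minimal sketch: the regular expression
is written only against the two summary lines quoted in [3], it assumes
each log entry sits on a single line (mail-wrapped lines re-joined), and
the function name is made up for illustration.

```python
import re

# Matches the schedd's per-cycle summary line as quoted in [3].
SUMMARY = re.compile(
    r"^(?P<stamp>\d\d/\d\d/\d\d \d\d:\d\d:\d\d) .*"
    r"Finished negotiating for (?P<owner>\S+) in local pool: "
    r"(?P<matched>\d+) matched, (?P<rejected>\d+) rejected"
)


def negotiation_summaries(lines):
    """Yield (timestamp, owner, matched, rejected) for each summary line."""
    for line in lines:
        m = SUMMARY.search(line)
        if m:
            yield (m["stamp"], m["owner"],
                   int(m["matched"]), int(m["rejected"]))


# Sample input: lines taken from [3], one log entry per string.
sample = [
    "04/11/18 12:18:14 (pid:22384) Number of Active Workers 0",
    "04/11/18 12:18:23 (pid:22384) Finished negotiating for "
    "group_OTHER.other.grid in local pool: 0 matched, 3 rejected",
    "04/11/18 12:18:53 (pid:22384) Finished negotiating for "
    "group_OTHER.other.grid in local pool: 0 matched, 3 rejected",
]

results = list(negotiation_summaries(sample))
for row in results:
    print(row)
```

For a real log, the same function can be fed an open file object instead
of the sample list.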