Hi condor team,
I'm trying to configure a pool with group accounting and partitionable slots, and to account for Memory usage as well as CPUs. To achieve that, I'm trying to use a SLOT_WEIGHT based on Memory rather than Cpus.
Is SLOT_WEIGHT based on Memory supported?
When I change SLOT_WEIGHT to an expression using Memory, Condor no longer fully utilizes my partitionable slots. Worse, depending on the SLOT_WEIGHT, a submission may never be able to complete all of its jobs.
It appears that the scheduler calculates the weight of the submission locally, in a way that is unrelated to SLOT_TYPE_X_SLOT_WEIGHT.
(For now, I can mostly work around this on the submitter side by scaling RequestCpus based on RequestMemory. But it seems like slot_weight should work and I just have something misconfigured.)
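A minimal sketch of that workaround (the numbers are illustrative, matched to the 178 MB weight quantum used in the config below; the actual scaling is done by the script that generates the submit file):

# Client-side workaround sketch: let request_cpus track request_memory,
# so the default Cpus-based weight behaves like a Memory-based one.
mem_mb = 512
cpus_per_job = 3                 # ceil(512 / 178), computed by the submit wrapper
request_memory = $(mem_mb)
request_cpus = $(cpus_per_job)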
Ben
Here's an example:
Cluster:
1 execute node with 80 pslots (79 for SLOT_TYPE_2 + 1 for SLOT_TYPE_1 = 8 detected_cpus * 10) and 14GB Memory
SLOT_TYPE_2_SLOT_WEIGHT = floor( Memory / 178 )
Submit File:
$ cat group_b.normal.sub
Universe = vanilla
Executable = /bin/sleep
Arguments = 60
Notification = Never
Should_transfer_files = IF_NEEDED
When_to_transfer_output = ON_EXIT
accounting_group = group_b.normal
RequestMemory = 512
Queue 10
Expected behavior:
All jobs should run in a single negotiator cycle, since 10 jobs use only 5GB of 14GB Memory and 5 of 79 pslots.
Actual behavior:
Only 3 jobs will run concurrently and at the end 2 jobs will be unable to match.
Negotiator Log for the first Match:
01/06/17 23:08:23 Phase 4.1:  Negotiating with schedds ...
01/06/17 23:08:23     numSlots = 2
01/06/17 23:08:23     slotWeightTotal = 10.000000
01/06/17 23:08:23     pieLeft = 10.000
01/06/17 23:08:23     NormalFactor = 1.000000
01/06/17 23:08:23     MaxPrioValue = 500.000000
01/06/17 23:08:23     NumSubmitterAds = 1
...
01/06/17 23:08:23 Match completed, match cost= 3
So I can only run 3 jobs concurrently, even though there is actually plenty of room for all 10 jobs.
It seems like slotWeightTotal should have been 30.000 for this submission.
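If I'm reading the consumption and weight expressions correctly, the arithmetic for this submission works out to:

quantize(512, {178})  = 534 MB consumed per job
floor(534 / 178)      = 3  weight units per job (this matches the logged match cost of 3)
3 * 10 jobs           = 30 total weight for the submission
pieLeft = 10          => only 3 jobs fit before the pie is exhausted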
I've configured my execute nodes as follows:
# Make each core 1/10th of a real core to simulate a larger node
NUM_CPUS = $(DETECTED_CPUS) * 10
CONSUMPTION_POLICY = TRUE
CLAIM_PARTITIONABLE_LEFTOVERS = False
# Reserved slot
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=1 memory=256 disk=2%
SLOT_TYPE_1_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 = cpus=auto
SLOT_TYPE_2_PARTITIONABLE = TRUE
SLOT_TYPE_2_CONSUMPTION_CPUS = quantize(target.RequestCpus, {1})
SLOT_TYPE_2_CONSUMPTION_MEMORY = quantize(target.RequestMemory, {178})
SLOT_TYPE_2_CONSUMPTION_DISK = quantize(target.RequestDisk, {1024})
# Account for the jobs' memory when considering resources
# consumed as part of the accounting group.
# (weight = 178 because SLOT_TYPE_2 has 14GB RAM and 79 pslots)
SLOT_TYPE_2_SLOT_WEIGHT = floor( Memory / 178 )
NUM_CLAIMS = $(NUM_CPUS)
SLOT_TYPE_2_NUM_CLAIMS = $(NUM_CLAIMS)
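For reference, this is how I've been checking what the startd actually advertises for each slot (assuming the evaluated SLOT_WEIGHT is published as the SlotWeight machine-ad attribute):

condor_status -af Name Cpus Memory SlotWeight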
Negotiator config:
#########################
#########################
GROUP_NAMES = group_a, group_a.production, group_a.nonprod, group_b, group_b.regression, group_b.normal, group_b.nightly
GROUP_ACCEPT_SURPLUS = TRUE
GROUP_AUTOREGROUP = FALSE
PRIORITY_HALFLIFE = 1.0e100
GROUP_QUOTA_DYNAMIC_group_a = 0.2
GROUP_QUOTA_DYNAMIC_group_a.production = 0.25
GROUP_ACCEPT_SURPLUS_group_a.production = TRUE
GROUP_QUOTA_DYNAMIC_group_a.nonprod = 0.75
GROUP_ACCEPT_SURPLUS_group_a.nonprod = FALSE
#GROUP_QUOTA_group_b = quantize(.60 * $(NUM_POOL_SLOTS), 1)
GROUP_QUOTA_DYNAMIC_group_b = 0.8
GROUP_QUOTA_DYNAMIC_group_b.regression = .6
GROUP_ACCEPT_SURPLUS_group_b.regression = TRUE
GROUP_QUOTA_DYNAMIC_group_b.normal = .2
GROUP_ACCEPT_SURPLUS_group_b.normal = TRUE
GROUP_QUOTA_DYNAMIC_group_b.nightly = .2
GROUP_ACCEPT_SURPLUS_group_b.nightly = FALSE
--
Ben Watrous
Cycle Computing
Better Answers. Faster.
twitter: @cyclecomputing