We have seen a problem very similar to #1 at Fermilab. It was reported a few months ago and a fix is in the works. What Christoph said about enabling a consumption policy will work as long as your pool is not too big. They are also going to refactor the matching code for the case where consumption policies are not in place. (We do not use consumption policies.) We don't run preemption, so we have no experience with #2.
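For reference, a consumption policy on the partitionable slot is enabled on the startd side with knobs roughly like the following (a sketch only, since we do not run it ourselves; check the manual for your HTCondor version):

    # Startd side (sketch): let the negotiator hand out several dynamic
    # slots from the partitionable slot in a single negotiation cycle.
    SLOT_TYPE_1_CONSUMPTION_POLICY = True
    # Example consumption expressions: how much of each resource one match
    # uses up. Adjust these to your pool's policy.
    SLOT_TYPE_1_CONSUMPTION_CPUS = quantize(target.RequestCpus, {1})
    SLOT_TYPE_1_CONSUMPTION_MEMORY = quantize(target.RequestMemory, {128})
    SLOT_TYPE_1_CONSUMPTION_DISK = quantize(target.RequestDisk, {1024})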
Steve
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of "Geonmo Ryu" <geonmo@xxxxxxxxxxx>
Sent: Tuesday, September 4, 2018 11:25:51 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Group Quota and dynamic Slot

Hello, HTCondor users.
We are having trouble setting up our HTCondor system.
We want to configure HTCondor with a group management policy that guarantees a minimum number of slots for each group.
With static slots, these settings seemed to work.
However, they did not work well with dynamic (partitionable) slots.
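For context, a job only counts against one of these group quotas when it declares the group, e.g. with accounting_group in its submit description file. An illustrative snippet (the group and user names are just examples):

    # Submit description file (illustration): charge this job to one of
    # the quota groups defined on the central manager.
    accounting_group      = group_cms
    accounting_group_user = geonmo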
The problems are as follows.

1. When only a few jobs are in the queue, they do not run at all.
For example, with a small job queue, condor_q -better-analyze <jobID> looks like the output below: three machines are matched, but none of them is available to run the job.

DiskUsage = 1
ImageSize = 1
RequestDisk = DiskUsage
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,( ImageSize + 1023 ) / 1024)

The Requirements expression for job 63.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           3  HasSingularity == true
[1]           3  TARGET.Arch == "X86_64"
[3]           3  TARGET.OpSys == "LINUX"
[5]           3  TARGET.Disk >= RequestDisk
[7]           3  TARGET.Memory >= RequestMemory
[9]           3  TARGET.HasFileTransfer

No successful match recorded.
Last failed match: Wed Sep 5 13:07:36 2018
Reason for last match failure: no match found

063.000:  Run analysis summary ignoring user priority.  Of 3 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
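A quick way to see what the partitionable slots are actually advertising at that moment is condor_status with autoformat output, for example (a sketch; attribute names as in recent HTCondor releases):

    # Free resources currently advertised by each partitionable slot.
    condor_status -constraint 'PartitionableSlot =?= True' \
                  -af:h Name Cpus Memory Disk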
2. When a group's jobs are already running and that group is using more than its guaranteed minimum number of slots, we want another group to be able to take those slots away from the running jobs. However, no preemption happens.
The settings we have set are as follows.

######## Central Manager ###########
~~~~ Auth. ~~~~
NEGOTIATOR_INTERVAL = 20
TRUST_UID_DOMAIN = TRUE
START = TRUE
SUSPEND = FALSE
PREEMPT = TRUE
KILL = FALSE
REQUIRE_LOCAL_CONFIG_FILE = False
GROUP_NAMES = group_alice, group_cms
GROUP_QUOTA_group_alice = 84
GROUP_QUOTA_group_cms = 84
GROUP_ACCEPT_SURPLUS = true
NEGOTIATOR_CONSIDER_PREEMPTION = True
PREEMPTION_REQUIREMENTS = True
PREEMPTION_REQUIREMENTS = $(PREEMPTION_REQUIREMENTS) && (((SubmitterGroupResourcesInUse < SubmitterGroupQuota) && (RemoteGroupResourcesInUse > RemoteGroupQuota)) || (SubmitterGroup =?= RemoteGroup))
MAXJOBRETIREMENTTIME = 0
NEGOTIATOR_CONSIDER_EARLY_PREEMPTION = True
NEGOTIATOR_UPDATE_INTERVAL = 60
PREEMPTION_RANK = 2592000 - ifThenElse(isUndefined(TotalJobRuntime),0,TotalJobRuntime)
NEGOTIATOR_POST_JOB_RANK = 1
NEGOTIATOR_PRE_JOB_RANK = 1
PREEMPTION_RANK_STABLE = False
ALLOW_PSLOT_PREEMPTION = True
#DAGMAN_PENDING_REPORT_INTERVAL = 20
DEFRAG_INTERVAL = 60
DEFRAG_UPDATE_INTERVAL = 30
#########################

####### Startd #########
NEGOTIATOR_INTERVAL = 20
TRUST_UID_DOMAIN = TRUE
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
REQUIRE_LOCAL_CONFIG_FILE = False
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = true
SINGULARITY_JOB = !isUndefined(TARGET.SingularityImage)
SINGULARITY_IMAGE_EXPR = TARGET.SingularityImage
SINGULARITY_TARGET_DIR = /srv
MOUNT_UNDER_SCRATCH = /tmp, /var/tmp
SINGULARITY_BIND_EXPR = TARGET.SingularityBind
UPDATE_INTERVAL = 10
#MAXJOBRETIREMENTTIME=5
##################
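To sanity-check that the negotiator really sees these quotas and that usage is being attributed to the groups, commands along these lines can help (a sketch; the exact output differs between versions):

    # Ask the running negotiator for its group-quota settings.
    condor_config_val -negotiator GROUP_NAMES GROUP_QUOTA_group_cms GROUP_QUOTA_group_alice

    # Show per-group / per-submitter usage and priorities from the accountant.
    condor_userprio -allusers -all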
Have you seen or set up a similar configuration? We would welcome any comments or suggestions. Thank you.
Regards,