Hello, HTCondor users.
We are having trouble setting up our HTCondor pool.
We want to configure HTCondor with a group-quota policy that guarantees a minimum number of slots for each accounting group.
With static slots, the settings appeared to work as intended.
With partitionable (dynamic) slots, however, they do not.
The problems are as follows.
1. When only a few jobs are in the queue, they do not run at all.
For example, with a small queue, condor_q -better-analyze <jobID> reports the following.
All three machines match the job's requirements, yet none is reported as available to run it:
DiskUsage = 1
ImageSize = 1
RequestDisk = DiskUsage
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,( ImageSize + 1023 ) / 1024)
The Requirements expression for job 63.000 reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 3 HasSingularity == true
[1] 3 TARGET.Arch == "X86_64"
[3] 3 TARGET.OpSys == "LINUX"
[5] 3 TARGET.Disk >= RequestDisk
[7] 3 TARGET.Memory >= RequestMemory
[9] 3 TARGET.HasFileTransfer
No successful match recorded.
Last failed match: Wed Sep 5 13:07:36 2018
Reason for last match failure: no match found
063.000: Run analysis summary ignoring user priority. Of 3 machines,
0 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
0 are available to run your job
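Note that the summary above says "ignoring user priority": as we understand it, -better-analyze checks requirements against the whole partitionable slot, while the actual dynamic slot is only carved out at negotiation time, where group quotas also apply. A toy Python sketch of that two-step check (the class and function names are ours for illustration, not HTCondor's):

```python
# Toy model of partitionable-slot matching vs. dynamic-slot creation.
# These names are ours for illustration; real HTCondor negotiation is
# done via ClassAd matchmaking, not this code.

from dataclasses import dataclass

@dataclass
class PSlot:
    cpus: int        # unclaimed cpus left in the partitionable slot
    memory_mb: int   # unclaimed memory left (MB)

@dataclass
class Request:
    cpus: int
    memory_mb: int

def better_analyze_matches(pslot: PSlot, req: Request) -> bool:
    """-better-analyze style check: compares against the whole pslot,
    ignoring user priority and group accounting."""
    return pslot.cpus >= req.cpus and pslot.memory_mb >= req.memory_mb

def can_carve_dslot(pslot: PSlot, req: Request, group_slots_left: int) -> bool:
    """A dynamic slot is only created if the pslot still has the
    resources AND the accounting group has quota headroom left."""
    return better_analyze_matches(pslot, req) and group_slots_left > 0

pslot = PSlot(cpus=8, memory_mb=16000)
req = Request(cpus=1, memory_mb=1024)

print(better_analyze_matches(pslot, req))               # True: pslot "matches"
print(can_carve_dslot(pslot, req, group_slots_left=0))  # False: no quota headroom
```

This is only our mental model of why "3 matched" and "0 available" can coexist; we would appreciate a correction if it is wrong.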
2. When a group's jobs are already running and consume more than the group's guaranteed minimum number of slots, we want another group to be able to preempt and take those slots back. However, no preemption occurs.
Our configuration is as follows.
######## Central Manager ###########
# ~~~~ Auth. ~~~~
NEGOTIATOR_INTERVAL = 20
TRUST_UID_DOMAIN = TRUE
START = TRUE
SUSPEND = FALSE
PREEMPT = TRUE
KILL = FALSE
REQUIRE_LOCAL_CONFIG_FILE = False
GROUP_NAMES = group_alice, group_cms
GROUP_QUOTA_group_alice = 84
GROUP_QUOTA_group_cms = 84
GROUP_ACCEPT_SURPLUS = true
NEGOTIATOR_CONSIDER_PREEMPTION = True
PREEMPTION_REQUIREMENTS = True
PREEMPTION_REQUIREMENTS = $(PREEMPTION_REQUIREMENTS) && (((SubmitterGroupResourcesInUse < SubmitterGroupQuota) && (RemoteGroupResourcesInUse > RemoteGroupQuota)) || (SubmitterGroup =?= RemoteGroup))
MAXJOBRETIREMENTTIME = 0
NEGOTIATOR_CONSIDER_EARLY_PREEMPTION = True
NEGOTIATOR_UPDATE_INTERVAL = 60
PREEMPTION_RANK = 2592000 - ifThenElse(isUndefined(TotalJobRuntime),0,TotalJobRuntime)
NEGOTIATOR_POST_JOB_RANK = 1
NEGOTIATOR_PRE_JOB_RANK = 1
PREEMPTION_RANK_STABLE = False
ALLOW_PSLOT_PREEMPTION = True
#DAGMAN_PENDING_REPORT_INTERVAL = 20
DEFRAG_INTERVAL = 60
DEFRAG_UPDATE_INTERVAL = 30
#########################
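To make our intent with the PREEMPTION_REQUIREMENTS expression above explicit: preemption should be allowed only when the preempting group is under its quota and the preempted group is over its quota, or when both jobs belong to the same group. A plain-Python restatement of that boolean logic (argument names mirror the negotiator ClassAd attributes used in the expression):

```python
# Plain-Python restatement of our PREEMPTION_REQUIREMENTS expression,
# just to sanity-check the intended policy. This is not HTCondor code.

def preemption_allowed(submitter_group_in_use, submitter_group_quota,
                       remote_group_in_use, remote_group_quota,
                       submitter_group, remote_group):
    under_quota = submitter_group_in_use < submitter_group_quota
    over_quota = remote_group_in_use > remote_group_quota
    same_group = submitter_group == remote_group
    return (under_quota and over_quota) or same_group

# group_alice under its quota of 84, group_cms over it -> preemption expected
print(preemption_allowed(50, 84, 90, 84, "group_alice", "group_cms"))  # True
# both groups within quota -> no cross-group preemption
print(preemption_allowed(50, 84, 60, 84, "group_alice", "group_cms"))  # False
```

If this reading of the expression is correct, we would expect group_cms jobs to be preempted in case 2 above, which is what we are not seeing.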
####### Startd #########
NEGOTIATOR_INTERVAL = 20
TRUST_UID_DOMAIN = TRUE
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE
REQUIRE_LOCAL_CONFIG_FILE = False
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = true
SINGULARITY_JOB = !isUndefined(TARGET.SingularityImage)
SINGULARITY_IMAGE_EXPR = TARGET.SingularityImage
SINGULARITY_TARGET_DIR = /srv
MOUNT_UNDER_SCRATCH = /tmp, /var/tmp
SINGULARITY_BIND_EXPR=TARGET.SingularityBind
UPDATE_INTERVAL = 10
#MAXJOBRETIREMENTTIME=5
##################
Has anyone configured a similar setup?
We would welcome any comments or suggestions.
Thank you.
Regards,
--------------------------------------------------------------------------------------------------
Geonmo Ryu
Korea Institute of Science and Technology Information (KISTI)
Global Science Experimental Data Hub Center (GSDC)
245 Daehak-ro, Yuseong-gu, Daejeon, 305-806, Republic of Korea
Tel : +82-42-869-1639, +82-10-4337-9423
Mail : geonmo@xxxxxxxxxxx / ry840901@xxxxxxxxx
--------------------------------------------------------------------------------------------------