
Re: [HTCondor-users] Partitionable Slots



On 6/19/2014 11:53 AM, Douglas Thain wrote:
Howdy -

We have been using partitionable slots to run multi-core jobs for the
last few months.  We are set up with a single partitionable slot
and no static slots, divided by CPU.  Our users are submitting a mix
of jobs, using request_cpus to select the desired slot size.

When initially turned on, it works.  Slots get created for the exact
size of each job, so that, for example, a two-core job is matched to a
two-core slot.  However, after a while, jobs begin to be matched to
slots that are too big.  For example, we see lots of one-cpu jobs
running in 4-cpu slots.

How do we fix this so that jobs only run in slots of the appropriate size?

(Based on some previous discussions, we set CLAIM_WORKLIFE=0, so as to
force claims to expire at the end of each job, thus causing them to be
returned to the parent partitionable slot.  But, that doesn't seem to
be happening.)

The relevant configuration is:

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = true
CLAIM_WORKLIFE = 0

Any suggestions?


Hi Doug -

I tried the above config on my Windows 7 laptop using the v8.2.0 release candidate and everything worked as expected: with CLAIM_WORKLIFE = 0 the claim expired at the end of the job, and thus the dynamic slot was returned to the parent partitionable slot. When I commented out the CLAIM_WORKLIFE = 0 line, the dynamic slot was reused instead.
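(For what it's worth, an easy way to watch this on your own pool is to list the slots on an execute node before and after a job finishes. A rough sketch, where the machine name is just a placeholder for one of your execute nodes:

  condor_status -constraint 'Machine == "exec01.example.edu"' -af Name SlotType Cpus State

With CLAIM_WORKLIFE = 0 the dynamic slot (slot1_1, slot1_2, ...) should disappear shortly after the job exits and its cpus fold back into the partitionable slot1; without it, the dynamic slot should stay Claimed until the claim is eventually released.)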

Besides the CLAIM_WORKLIFE = 0 trick (note that is a condor_startd knob, btw; perhaps you only set it on your central manager rather than on the execute nodes), you could also put the following into your job requirements expression:

  requirements = DynamicSlot =!= True || Cpus =?= RequestCpus

The above will allow the job to match any static or partitionable slot, but it will only reuse a dynamic slot that has exactly the number of CPUs the job requested. I like this better than the CLAIM_WORKLIFE workaround because you still get the advantages of reusing claims.
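In a submit description file it would look roughly like this (the executable name and cpu count here are just made-up placeholders):

  universe     = vanilla
  executable   = my_analysis.sh
  request_cpus = 2
  requirements = DynamicSlot =!= True || Cpus =?= RequestCpus
  queue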

Of course, you could inject the above into all jobs via APPEND_REQUIREMENTS in the config file, or you could opt to put the above constraint into the startd START expression to enforce this "exact CPU fit" policy on the startd side.
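Roughly, those two variants would look like the following in the config (a sketch, not something I tested; the second one assumes your config already defines START somewhere, as the default config does):

  # On the submit side: condor_submit appends this to every job's requirements
  APPEND_REQUIREMENTS = (DynamicSlot =!= True || Cpus =?= RequestCpus)

  # Or on the execute machines: the startd only accepts a job into a dynamic
  # slot whose Cpus exactly matches the job's RequestCpus
  START = ($(START)) && (DynamicSlot =!= True || Cpus =?= RequestCpus)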

Hope the above helps,
Todd



Doug

P.S. We have a neat little display that shows slot size based on CPU:
http://condor.cse.nd.edu/condor_matrix.cgi



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685