Hi. I'm reposting this message from a couple of weeks ago to see if anyone
can help me understand how eviction calculations are done in the presence of
dynamic provisioning. I continue to see jobs that meet my preemption
requirements not being evicted, so users are getting starved. Since slots are
dynamic, it's not even clear that eviction means anything, because slots get
destroyed and recreated. Is that right? If so, I'm guessing there's another
mechanism for balancing users out.
Thanks.
Greg Langmead | Senior Research Scientist | SDL Language Weaver | (t) +1 310
437 7300
On 9/22/10 3:58 PM, "Greg Langmead"<glangmead@xxxxxxxxxxxxxxxxxx> wrote:
We have a pool of about 1000 CPUs (Fedora Core 12) managed by Condor 7.4.2
with dynamic provisioning. The queue is busy today and I don't like what I'm
seeing. I have three users, each with a long backlog of idle jobs:
- user1 has lots of jobs with request_memory = 8000, request_cpus = 1, and
priority 112
- user2 has lots of jobs with request_memory = 6000, request_cpus = 1, and
priority 5
- user3 has lots of jobs with request_memory = 4000, request_cpus = 1, and
priority 32
Most machines have 32G of RAM with 16 actual cores (but one partitionable
Condor slot). My eviction settings (honed over 3 years of usage in the static
provisioning environment) are:
PREEMPTION_REQUIREMENTS = ( (CurrentTime - EnteredCurrentState) > (6 * (60 * 60)) ) && ( RemoteUserPrio > (SubmitterPrio * 1.2) )
i.e., let jobs run for 6 hours, after which they can be evicted by a user with
better priority by a factor of 1.2.
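To make the intent of that expression concrete, here is a minimal Python
sketch of the same test. It is only a restatement of the expression, not how
Condor evaluates it; the function name and arguments are made up for
illustration, and it assumes (as in the priorities listed above) that a larger
priority number means a worse priority.

```python
SIX_HOURS = 6 * 60 * 60  # seconds, mirrors (6 * (60 * 60)) in the expression


def preemption_allowed(current_time, entered_current_state,
                       remote_user_prio, submitter_prio):
    """Hypothetical restatement of the PREEMPTION_REQUIREMENTS expression.

    remote_user_prio: priority value of the user whose job is running.
    submitter_prio:   priority value of the user who wants the slot.
    Assumption: larger values mean worse priority.
    """
    # The running job must have been in its current state for over 6 hours.
    ran_long_enough = (current_time - entered_current_state) > SIX_HOURS
    # The running user's priority value must exceed 1.2x the candidate's.
    sufficiently_better = remote_user_prio > submitter_prio * 1.2
    return ran_long_enough and sufficiently_better


if __name__ == "__main__":
    eight_hours = 8 * 60 * 60
    # user1 (priority 112) has run for 8 hours; user2 (priority 5) wants in.
    print(preemption_allowed(eight_hours, 0, 112, 5))
    # The same pair, but the job has only run for 5 hours.
    print(preemption_allowed(5 * 60 * 60, 0, 112, 5))
```

By this reading, a job from user1 that has run for 8 hours satisfies the
expression against a job from user2, which is why I expected it to be evicted.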
Here's what I'm observing:
- the user with request_memory = 8000 is getting jobs served every negotiation
cycle
- the user with request_memory = 6000 is starved out
- the user with request_memory = 4000 is getting jobs served every negotiation
cycle
Moreover, the 8G jobs seem to run indefinitely and are never evicted. Many
have run for over 8 hours.
So my question is, how is eviction done with dynamic provisioning? Is a 6G job
even compared to an 8G one to see whether it might preempt it? Also, when a
new negotiation cycle starts, how could an 8G job from a user with terrible
priority get run when a 6G job from a user with better priority does not? I
can understand why a 4G job might slip in when a 6G one doesn't fit, so it's
the 8G versus 6G competition that is not working.
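The memory arithmetic behind that last point can be sketched in a few lines
of Python. This models only memory on a single machine and nothing about how
the negotiator actually picks jobs; the helper names are hypothetical, and the
32000 MB total is a round number I'm assuming for a "32G" machine.

```python
def free_memory(total_mb, running_requests_mb):
    """Memory left on a machine after the running jobs' requests are carved out."""
    return total_mb - sum(running_requests_mb)


def fits(request_memory_mb, free_mb):
    """True if a job's request_memory fits in the remaining memory."""
    return request_memory_mb <= free_mb


if __name__ == "__main__":
    total = 32000  # assumed round figure for a 32G machine

    # Three 8G jobs and one 4G job running: only a 4G hole remains.
    hole = free_memory(total, [8000, 8000, 8000, 4000])
    for request in (4000, 6000, 8000):
        print(request, fits(request, hole))

    # One 8G job exits: any hole big enough for 8G is also big enough for 6G.
    hole = free_memory(total, [8000, 8000, 4000])
    for request in (4000, 6000, 8000):
        print(request, fits(request, hole))
```

On memory alone, any free space that can hold an 8G job can also hold a 6G
job, so fit by itself does not explain the 8G jobs winning.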
Many thanks,
Greg