
Re: [HTCondor-users] gpu's and preemption



	Disclaimer: I am not a preemption expert.

> ideally what i'd like to happen is that 4 of usera's jobs are
> preempted for userb's.  just the fact that a user is asking for a gpu
> should be enough to preempt another person from a slot that isn't
	This sounds like a job for a (machine) RANK expression. 
(Something like RANK = RequestGPUs, probably.)  That will work just fine 
with pslots, except ...
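	A minimal sketch of that startd configuration (RequestGPUs is the standard job attribute; treating any GPU request as outranking a non-GPU job is an assumption about your policy):

```
# Machine (startd) RANK: jobs requesting GPUs rank higher than jobs
# that don't, so a GPU job can preempt a non-GPU job from the slot.
RANK = RequestGPUs
```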
> as extra credit, what happens when the box has 16 cores and 4 gpus,
> and userb comes along and asks for two cpus/one gpu per job, does it
> kick eight of usera's jobs off?
... it won't kick eight of usera's jobs off.  If you want to do that, 
you'll have to set ALLOW_PSLOT_PREEMPTION = TRUE in the configuration /of 
the negotiator/.  By itself, this doesn't allow priority-based preemption, 
but it will change the behavior of preemption with respect to pslots for 
your entire pool, so you'll want to be sure that no other preemption is 
taking place (or that you understand the consequences).
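	Concretely, that means adding the following to the configuration read by the negotiator (not the startds):

```
# Negotiator configuration only.  Allows the negotiator to combine
# multiple dynamic slots under a partitionable slot when preempting,
# so one GPU job can displace several smaller non-GPU jobs at once.
ALLOW_PSLOT_PREEMPTION = TRUE
```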
	Be aware, however, that the way HTCondor combines slots when doing 
pslot preemption does /not/ wait for the corresponding jobs to finish 
exiting before reassigning their resources, so HTCondor may overcommit in 
some cases (e.g., the non-GPU jobs take so long to vacate that the GPU job 
finishes transferring and starts before they finish).
	If you don't set ALLOW_PSLOT_PREEMPTION, undersized dynamic slots 
will be ignored.  If you'd rather not preempt at all, you can attempt to 
address the issue via draining, instead.
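	The draining approach would look something like the following (the machine name is hypothetical; -graceful is the default policy, which lets running jobs finish before their resources are reclaimed):

```
# Drain the startd so its dynamic slots empty out and the
# partitionable slot can be re-carved for the GPU job.
condor_drain -graceful gpu-node-01.example.edu
```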
	If the issue happens rarely enough for manual intervention to be 
reasonable, starting with 8.9.0, you'll be able to address the issue with 
the condor_now command (whose version of slot coalescing doesn't suffer 
from the overcommit issue described above).  I suppose you could try to 
script the tool, but that seems fraught with peril.
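	For reference, a condor_now invocation takes the job to start now followed by the jobs whose slots should be vacated and coalesced (all job IDs below are hypothetical):

```
# Vacate usera's jobs 4.2 and 4.3, combine their dynamic slots, and
# start userb's GPU job 17.0 on the resulting slot.
condor_now 17.0 4.2 4.3
```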
- ToddM