Re: [HTCondor-users] condor_defrag only some machines?
- Date: Thu, 12 Jan 2017 20:16:27 +0000
- From: Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] condor_defrag only some machines?
Mike,
ExpectedMachineGracefulDrainingBadput is an estimate of how much work will be lost when the jobs running on that machine are evicted. When a machine is drained gracefully, each job is first allowed to keep running for up to its MaxJobRetirementTime (zero by default); it is then sent a soft-kill signal (TERM by default) and given up to MachineMaxVacateTime to shut down before being kill -9'd.
A machine with 10 jobs that have each accumulated 30 minutes of runtime will, if evicted, have a minimum of 300 minutes of badput, while a machine with 1 job with 60 minutes of runtime will have 60 minutes of badput. Because DEFRAG_RANK is the negated badput, the second machine ranks higher and will be chosen for draining ahead of the first.
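To make that comparison concrete, here is a rough Python sketch. It is a simplified illustration, not HTCondor's actual calculation: it assumes badput is just the sum of the runtimes already accumulated by the jobs on a machine (in minutes), and the machine names and helper functions are made up for the example.

```python
# Simplified model of DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput.
# Assumption: badput is the total runtime (in minutes) that the jobs on a
# machine would lose if they were all evicted right now.

def expected_badput(job_runtimes_min):
    """Total runtime lost if every job on the machine is evicted."""
    return sum(job_runtimes_min)

def defrag_rank(job_runtimes_min):
    """Higher rank is preferred, so less badput ranks higher."""
    return -expected_badput(job_runtimes_min)

# Hypothetical machines, each with a list of job runtimes in minutes.
machines = {
    "node-a": [30] * 10,  # 10 jobs, 30 minutes each
    "node-b": [60],       # 1 job, 60 minutes
}

# Order the candidates from most to least preferred for draining.
drain_order = sorted(machines, key=lambda m: defrag_rank(machines[m]), reverse=True)

for name in drain_order:
    print(name, expected_badput(machines[name]), defrag_rank(machines[name]))
```

In this toy model node-b (60 minutes of badput) sorts ahead of node-a (300 minutes), which matches the ordering described above.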
Have you taken a look at pslot preemption? I wonder if that might be more useful for your situation than defragmenting. It seems like that might give you more control over when a whole-machine job can evict the single-core jobs, and avoid any draining at all if there are no whole-machine jobs waiting to run.
Also, make sure that you're doing a depth-first fill of the machines for the single-core jobs, which may give the whole-machine jobs a better fighting chance; and make sure your job_lease_duration is set to something reasonable - the default is 40 minutes, but I usually use 20 (it depends on the characteristics of your jobs).
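As a rough illustration of why depth-first filling helps, the following Python sketch compares the two placement strategies for single-core jobs. It is a toy model, not the negotiator's real logic: it assumes identical, initially empty machines, and the function names are invented for the example.

```python
# Toy comparison of depth-first and breadth-first placement of single-core jobs.
# Assumption: every machine has the same number of cores and starts empty.

def place_jobs(num_jobs, num_machines, cores_per_machine, depth_first):
    """Return the number of busy cores on each machine after placement."""
    busy = [0] * num_machines
    for _ in range(num_jobs):
        candidates = [i for i in range(num_machines) if busy[i] < cores_per_machine]
        if not candidates:
            break  # the pool is full
        if depth_first:
            # Prefer the fullest machine that still has a free core.
            target = max(candidates, key=lambda i: busy[i])
        else:
            # Prefer the emptiest machine.
            target = min(candidates, key=lambda i: busy[i])
        busy[target] += 1
    return busy

def idle_machines(busy):
    """Count machines with no jobs at all, i.e. free for a whole-machine job."""
    return sum(1 for b in busy if b == 0)

depth = place_jobs(8, 4, 8, depth_first=True)
breadth = place_jobs(8, 4, 8, depth_first=False)
print("depth-first:  ", depth, "idle machines:", idle_machines(depth))
print("breadth-first:", breadth, "idle machines:", idle_machines(breadth))
```

With eight single-core jobs on four 8-core machines, the depth-first placement packs everything onto one machine and leaves three machines completely free, while the breadth-first placement puts two jobs on every machine and leaves none free for a whole-machine job.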
-Michael Pelletier.
-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Di Domenico
Sent: Thursday, January 12, 2017 1:22 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_defrag only some machines?
having let the pool run for a while longer, it does appear to have pulled in some of the nodes that originally weren't.
so i guess what this really boils down to is that I don't understand what
DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput
really means as it relates to the current state of my pool
I can see ExpectedMachineGracefulDrainingBadput is a ClassAd attribute attached to each of the machines in my pool, which represents a calculated number, but i don't fully understand it
i see the explanation in the manual, but it's still not clear. does anyone have a pointer to something that might make it clearer how this is actually choosing machines to set to draining state?