
[HTCondor-users] Is ExpectedMachineGracefulDrainingBadput the sum of subslots and related defrag questions



Hi again,

We are still testing a "sensible" condor_defrag approach and would like to get some feedback.

Right now, we don't run condor_defrag, as most of our jobs give us no handle on how long they will run or whether they perform internal checkpointing. On the other hand, we are starting to accumulate idle jobs that require the maximum possible number of CPU cores. Thus, we want to add condor_defrag somewhat safely to our machines with large CPU core counts.

The current plan would be for users to declare up to two extra attributes in their submit files:

# short-running job and/or internal checkpointing
+KillableJob = true/false
# estimated maximum run time in hours
+ExpectedRuntimeHours = num
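
For concreteness, a submit file for a long-running, non-checkpointing job under this scheme might look like the following sketch (the executable name and the 48 hour estimate are just placeholders):

# sketch: long-running job without internal checkpointing
executable            = my_long_job.sh
request_cpus          = 1
+KillableJob          = False
+ExpectedRuntimeHours = 48
queue

A short or internally checkpointing job would simply set +KillableJob = True and could omit the runtime estimate.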

On the startd side, we would set the following for the majority of nodes:

# this ought to work even if ExpectedRuntimeHours were undefined, right?
START = $(START) && (KillableJob =?= true || ExpectedRuntimeHours <= 6)
MaxJobRetirementTime = 6 * 3600
MachineMaxVacateTime = 150
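
Our reading of the undefined case (which is exactly what we would like to have confirmed) is that for a job setting neither attribute, the extra clause evaluates as

(KillableJob =?= true || ExpectedRuntimeHours <= 6)
  -> (false || undefined)
  -> undefined

so START would not be satisfied and such jobs should stay off these machines.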

For a much smaller share of the pool, we would instead set:
START = true
MaxJobRetirementTime = min( { ifthenelse(isUndefined(ExpectedRuntimeHours), 24, ExpectedRuntimeHours), 24*14 } ) * 3600
MachineMaxVacateTime = 150
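
As a quick worked example of that retirement expression (using a hypothetical job with ExpectedRuntimeHours = 48):

min( { ifthenelse(isUndefined(48), 24, 48), 24*14 } ) * 3600
  = min( { 48, 336 } ) * 3600
  = 172800 s, i.e. 2 days

A job without ExpectedRuntimeHours would fall back to 24 * 3600 = 86400 s, with 14 days as the upper cap either way.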

This should effectively steer longer-running jobs to those few machines, while allowing shorter-running jobs without internal checkpointing, as well as checkpointable/killable jobs, to run everywhere.

On the defrag side, we would set something like
DAEMON_LIST = $(DAEMON_LIST) DEFRAG
DEFRAG_INTERVAL = 1800
DEFRAG_DRAINING_MACHINES_PER_HOUR = 6.0
DEFRAG_MAX_CONCURRENT_DRAINING = 10
DEFRAG_REQUIREMENTS = PartitionableSlot && Offline =!= True && TotalCpus >= 50
DEFRAG_DRAINING_START_EXPR = (KillableJob =?= true)
DEFRAG_UPDATE_INTERVAL = 300
DEFRAG_WHOLE_MACHINE_EXPR = PartitionableSlot && Cpus == TotalSlotCpus && Offline =!= True && TotalCpus >= 50
DEFRAG_SCHEDULE = graceful
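
For reference, assuming condor_defrag spreads the hourly rate over its polling cycles, these numbers should amount to at most 6.0 * 0.5 h = 3 newly initiated drains per 1800 s DEFRAG_INTERVAL, with never more than 10 machines draining concurrently due to DEFRAG_MAX_CONCURRENT_DRAINING.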

This should still allow "killable" jobs to be scheduled on the draining machines and prevent too much badput while waiting for those machines to finish draining.

What we are not really certain about is whether the default

DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput

takes all subslots into account (by summing them?), or whether the manual's "slot" really means the highest-ranking subslot - we only run a single partitionable slot per host machine.

Also, while I expect DEFRAG_RANK to mostly steer condor_defrag to the machines with the lower MaxJobRetirementTime, should we worry about DEFRAG_MAX_CONCURRENT_DRAINING = 10 if we have many more than 10 machines of the second kind defined? If so, any idea which handle we should use to ensure a good turn-around time?

Testing on a two-node set-up has looked good so far, but I wanted to gather some feedback here to make sure this makes sense and that we have not missed something vital/obvious.

Have we?

Cheers and thanks a lot in advance!

Carsten

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185

