
[HTCondor-users] Is ExpectedMachineGracefulDrainingBadput the sum of subslots and related defrag questions



Hi again,

We are still testing a "sensible" condor_defrag approach and would like to get some feedback.

Right now, we don't run condor_defrag, as most of our jobs give us no handle on how long they will run or whether they perform internal checkpointing. On the other hand, we are starting to accumulate idle jobs that require the maximum possible number of CPU cores. Thus, we want to add condor_defrag somewhat safely to our machines with large CPU core counts.

The current plan would be for users to declare up to two extra attributes in their submit files:

# short-running job and/or internal checkpointing
+KillableJob = true/false
# estimated maximum run time in hours
+ExpectedRuntimeHours = num
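
For concreteness, a submit file for a long-running, non-checkpointing job under this scheme might look like the following sketch (the executable name and the 48 hour estimate are just placeholders):

# sketch: long-running job without internal checkpointing
executable            = my_long_job.sh
request_cpus          = 1
+KillableJob          = False
+ExpectedRuntimeHours = 48
queue

A short or internally checkpointing job would simply set +KillableJob = True and could omit the runtime estimate.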

On the startd side, we would set the following for the majority of nodes:

# this ought to work even if ExpectedRuntimeHours were undefined, right?
START = $(START) && (KillableJob =?= true || ExpectedRuntimeHours <= 6)
MaxJobRetirementTime = 6 * 3600
MachineMaxVacateTime = 150
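
Our reading of the undefined case (which is exactly what we would like to have confirmed) is that for a job setting neither attribute, the extra clause evaluates as

(KillableJob =?= true || ExpectedRuntimeHours <= 6)
  -> (false || undefined)
  -> undefined

so START would not be satisfied and such jobs should stay off these machines.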

For a much smaller share of the pool, we would instead set:
START = true
MaxJobRetirementTime = min( { ifthenelse(isUndefined(ExpectedRuntimeHours), 24, ExpectedRuntimeHours), 24*14 } ) * 3600
MachineMaxVacateTime = 150
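
As a quick worked example of that retirement expression (using a hypothetical job with ExpectedRuntimeHours = 48):

min( { ifthenelse(isUndefined(48), 24, 48), 24*14 } ) * 3600
  = min( { 48, 336 } ) * 3600
  = 172800 s, i.e. 2 days

A job without ExpectedRuntimeHours would fall back to 24 * 3600 = 86400 s, with 14 days as the upper cap either way.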

This should effectively steer longer-running jobs to those few machines, while allowing shorter-running jobs without internal checkpointing, as well as checkpointable/killable jobs, to run everywhere.

On the defrag side, we would set something like
DAEMON_LIST = $(DAEMON_LIST) DEFRAG
DEFRAG_INTERVAL = 1800
DEFRAG_DRAINING_MACHINES_PER_HOUR = 6.0
DEFRAG_MAX_CONCURRENT_DRAINING = 10
DEFRAG_REQUIREMENTS = PartitionableSlot && Offline =!= True && TotalCpus >= 50
DEFRAG_DRAINING_START_EXPR = (KillableJob =?= true)
DEFRAG_UPDATE_INTERVAL = 300
DEFRAG_WHOLE_MACHINE_EXPR = PartitionableSlot && Cpus == TotalSlotCpus && Offline =!= True && TotalCpus >= 50
DEFRAG_SCHEDULE = graceful
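
For reference, assuming condor_defrag spreads the hourly rate over its polling cycles, these numbers should amount to at most 6.0 * 0.5 h = 3 newly initiated drains per 1800 s DEFRAG_INTERVAL, with never more than 10 machines draining concurrently due to DEFRAG_MAX_CONCURRENT_DRAINING.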

This should still allow "killable" jobs to be scheduled on the draining machines and prevent too much badput while waiting for those machines to finish draining.

What we are not really certain about is whether the default

DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput

takes all subslots into account (by summing them?), or whether the manual's "slot" really means the highest-ranking subslot - we only run a single partitionable slot per host machine.

Also, while I expect DEFRAG_RANK to mostly steer condor_defrag to the machines with the lower MaxJobRetirementTime, should we worry about DEFRAG_MAX_CONCURRENT_DRAINING = 10 if we have many more than 10 machines of the second kind defined? If so, any idea which handle we should use to ensure a good turn-around time?

Testing on a two-node set-up has looked good so far, but I wanted to gather some feedback here to make sure this makes sense and that we have not missed something vital/obvious.

Have we?

Cheers and thanks a lot in advance!

Carsten

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185

