Re: [HTCondor-users] Question about defragmentation
- Date: Tue, 18 Nov 2025 17:19:08 -0600 (CST)
- From: Todd L Miller <tlmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Question about defragmentation
You've described the mechanism you'd like to use quite clearly,
and correctly observed that HTCondor doesn't support it directly.
However, I'm less clear on what plain-English policy you're attempting to
implement, without which it's difficult to suggest good solutions.
That's the most important part of this message. The rest of it
goes on -- for a while -- about my guess as to what you want, why the
pool isn't already behaving that way, the built-in solution HTCondor comes
with, why it can't be configured to operate how you'd expect, and a hack
that might make things work anyway.
It sounds like you may want the number of running multi-core jobs
to be directly proportional to the number of multi-core jobs in the queue
(up to a limit where at most half of the pool's cores are running
multi-core jobs).* That amounts to a strong bias in favor of multi-core
jobs; the more cores per job, the stronger the bias. In terms of
determining which submitter's jobs are matched next, that might mean
changing the default weight assigned to jobs so that it ignores core count.
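	If that's the policy you're after, the knob I'd look at first
(an assumption about your setup on my part) is SLOT_WEIGHT on the EPs,
which defaults to Cpus:
# Something like this, on the EPs; SLOT_WEIGHT defaults to Cpus.
SLOT_WEIGHT = 1
With the default, an 8-core job charges its submitter eight times as
much usage as a single-core job, which works against the bias described
above.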
Assuming an even (unweighted by core count) mix of jobs in the
queue, your pool "should" trend towards an even (weighted by core count)
mix of jobs running. In practice, it probably won't, because when a
multi-core job exits, there's a certain chance that the next job
examined from a submitter who won the "quota/priority dance" won't be a
multi-core job, but there's a much smaller chance that enough single-core
jobs will exit at the same time to provide enough cores for a multi-core
job. This is a problem, of course, because the multi-core slot, once
it's lost one core to run a single-core job, will then be used to run
seven more single-core jobs which won't all exit at the same time. Over
time, then, you would expect a pool that had been divided 50/50 between
single- and multi- core slots to become 100% single-core slots.
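	As a toy illustration of that drift (made-up numbers, nothing
HTCondor-specific): once every machine has lost even one core to a
single-core job, an 8-core job can't match anywhere, despite most of the
pool sitting idle. A minimal sketch:

```python
# Toy model: four 8-core machines, tracked as free cores per machine.
free = [8, 8, 8, 8]

def place_single(free):
    """A single-core job claims one core from the least-loaded machine."""
    i = free.index(max(free))
    free[i] -= 1

def can_place_multi(free, need=8):
    """An 8-core job needs a machine with all `need` cores free."""
    return any(f >= need for f in free)

# Four single-core jobs, one landing on each machine...
for _ in range(4):
    place_single(free)

print(free)                   # [7, 7, 7, 7]
print(sum(free))              # 28 idle cores...
print(can_place_multi(free))  # ...but no 8-core job can match: False
```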
This is exactly the scenario that the defrag daemon is intended to
deal with. I think another way of saying what you want is to say that no
more than 50% of the EPs in the pool should be willing to waste time
making sure that their idle cores are only used by multi-core jobs. (This
is equivalent because HTCondor sorts jobs by submitter before it sorts
them by any other category, when considering which job(s) to match first.)
Suppose you're willing to have each EP wait a full minute for a multi-core
job to match: you can then write a START expression that reflects that:
# Something like this.
START = (TARGET.RequestCpus >= 8) || ((time() - EnteredCurrentState) > 60)
Of course, this doesn't help if there aren't any multi-core slots
in the pool because there haven't been any multi-core jobs for a while,
and that's where the defrag daemon comes in.
The defrag daemon will let a machine drain until
`DEFRAG_WHOLE_MACHINE_EXPR` evaluates to true, so if your only concern is
8-core jobs, you should set it accordingly.
# Something like this.
DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8
You can set `DEFRAG_MAX_WHOLE_MACHINES` so that only half of your
machines will drain at any given time:
# If you have 100 machines in your pool.
DEFRAG_MAX_WHOLE_MACHINES = 50
Allow yourself to drain as many machines as it takes:
# This is deliberately way higher than the actual cap.
DEFRAG_DRAINING_MACHINES_PER_HOUR = 999999
If you want the lowest-possible latency for multicore jobs
(without reserving slots), you'll want to force slots to be renegotiated
after each job. This will cost you quite a bit of extra time negotiating
and reduce your overall throughput, so you may not want to do this right
away; on the other hand, if you leave the defrag daemon running all the
time, it might save you quite a bit of lost time.
# Don't ever re-use a slot.
CLAIM_WORKLIFE = 0
> So the question is, when should the defrag daemon be running?
> This control seems to be missing. How do others approach this?
> Is there some key concept I've misunderstood or missed?
	In our local experience, there's always a mix of jobs in the
queue, and so it's appropriate to have a continuous
defragmentation policy. (In other pools, nodes come and go all the time,
so no explicit defragmentation is necessary.) It isn't ideal, but you
should be able to turn the defrag daemon on and off with `condor_[on|off]
-daemon defrag`, so you could have a little script running on the side
(perhaps as a schedd cron job) that looks at the queue and decides what
to do. (One option is to adjust DEFRAG_MAX_WHOLE_MACHINES depending on
how many jobs are in the queue, I suppose.)
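	A sketch of that side script in Python (the condor_q constraint,
the condor_config_val -set call, and the condor_reconfig invocation are
my assumptions -- check them against your version's man pages before
trusting this):

```python
import subprocess

# Assumed pool size and job shape -- adjust for your site.
POOL_MACHINES = 100
CORES_PER_MULTI_JOB = 8

def desired_whole_machines(idle_multicore_jobs, pool_machines=POOL_MACHINES):
    """Drain up to one machine per idle multi-core job, capped at half the pool."""
    return min(idle_multicore_jobs, pool_machines // 2)

def count_idle_multicore_jobs():
    """Count idle jobs requesting at least CORES_PER_MULTI_JOB cores.

    Assumes condor_q's -constraint/-format behavior; verify locally.
    """
    out = subprocess.run(
        ["condor_q", "-allusers", "-constraint",
         f"JobStatus == 1 && RequestCpus >= {CORES_PER_MULTI_JOB}",
         "-format", "%d\n", "ClusterId"],
        capture_output=True, text=True, check=True)
    return len(out.stdout.split())

def main():
    cap = desired_whole_machines(count_idle_multicore_jobs())
    # Push the new cap at runtime and have the defrag daemon re-read it.
    # (condor_config_val -set needs CONFIG-level authorization; both
    # commands here are assumptions to verify against your version.)
    subprocess.run(["condor_config_val", "-set",
                    f"DEFRAG_MAX_WHOLE_MACHINES = {cap}"], check=True)
    subprocess.run(["condor_reconfig", "-daemon", "defrag"], check=True)

# A cron job (or a schedd cron wrapper) would just call main().
```

The same decision logic could instead toggle the daemon with the
condor_[on|off] approach mentioned above when the count is zero.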
-- ToddM