Re: [HTCondor-users] Question about defragmentation
- Date: Wed, 19 Nov 2025 09:22:58 +0100
- From: Jeff Templon <templon@xxxxxxxxx>
- Subject: Re: [HTCondor-users] Question about defragmentation
Hi Todd,
Thanks for this long and informative message. I think I learned something I had not realized about the defrag daemon, namely that the whole node is in the partitionable slot. Did I understand that correctly?
I had thought that once there was some 8-core slot on the machine (one of the dynamic slots, for example), defrag would decide more draining was not needed, so I would never make it to a fully drained machine. With the new understanding, one of the problems disappears, and with my increasing understanding of the quota/priority dance, I am starting to worry a little less about that problem.
Thanks again, I'll let you know how it goes. See the next message btw :)
JT
> On 19 Nov 2025, at 00:19, Todd L Miller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
> You've described the mechanism you'd like to use quite clearly, and correctly observed that HTCondor doesn't support it directly. However, I'm less clear on what plain-English policy you're attempting to implement, without which it's difficult to suggest good solutions.
>
> That's the most important part of this message. The rest of it goes on -- for a while -- about my guess as to what you want, why the pool isn't already behaving that way, the built-in solution HTCondor comes with, why it can't be configured to operate how you'd expect, and a hack that might make things work anyway.
>
> It sounds like you may want the number of running multi-core jobs to be directly proportional to the number of multi-core jobs in the queue (up to a limit where at most half of the pool's cores are running multi-core jobs).* That amounts to a strong bias in favor of multi-core jobs; the more cores per job, the stronger the bias. In terms of determining which submitter's jobs are matched next, that might mean changing the default weight assigned to jobs so that it ignores core count.
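>
> If that's the route you take, a minimal sketch of one way to do it (untested, and assuming your EPs still have the default of SLOT_WEIGHT = Cpus) is to pin the slot weight to a constant, so usage accounting no longer scales with core count:
>
> # A sketch only: make fair-share accounting ignore core count.
> # SLOT_WEIGHT defaults to Cpus.
> SLOT_WEIGHT = 1
>
> Whether you actually want that depends on whether you're happy charging an 8-core job the same as a single-core job against a submitter's fair share.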
>
> Assuming an even (unweighted by core count) mix of jobs in the queue, your pool "should" trend towards an even (weighted by core count) mix of jobs running. In practice, it probably won't, because when a multi-core job exits, there's a certain chance that the next job examined from a submitter who won the "quota/priority dance" won't be a multi-core job, but there's a much smaller chance that enough single-core jobs will exit at the same time to provide enough cores for a multi-core job. This is a problem, of course, because the multi-core slot, once it's lost one core to run a single-core job, will then be used to run seven more single-core jobs, which won't all exit at the same time. Over time, then, you would expect a pool that had been divided 50/50 between single- and multi-core slots to become 100% single-core slots.
>
> This is exactly the scenario that the defrag daemon is intended to deal with. I think another way of saying what you want is to say that no more than 50% of the EPs in the pool should be willing to waste time making sure that their idle cores are only used by multi-core jobs. (This is equivalent because HTCondor sorts jobs by submitter before it sorts them by any other category, when considering which job(s) to match first.) Suppose you're willing to have each EP wait a full minute for a multi-core job to match: you can then write a START expression that reflects that:
>
> # Something like this.
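> # Match multi-core (8+ core) jobs right away; match anything else only
> # after the slot has sat in its current state for 60 seconds.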
> START = (TARGET.RequestCpus >= 8) || ((time() - EnteredCurrentState) > 60)
>
> Of course, this doesn't help if there aren't any multi-core slots in the pool because there haven't been any multi-core jobs for a while, and that's where the defrag daemon comes in.
>
>
> The defrag daemon will let a machine drain until
> `DEFRAG_WHOLE_MACHINE_EXPR` evaluates to true, so if your only concern is 8-core jobs, you should set it accordingly.
>
> # Something like this.
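> # Cpus in the partitionable slot is the number of cores not currently
> # carved off into dynamic slots.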
> DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8
>
> You can set `DEFRAG_MAX_WHOLE_MACHINES` so that only half of your machines will drain at any given time:
>
> # If you have 100 machines in your pool.
> DEFRAG_MAX_WHOLE_MACHINES = 50
>
> Allow yourself to drain as many machines as it takes:
>
> # This is deliberately way higher than the actual cap.
> DEFRAG_DRAINING_MACHINES_PER_HOUR = 999999
>
> If you want the lowest-possible latency for multi-core jobs (without reserving slots), you'll want to force slots to be renegotiated after each job. This will cost you quite a bit of extra time negotiating and reduce your overall throughput, so you may not want to do this right away; on the other hand, if you leave the defrag daemon running all the time, it might save you quite a bit of lost time.
>
> # Don't ever re-use a slot.
> CLAIM_WORKLIFE = 0
>
>
> So the question is, when should the defrag daemon be running?
>
>> this control seems to be missing.
>>
>> How do others approach this? Is there some key concept I've misunderstood or missed?
>
> To my understanding, our local experience is that there is always a mix of jobs in the queue, and so it's appropriate to have a continuous defragmentation policy. (In other pools, nodes come and go all the time, so no explicit defragmentation is necessary.) It isn't ideal, but you should be able to turn the defrag daemon on and off with `condor_[on|off] -daemon defrag`, so you could have a little script running on the side (perhaps as a schedd cron job) that looks at the queue and decides what to do; a sketch follows below. (One option is to adjust DEFRAG_MAX_WHOLE_MACHINES depending on how many jobs are in the queue, I suppose.)
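>
> For what it's worth, a rough sketch of such a script (untested; run it on the host where the defrag daemon lives, and treat the 8-core threshold and the attribute names as assumptions to adapt to your pool):
>
> #!/bin/sh
> # Count idle multi-core jobs in the queue (JobStatus == 1 means idle).
> idle_multicore=$(condor_q -allusers -constraint 'RequestCpus >= 8 && JobStatus == 1' -af ClusterId | wc -l)
>
> if [ "$idle_multicore" -gt 0 ]; then
>     # Multi-core work is waiting: let the defrag daemon drain machines.
>     condor_on -daemon defrag
> else
>     # No multi-core work: stop paying the draining cost.
>     condor_off -daemon defrag
> fi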
>
> -- ToddM