Re: [HTCondor-users] portionable slots and greedy users
- Date: Fri, 6 Oct 2023 13:44:28 -0500 (CDT)
- From: Todd L Miller <tlmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] portionable slots and greedy users
> If you can't trust your users, hopefully the condor folks can offer a
> workable solution. I can't think of an easy way out, sorry.
The first post in the thread said that the long jobs were also
all characterized by having high memory requirements. You could write a
submit transform that matches whatever "high memory" means in this context
and inserts a concurrency limit. See
https://htcondor.readthedocs.io/en/latest/admin-manual/setting-up-special-environments.html#concurrency-limits
for details, but the idea is you set the maximum number of concurrently
running "high memory" jobs so that they can only use 95% of the pool.
(Maybe aim for 75% first and increase the limit as necessary? 95%
doesn't have a lot of slop...)
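Untested, but the transform and the limit could look something like the
following; the 32 GB cutoff, the HIGHMEM name, and the limit value are just
placeholders for whatever fits your pool:

    # schedd config: tag memory-hungry jobs with a concurrency limit.
    JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) TagHighMem
    JOB_TRANSFORM_TagHighMem @=end
       # "high memory" is a guess here; RequestMemory is in MB.
       REQUIREMENTS RequestMemory >= 32768
       SET ConcurrencyLimits "HIGHMEM"
    @end

    # negotiator config: cap concurrent HIGHMEM jobs pool-wide. For a
    # 1000-slot pool of single-slot high-memory jobs, 950 would be the
    # 95% figure (or 750 to start, per the above).
    HIGHMEM_LIMIT = 950

If users already set concurrency_limits themselves you'd want to append
rather than overwrite, but that's the gist.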
That's if you want to reserve 5% of the pool for non-high-memory
jobs. This will waste capacity if you don't have "enough" of such jobs,
but should do a good job of ensuring small scheduling delays. If you
instead want the share of the pool's time spent running short jobs over
(roughly) the whole day to be 5%, you can use the same trick, but with
accounting groups instead of concurrency limits.
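I haven't tested this either, but with dynamic group quotas the split might
come out roughly like this; the group names are made up, and you'd still
need something (another submit transform, or user cooperation) to put jobs
into the right group:

    # negotiator config: split the pool roughly 95% / 5% by group.
    GROUP_NAMES = group_highmem, group_other
    GROUP_QUOTA_DYNAMIC_group_highmem = 0.95
    GROUP_QUOTA_DYNAMIC_group_other = 0.05
    # Let a group borrow the other's idle share, so over the day the
    # split is roughly proportional rather than a hard reservation.
    GROUP_ACCEPT_SURPLUS = True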
You can also empirically determine shortness. Write a submit
transform that sets allowed_job_duration (or allowed_execute_duration as
appropriate) to 300 and a periodic_release which removes the hold
automatically. The periodic_release can't change the value of
allowed_job_duration, but you can probably say something like:
allowed_job_duration = ifThenElse( NumHolds == 0, 300, undefined )
instead of just "300". HoldReasonCode 46 (or 47) is reserved for
allowed_job_duration (or allowed_execute_duration) being exceeded, so the
periodic_release expression should be easy to write.
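Untested, but put together as a transform that might come out roughly
like this (the transform name is made up, 300 is just the example cutoff
above, and code 46 assumes allowed_job_duration rather than
allowed_execute_duration):

    # schedd config: cap a job's first attempt at 300 seconds and
    # release the resulting hold automatically.
    JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) CapShort
    JOB_TRANSFORM_CapShort @=end
       # Only the first attempt is capped; after one hold the cap goes away.
       SET AllowedJobDuration ifThenElse(NumHolds == 0, 300, undefined)
       # 46 == allowed_job_duration exceeded, so only that hold is released.
       SET PeriodicRelease (HoldReasonCode == 46)
    @end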
-- ToddM