
Re: [HTCondor-users] partitionable slots and greedy users



If you can't trust your users, hopefully the condor folks can offer a workable solution. I can't think of an easy way out, sorry.

The first post in the thread said that the long jobs were also all characterized by high memory requirements. You could write a submit transform that matches whatever "high memory" means in this context and inserts a concurrency limit. See

https://htcondor.readthedocs.io/en/latest/admin-manual/setting-up-special-environments.html#concurrency-limits

for details, but the idea is you set the maximum number of concurrently running "high memory" jobs so that they can only use 95% of the pool. (Maybe aim for 75% first and increase the limit as necessary? 95% doesn't have a lot of slop...)
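A rough sketch of what that could look like, assuming "high memory" means RequestMemory over 16 GB and a pool of about 100 cores (the HIGHMEM name, the threshold, and the count are all placeholders, not anything from the original thread):

# negotiator config: at most 95 HIGHMEM jobs running at once
HIGHMEM_LIMIT = 95

# schedd config: tag matching jobs with the limit at submit time
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) HighMem
JOB_TRANSFORM_HighMem @=end
   REQUIREMENTS RequestMemory > 16384
   SET ConcurrencyLimits "HIGHMEM"
@end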

That's if you want to reserve 5% of the pool for non-high-memory jobs. This will waste capacity if you don't have "enough" of such jobs, but should do a good job of ensuring small scheduling delays. If you instead want the share of the pool's time spent running short jobs over (roughly) the whole day to be 5%, you can use the same trick, but with accounting groups instead of concurrency limits.
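For the accounting-group version, the negotiator side could look something like this (the group names and the 5%/95% split are placeholders; you'd route jobs into the groups with the same kind of transform as above):

GROUP_NAMES = group_short, group_long
# short jobs get 5% of the pool's time, long jobs the rest
GROUP_QUOTA_DYNAMIC_group_short = 0.05
GROUP_QUOTA_DYNAMIC_group_long = 0.95
# let either group borrow capacity the other isn't using
GROUP_ACCEPT_SURPLUS = True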

You can also empirically determine shortness. Write a submit transform that sets allowed_job_duration (or allowed_execute_duration, as appropriate) to 300 and a periodic_release that removes the hold automatically. The periodic_release can't change the value of allowed_job_duration, but you can probably say something like:

allowed_job_duration = ifThenElse( NumHolds == 0, 300, undefined )

instead of just "300". HoldReasonCode 46 (or 47) is reserved for allowed_job_duration (or allowed_execute_duration) being exceeded, so the periodic_release expression should be easy to write.
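Put together as a transform, that might look like the following (ShortFirst is a placeholder name; the 46 matches allowed_job_duration, so use 47 if you go with allowed_execute_duration instead):

JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) ShortFirst
JOB_TRANSFORM_ShortFirst @=end
   # first attempt is capped at 300 seconds; after one hold, no cap
   SET AllowedJobDuration ifThenElse(NumHolds == 0, 300, undefined)
   # release automatically when the cap was the reason for the hold
   SET PeriodicRelease (HoldReasonCode == 46)
@end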

-- ToddM