Re: [HTCondor-users] slot cool down time
- Date: Mon, 23 Aug 2021 17:29:21 +0200
- From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] slot cool down time
On 23/08/21 16:39, Beyer, Christoph wrote:
Hi,
I would like a certain type of slot to cool down after a finished or failed job. The reason is that these are interactive jobs, and I want the user's second try (a completely new job) not to run on the same node where the previous one did not succeed.
In my tiny brain I thought that something like:
START = $(START) && ((time() - EnteredCurrentState) > 600)
would do the trick just fine and effectively cause a 10-minute waiting time, but apparently it does not. Any suggestions?
Best
christoph
Hello Christoph,
my understanding is that EnteredCurrentState
is the job classad of a "not yet started job", so that clause would
match jobs pending for more than 10 mins.
I had a similar problem (I did not want a single-core job to start
immediately after a multicore job left).
I defined a custom machine classad "MC_GRACE" whose numeric value is set
by a startd cronjob running at the node.
In this case, the idea is that the cronjob (running every minute) keeps a
numeric value (say x) set to 1.0 while a given condition holds
(example: a job of Owner "foo" is running).
When the condition no longer holds, each run scales the x value; for example: x <-- 0.75 * x
Then you set the machine classad as, say: OKJOB = (x < 0.1).
In this case OKJOB would become True 9 minutes after the condition
that keeps x at 1.0 has gone (0.75^8 is still just above 0.1, 0.75^9 is below it).
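
To make the arithmetic concrete, here is a small Python sketch of the decay logic only. It is not the actual cron script: the function names and the constants DECAY and THRESHOLD are chosen for illustration, and it assumes one run per minute as described above.

```python
# Sketch of the decaying "grace" value; names are illustrative assumptions.
DECAY = 0.75      # factor applied at each run once the condition is gone
THRESHOLD = 0.1   # OKJOB is True when x falls below this value


def next_x(x: float, condition_holds: bool) -> float:
    """One cron run: pin x at 1.0 while the condition holds, else decay it."""
    return 1.0 if condition_holds else DECAY * x


def ok_job(x: float) -> bool:
    """Mirror of the machine expression OKJOB = (x < 0.1)."""
    return x < THRESHOLD


def minutes_until_ok(x: float = 1.0) -> int:
    """Count the one-minute runs needed before OKJOB becomes True."""
    minutes = 0
    while not ok_job(x):
        x = next_x(x, condition_holds=False)
        minutes += 1
    return minutes


if __name__ == "__main__":
    print(minutes_until_ok())
```

Changing DECAY or THRESHOLD tunes the length of the cool-down; with 0.75 and 0.1 the loop stops after 9 runs.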
I hope this idea helps, even though my explanation may be a bit
confusing.
Stefano