[HTCondor-users] startd job count limit to limit the damage of black holes
- Date: Mon, 28 Aug 2017 17:36:59 -0400
- From: "Betts, Wayne" <wbetts@xxxxxxx>
- Subject: [HTCondor-users] startd job count limit to limit the damage of black holes
Hello Condor Community,
Is there any way to have startd only start N jobs and then stop matching
any more? For instance, I often want N=1 so that only one job can
execute on a new machine added to a cluster, though I can imagine other
values of N might also be of use in some cases. A misconfiguration of a
new node all too often causes jobs to fail quickly, so another job
starts and fails and so on, thus creating a black hole, quickly draining
our queue without doing anything useful. Initially limiting the total
number of started jobs to 1 until the node is shown to successfully run
our jobs would help me tremendously. Something like
START = (TotalJobsStarted < 2)  # where TotalJobsStarted is the missing
piece that I've yet to find, so am seeking your help.
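
To make the desired behavior concrete, here is a minimal sketch in Python. It only models the policy and is not HTCondor configuration: the class StartGate, its methods, and the total_jobs_started counter are all hypothetical names, with the counter standing in for the missing TotalJobsStarted attribute. It also assumes the node can be marked as validated by hand once a job has run correctly.

```python
class StartGate:
    """Hypothetical model of a startd that admits at most max_starts jobs
    until an operator marks the node as validated."""

    def __init__(self, max_starts=1):
        self.max_starts = max_starts
        # Stand-in for the missing TotalJobsStarted attribute (assumption).
        self.total_jobs_started = 0
        self.validated = False

    def start_allowed(self):
        # Analogue of a START expression: unlimited once the node is
        # validated, otherwise capped at max_starts job starts.
        return self.validated or self.total_jobs_started < self.max_starts

    def try_start(self):
        # Count a job at the moment it starts; refuse if the cap is reached.
        if not self.start_allowed():
            return False
        self.total_jobs_started += 1
        return True

    def mark_validated(self):
        # Operator action after confirming that a job ran correctly.
        self.validated = True


gate = StartGate(max_starts=1)
first = gate.try_start()    # admitted: no job has started yet
second = gate.try_start()   # refused: the cap of 1 has been reached
gate.mark_validated()
third = gate.try_start()    # admitted again once the node is trusted
```

Whether the comparison should read "< N" or "< N + 1" depends on when such a counter would be incremented; the sketch counts a job at the moment it starts, so "< max_starts" admits exactly max_starts jobs.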
A different approach might be to add in a lengthy delay between the time
a job finishes and the time another job is started. With NUM_SLOTS = 1
and a few minutes delay between a job's immediate failure (which condor
only sees as a successful completion) and a new job starting, I could
manually detect the failure of a job and shut down condor on the black
hole node until I figure out the cause of the failure and try again.
The submit option "keep_claim_idle" looks like it does something like
this, but is generally undesirable, and I'd rather have something like
this on the startd side, rather than on the submit side. Is there such
an option/classad for startd? (It wasn't clear to me if setting
JOB_START_DELAY to a large value would do the trick, so I tried it and
it did not help).
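
The delay idea can be sketched the same way. Again this is only an illustrative Python model, not an HTCondor feature: CooldownGate and its methods are hypothetical names, and timestamps are passed in explicitly as seconds so the behavior is easy to follow.

```python
class CooldownGate:
    """Hypothetical model of a single-slot node that waits delay_seconds
    after a job ends before accepting another one."""

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self.last_job_end = None  # no job has finished yet

    def job_finished(self, now):
        # Record when the most recent job ended, whether it failed or not.
        self.last_job_end = now

    def start_allowed(self, now):
        # A brand-new node may start its first job immediately.
        if self.last_job_end is None:
            return True
        # Afterwards, require the full cool-down to have elapsed.
        return now - self.last_job_end >= self.delay_seconds


gate = CooldownGate(delay_seconds=300)
gate.job_finished(now=10)                 # a job fails 10 s after starting
too_soon = gate.start_allowed(now=100)    # False: only 90 s have passed
later = gate.start_allowed(now=310)       # True: 300 s have passed
```

With a window like this, a fast-failing job would cost at most one job per cool-down period, which leaves time to notice the failure and take the node offline.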
Btw, I found
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=AvoidingBlackHoles,
but I don't see how it helps, since I'd rather not drain the queue out
completely in the first place. If a single job fails, our submission
system will (eventually) detect it, and it will be resubmitted without
any significant loss, but if the entire queue is emptied because of all
idle jobs going to the black hole, then we start losing CPU cycles.
Thank you for your time,
-Wayne