Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] startd job count limit to limit the damage of black holes

Date: Mon, 28 Aug 2017 17:36:59 -0400
From: "Betts, Wayne" <wbetts@xxxxxxx>
Subject: [HTCondor-users] startd job count limit to limit the damage of black holes

Hello Condor Community,

Is there any way to have startd only start N jobs and then stop matching any more? For instance, I often want N=1 so that only one job can execute on a new machine added to a cluster, though I can imagine other values of N might also be of use in some cases. A mis-configuration of a new node all too often causes jobs to fail quickly, so another job starts and fails and so on, thus creating a black hole, quickly draining our queue without doing anything useful. Initially limiting the total number of started jobs to 1 until the node is shown to successfully run our jobs would help me tremendously. Something like

START = (TotalJobsStarted < 2) # where TotalJobsStarted is the missing piece that I've yet to find, so am seeking your help.

A different approach might be to add in a lengthy delay between the time a job finishes and the time another job is started. With NUM_SLOTS = 1 and a few minutes' delay between a job's immediate failure (which condor only sees as a successful completion) and a new job starting, I could manually detect the failure of a job and shut down condor on the black hole node until I figure out the cause of the failure and try again. The submit option "keep_claim_idle" looks like it does something like this, but is generally undesirable, and I'd rather have something like this on the startd side, rather than on the submit side. Is there such an option/classad for startd? (It wasn't clear to me if setting JOB_START_DELAY to a large value would do the trick, so I tried it and it did not help.)

Btw, I found https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=AvoidingBlackHoles, but I don't see how it helps, since I'd rather not drain the queue out completely in the first place. If a single job fails, our submission system will (eventually) detect it, and it will be resubmitted without any significant loss, but if the entire queue is emptied because of all idle jobs going to the black hole, then we start losing CPU cycles.
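
To make the concern concrete, here is a toy Python model of the effect. It is not HTCondor code and uses no HTCondor API: the function name, its parameters, and the numbers in the usage lines are all made up for illustration. It assumes one misconfigured slot that fails every job after `fail_runtime` seconds, a pool of healthy slots that each take `good_runtime` seconds per job, and an optional `max_starts` cap playing the role of the N described above.

```python
import heapq


def jobs_eaten_by_black_hole(idle_jobs, good_slots, good_runtime,
                             fail_runtime, max_starts=None):
    """Count how many queued jobs land on the one bad slot (toy model)."""
    BLACK_HOLE = 0  # slot id 0 is the misconfigured node; 1..good_slots are healthy
    # Every slot is free at t=0; the heap always yields the slot that frees up first.
    free_at = [(0, slot) for slot in range(good_slots + 1)]
    heapq.heapify(free_at)
    eaten = 0
    while idle_jobs > 0 and free_at:
        now, slot = heapq.heappop(free_at)
        idle_jobs -= 1  # the free slot takes the next idle job
        if slot == BLACK_HOLE:
            eaten += 1
            if max_starts is not None and eaten >= max_starts:
                continue  # cap reached: the bad slot stops matching
            heapq.heappush(free_at, (now + fail_runtime, slot))
        else:
            heapq.heappush(free_at, (now + good_runtime, slot))
    return eaten


# Hypothetical numbers: 1000 idle jobs, 10 healthy slots, hour-long jobs,
# and a bad node that fails each job after one second.
print(jobs_eaten_by_black_hole(1000, 10, 3600, 1))                # 990
print(jobs_eaten_by_black_hole(1000, 10, 3600, 1, max_starts=1))  # 1
```

In this simplified model the uncapped bad slot consumes nearly the whole queue before any healthy slot finishes its first job, while a cap of 1 limits the loss to a single job, which is the behavior I am hoping to get from startd.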


Thank you for your time,

-Wayne

Follow-Ups:
- Re: [HTCondor-users] startd job count limit to limit the damage of black holes
  - From: Greg Thain
- Re: [HTCondor-users] startd job count limit to limit the damage of black holes
  - From: Michael Pelletier
