Re: [Condor-users] Job rescheduling
- Date: Mon, 17 Aug 2009 09:55:52 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Job rescheduling
Matthew Farrellee wrote:
> If you're seeing a 2 hour timeout that sounds fairly familiar. I
> believe Todd answered it previously. I'd assume his answer was to
> reverse the direction on the alive messages. I'll ping him to include
> details.
Here is what we think can be done with the current Condor binaries to
address the problem:
In the condor_config on **both** the submit machines (running the
condor_schedds) AND the execute machines (running the condor_startds),
add the following setting:
STARTD_SENDS_ALIVES = True
Then do a condor_reconfig as usual to both submit and execute machines
(or a condor_reconfig -all). Note that the default setting for this
parameter is False, so if it is not specified in the config it is False.
Unfortunately, Condor does not (yet) gracefully handle the situation
where the value differs between the submit and execute machines.
Once the above is in place, your job ClassAds will contain an attribute
"LastJobLeaseRenewal", an integer giving the epoch time (number of
seconds since 1/1/1970) at which the schedd last heard from the startd
on the execute machine.
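To make the attribute concrete, here is a minimal Python sketch of how
one might interpret a LastJobLeaseRenewal value after reading it out of
a job ClassAd. The function name and both sample values are
hypothetical, chosen only for illustration; nothing here is part of
Condor itself.

```python
from datetime import datetime, timezone

def describe_lease_renewal(last_renewal, now):
    """Return (UTC datetime of the last renewal, seconds elapsed since then).

    Both arguments are integers in seconds since 1/1/1970 UTC, the same
    unit LastJobLeaseRenewal uses.
    """
    renewed_at = datetime.fromtimestamp(last_renewal, tz=timezone.utc)
    return renewed_at, now - last_renewal

# Hypothetical sample values, not taken from a real job ClassAd.
renewed_at, elapsed = describe_lease_renewal(1250520000, 1250520952)
print(renewed_at.isoformat(), elapsed)
```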
So in your job submit description file (which you give to
condor_submit), you could add the following:
PeriodicHold = JobLeaseDuration =!= UNDEFINED && \
((JobLeaseDuration - (CurrentTime - LastJobLeaseRenewal)) <= 0 )
PeriodicRelease = PeriodicHold =?= True
The above says that if the job has a job lease and the lease has
expired, put the job on hold, thereby moving it from Running state to
Hold state. The periodic release expression then says that if the lease
is expired (ergo the PeriodicHold expression is true), release the job
from Hold state back to Idle state, at which point it will be
rescheduled someplace else. Note you can use SUBMIT_EXPRS (see Manual)
to have condor_submit automatically add the above policy to every job
submitted.
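For anyone who wants to check the arithmetic, the hold condition can be
sketched in Python roughly as follows. This is only an illustration of
the expression's logic, not how the schedd evaluates ClassAds: the
dictionary stands in for a job ClassAd, None stands in for UNDEFINED,
and the 1200 second lease and the timestamps are made-up values.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value

def lease_expired(job_ad, current_time):
    """Mimic the PeriodicHold expression from the submit file.

    True only when the job has a lease and the time since the last
    renewal has used up the whole lease duration.
    """
    lease = job_ad.get("JobLeaseDuration", UNDEFINED)
    if lease is UNDEFINED:
        return False  # no job lease, so the policy never holds the job
    last = job_ad.get("LastJobLeaseRenewal", UNDEFINED)
    if last is UNDEFINED:
        # Simplification: an expression that is not true is treated
        # here as "do not hold".
        return False
    remaining = lease - (current_time - last)
    return remaining <= 0

# Hypothetical job ad: 1200 s lease, last renewed at t=1000.
job = {"JobLeaseDuration": 1200, "LastJobLeaseRenewal": 1000}
print(lease_expired(job, 1500))  # 700 s of lease remain
print(lease_expired(job, 2200))  # renewal is 1200 s old, lease used up
print(lease_expired({"LastJobLeaseRenewal": 1000}, 5000))  # no lease
```

PeriodicRelease is true exactly when this condition is true, which is
why a job held by this policy goes straight back to Idle.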
Let us know how the above suggestions go.
In a future release of Condor, we wish to do the following:
a) make STARTD_SENDS_ALIVES default to True
b) have the schedd automatically move a job with an expired lease
from Running back to Idle the moment the lease expires, without
requiring the user to set up the periodic hold/release expressions and
without the polling delay those expressions introduce (the schedd
only evaluates the periodic expressions at intervals).
--
Todd Tannenbaum University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
tannenba@xxxxxxxxxxx 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685