On Tuesday, July 12, 2011 at 11:05 AM, Sarah Williams wrote:
Hi Condor users & experts,
I have one user on my cluster whose jobs are usually well-behaved, but
sometimes stall on contacting a remote server. I manually kill those
jobs when I notice them, but I'd like to get that automated. The
typical sign of a stalled job is one that has >1hr of walltime, and
<1min of cputime.
Is there a way to have condor automatically remove these jobs?
This is a touch tricky. I'm not sure how you get cumulative user + sys
CPU for a job for *just* the current run. But you can see the cumulative
user + sys CPU numbers for all the times a job has run. Plus the
wallclock time for all runs.
So if you wanted to do this for just a particular job, you'd add to its
submit ticket something like:
periodic_remove = (JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)
Or:
periodic_remove = (RemoteWallClockTime > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)
The first one says: remove this job if it has currently been running for
greater than one hour and the total sys+user CPU time it's managed to
accumulate across all its run attempts is less than 61 seconds.
The second one says: remove this job if it's accumulated more than an
hour of remote run time but has less than 61 seconds of remote sys+user
CPU time.
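They're slightly different (the first looks at wallclock time for the
current run only, the second at wallclock time across all runs), but
either is mostly what you want.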
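For context, here is a rough sketch of where the first expression would sit in a submit file. The executable and output file names are hypothetical placeholders; only the periodic_remove line comes from the discussion above. The expression is kept on a single line so the mail client's wrapping doesn't break it.

```
# Hypothetical submit file; only periodic_remove is taken from the text above.
universe   = vanilla
executable = fetch_data.sh
output     = fetch_data.out
error      = fetch_data.err
log        = fetch_data.log

# Remove the job if its current run has lasted over an hour while the
# cumulative sys+user CPU time is still under 61 seconds.
periodic_remove = (JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)

queue
```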
If you wanted these settings to apply to all jobs submitted to a
scheduler you could add:
SYSTEM_PERIODIC_REMOVE = <expression>
to the condor_config.local for the scheduler machine and reconfigure the
scheduler. Then all jobs submitted to that scheduler would be subject to
this removal expression.
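As a sketch, assuming you pick the first of the two expressions above and keep the same thresholds, the entry in condor_config.local might look like this:

```
# condor_config.local on the scheduler machine (sketch; thresholds as above).
# Applies to every job in this scheduler's queue, not just one user's jobs.
SYSTEM_PERIODIC_REMOVE = (JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)
```

Running condor_reconfig on that machine afterwards should get the scheduler to pick up the change.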
- Ian