Hi Condor users & experts,
I have one user on my cluster whose jobs are usually well-behaved, but
sometimes stall on contacting a remote server. I manually kill those
jobs when I notice them, but I'd like to get that automated. The
typical sign of a stalled job is one that has >1hr of walltime, and
<1min of cputime.
Is there a way to have condor automatically remove these jobs?
This is a touch tricky. I'm not sure how you get cumulative user + sys CPU for a job for *just* the current run. But you can see the cumulative user + sys CPU numbers across all the times a job has run, plus the wallclock time for all runs.
So if you wanted to do this for just a particular job, you'd add something like this to its submit file:
periodic_remove = (JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)
Or:
periodic_remove = (RemoteWallClockTime > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)
The first one says: remove this job if it has currently been running for more than one hour and the total sys+user CPU time it's managed to accumulate across all its run attempts is less than 61 seconds.
The second one says: remove this job if it's accumulated more than an hour of remote run time but has less than 61 seconds of remote sys+user CPU time.
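They're slightly different, but either is mostly what you want: the first looks at how long the current run has lasted, the second at the run time accumulated over all runs. Both compare against CPU time accumulated over all runs.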
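For context, here is a minimal sketch of a submit file carrying the first expression. The executable name and the output/error/log file names are placeholders I made up for illustration; only the periodic_remove line comes from the discussion above.

```
# Hypothetical submit file; executable and file names are placeholders.
universe   = vanilla
executable = fetch_data.sh
output     = fetch_data.$(Cluster).$(Process).out
error      = fetch_data.$(Cluster).$(Process).err
log        = fetch_data.log

# Remove the job if it has been running for over an hour
# but has accumulated less than 61 seconds of sys+user CPU.
periodic_remove = (JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)

queue
```

You'd submit this with condor_submit as usual. It's worth trying the expression on a test job first, since a job that legitimately waits a long time on I/O would match the same pattern.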
If you wanted these settings to apply to all jobs submitted to a scheduler you could add:
SYSTEM_PERIODIC_REMOVE = <expression>
to the condor_config.local for the scheduler machine and reconfigure the scheduler. Then all jobs submitted to that scheduler would be subject to this removal expression.
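As a rough sketch, with the first expression filled in, the scheduler-side setting could look like the fragment below. The commented-out variant that limits removal to a single user is my own assumption about how you might narrow it, and "someuser" is a placeholder account name.

```
# condor_config.local on the scheduler machine.
# Applies to every job submitted to this scheduler.
SYSTEM_PERIODIC_REMOVE = (JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)

# Assumed variant: only touch jobs owned by one (placeholder) user.
# SYSTEM_PERIODIC_REMOVE = (Owner == "someuser") && (JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)
```

Running condor_reconfig on that machine afterwards should get the scheduler to pick up the change.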
- Ian
---
Ian Chesal
Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools
http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing