Re: [Condor-users] Automate removal of inefficient jobs

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

On Tuesday, July 12, 2011 at 11:05 AM, Sarah Williams wrote:

Hi Condor users & experts,

I have one user on my cluster whose jobs are usually well-behaved, but
sometimes stall on contacting a remote server. I manually kill those
jobs when I notice them, but I'd like to get that automated. The
typical sign of a stalled job is one that has >1hr of walltime, and
<1min of cputime.

Is there a way to have condor automatically remove these jobs?

This is a touch tricky. I'm not sure how you get cumulative user + sys CPU for a job for *just* the current run. But you can see the cumulative user + sys CPU numbers for all the times a job has run. Plus the wallclock time for all runs.

So if you wanted to do this for just a particular job, you'd add to its submit ticket something like:

periodic_remove = (JobStatus == 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)

Or:

periodic_remove = (RemoteWallClockTime > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)

The first one says: remove this job if it has currently been running for more than one hour and the total sys+user CPU time it has managed to accumulate across all its run attempts is less than 61 seconds.

The second one says: remove this job if it's accumulated more than an hour of remote run time but has less than 61 seconds of remote sys+user CPU time.

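They're slightly different (the first measures wallclock time for the current run only, the second for all runs combined), but either is mostly what you want.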
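
To make the difference concrete, here is a rough Python sketch of the two expressions. It is only a model for reasoning about them, not how Condor evaluates ClassAds. The JobAd class, the NOW timestamp and the three sample jobs are made up for illustration; only the attribute names come from the expressions above.

```python
from dataclasses import dataclass


@dataclass
class JobAd:
    """Hypothetical stand-in for the job attributes used in the expressions."""
    JobStatus: int              # 2 means the job is running
    EnteredCurrentStatus: int   # epoch seconds when the current status began
    RemoteWallClockTime: float  # wallclock seconds, cumulative over all runs
    RemoteSysCpu: float         # system CPU seconds, cumulative over all runs
    RemoteUserCpu: float        # user CPU seconds, cumulative over all runs


def first_expr(job: JobAd, current_time: int) -> bool:
    """Running for over an hour in this run, with under 61s cumulative CPU."""
    return (job.JobStatus == 2
            and current_time - job.EnteredCurrentStatus > 3600
            and job.RemoteSysCpu + job.RemoteUserCpu < 61)


def second_expr(job: JobAd) -> bool:
    """Over an hour of cumulative wallclock, with under 61s cumulative CPU."""
    return (job.RemoteWallClockTime > 3600
            and job.RemoteSysCpu + job.RemoteUserCpu < 61)


NOW = 1_310_486_700  # arbitrary epoch timestamp, for illustration only

# Two earlier 35-minute attempts that stalled, now 5 minutes into a third run.
evicted_twice = JobAd(2, NOW - 300, 4200, 5, 25)

# A single run that has sat idle for two hours. RemoteWallClockTime is
# assumed here to already include the current run.
stalled_once = JobAd(2, NOW - 7200, 7200, 1, 4)

# Did nine hours of real work in earlier runs, then stalled for two hours.
busy_then_stalled = JobAd(2, NOW - 7200, 43200, 600, 31800)

for name, job in [("evicted_twice", evicted_twice),
                  ("stalled_once", stalled_once),
                  ("busy_then_stalled", busy_then_stalled)]:
    print(name, first_expr(job, NOW), second_expr(job))
```

The first sample job is removed only by the second expression, since its current run is still short. The last one is removed by neither, because both expressions compare against CPU time accumulated over all runs, which is the limitation mentioned at the top.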

If you wanted these settings to apply to all jobs submitted to a scheduler you could add:

SYSTEM_PERIODIC_REMOVE = <expression>

to the condor_config.local for the scheduler machine and reconfigure the scheduler. Then all jobs submitted to that scheduler would be subject to this removal expression.

- Ian

---

Ian Chesal

Cycle Computing, LLC

Leader in Open Compute Solutions for Clouds, Servers, and Desktops

Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com

http://www.cyclecloud.com

http://twitter.com/cyclecomputing
