Re: [Condor-users] Automate removal of inefficient jobs
- Date: Tue, 12 Jul 2011 17:05:53 -0400
- From: Sarah Williams <saewill@xxxxxxxxx>
- Subject: Re: [Condor-users] Automate removal of inefficient jobs
Hi all,
I used condor_q to test this statement, and it selects the correct jobs.
condor_q -constraint ' User =?= "user1@xxxxxxxxxxxxxxx" && (JobStatus
== 2) && (CurrentTime - EnteredCurrentStatus > 3600) && ((RemoteSysCpu +
RemoteUserCpu) < 61)'
So, I set SYSTEM_PERIODIC_REMOVE equal to that value on the schedd host,
verified it with condor_config_val, and waited. But, it does not seem to
be removing the jobs. The ScheddLog does not have any unusual entries.
I tried wrapping the statement with debug(), but no debug messages are
printed to the log. Also tried SCHEDD_DEBUG = D_FULLDEBUG D_COMMAND,
but there were no messages about periodic_remove in the output.
Tomorrow I will try setting periodic_remove per job and see if that
works ....
--Sarah
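
A minimal sketch of how the constraint above could be assembled per user before handing it to condor_q. The helper names and the placeholder address are hypothetical and not from the thread; only the `-constraint` option and the attribute names come from the message itself, and nothing is executed here.

```python
import shlex


def stalled_job_constraint(user, min_wall_s=3600, max_cpu_s=61):
    """Return the ClassAd constraint from the message above for one user."""
    return (
        f'User =?= "{user}" && (JobStatus == 2) && '
        f"(CurrentTime - EnteredCurrentStatus > {min_wall_s}) && "
        f"((RemoteSysCpu + RemoteUserCpu) < {max_cpu_s})"
    )


def condor_q_argv(user):
    """Build an argument list suitable for subprocess.run(); not run here."""
    return ["condor_q", "-constraint", stalled_job_constraint(user)]


# "user1@example.org" is a placeholder address, not one from the thread.
argv = condor_q_argv("user1@example.org")
print(shlex.join(argv))  # shell-quoted form of the command, for inspection
```

Passing the list to subprocess.run() avoids shell-quoting problems with the double quotes inside the expression.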
On 7/12/11 1:23 PM, Matthew Farrellee wrote:
> That's good stuff. Remember you can try it out by just running...
>
> condor_q -constraint '(JobStatus == 2) && (CurrentTime -
> EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)'
>
> ...to get a list of jobs that would be removed.
>
> Best,
>
>
> matt
>
> On 07/12/2011 11:31 AM, Sarah Williams wrote:
>> Hi Ian,
>>
>> Thanks, I will start from what you've suggested and let you know how it
>> goes. One thing I am unclear on: by "current run", do you mean a job
>> that has been held and then restarted?
>>
>> --Sarah
>>
>> On 7/12/11 11:20 AM, Ian Chesal wrote:
>>> On Tuesday, July 12, 2011 at 11:05 AM, Sarah Williams wrote:
>>>> Hi Condor users & experts,
>>>>
>>>> I have one user on my cluster whose jobs are usually well-behaved, but
>>>> sometimes stall on contacting a remote server. I manually kill those
>>>> jobs when I notice them, but I'd like to get that automated. The
>>>> typical sign of a stalled job is one that has >1hr of walltime and
>>>> <1min of cputime.
>>>>
>>>> Is there a way to have condor automatically remove these jobs?
>>> This is a touch tricky. I'm not sure how you get cumulative user + sys
>>> CPU for a job for *just* the current run. But you can see the cumulative
>>> user + sys CPU numbers for all the times a job has run. Plus the
>>> wallclock time for all runs.
>>>
>>> So if you wanted to do this for just a particular job, you'd add to its
>>> submit ticket something like:
>>>
>>> periodic_remove = (JobStatus == 2) && (CurrentTime -
>>> EnteredCurrentStatus > 3600) && ((RemoteSysCpu + RemoteUserCpu) < 61)
>>>
>>> Or:
>>>
>>> periodic_remove = (RemoteWallClockTime > 3600) && ((RemoteSysCpu +
>>> RemoteUserCpu) < 61)
>>>
>>> The first one says: remove this job if it's currently been running for
>>> greater than one hour and the total sys+user CPU time it's managed to
>>> accumulate across all its run attempts is less than 61 seconds.
>>>
>>> The second one says: remove this job if it's accumulated more than an
>>> hour of remote run time but has less than 61 seconds of remote sys+user
>>> CPU time.
>>>
>>> They're slightly different but mostly what you want.
>>>
>>> If you wanted these settings to apply to all jobs submitted to a
>>> scheduler you could add:
>>>
>>> SYSTEM_PERIODIC_REMOVE = <expression>
>>>
>>> To the condor_config.local for the scheduler machine and reconfigure the
>>> scheduler. Then all jobs submitted to that scheduler would be subject to
>>> this removal expression.
>>>
>>> - Ian
>>>
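
The difference between the two periodic_remove forms quoted above can be illustrated with a small sketch. These are plain-Python stand-ins for the ClassAd expressions, not anything Condor evaluates; the dict keys mirror the job attributes named in the thread, and the sample values are made up for illustration.

```python
# Illustrative only: Python stand-ins for the two ClassAd expressions.

RUNNING = 2  # JobStatus value the quoted expressions test for


def stalled_current_run(job, now):
    """First form: > 1 hour in the current running status, < 61 s total CPU."""
    return (
        job["JobStatus"] == RUNNING
        and now - job["EnteredCurrentStatus"] > 3600
        and job["RemoteSysCpu"] + job["RemoteUserCpu"] < 61
    )


def stalled_cumulative(job):
    """Second form: > 1 hour accumulated wall-clock time, < 61 s total CPU."""
    return (
        job["RemoteWallClockTime"] > 3600
        and job["RemoteSysCpu"] + job["RemoteUserCpu"] < 61
    )


now = 100_000  # arbitrary timestamp in seconds
# Hypothetical job: restarted 10 minutes ago after 2 hours of earlier run time.
restarted = {
    "JobStatus": RUNNING,
    "EnteredCurrentStatus": now - 600,
    "RemoteWallClockTime": 7200,
    "RemoteSysCpu": 5,
    "RemoteUserCpu": 20,
}

print(stalled_current_run(restarted, now))  # False: only 10 minutes in this run
print(stalled_cumulative(restarted))        # True: 2 hours accumulated overall
```

With these made-up values, a job that restarted recently is spared by the first form but matched by the second, which is the "slightly different" behaviour described above.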