Hi,
On 27/08/21 06:20, David Cohen wrote:
Hooray!!
It's working now and jobs running over time are evicted.
Now to my next project: holding jobs that, after 30 minutes of running,
still don't use more than 10% of the requested memory:
WastingMemory = (JobStatus == 2 && (time() - JobCurrentStartExecutingDate) > 1800) && \
                (RequestMemory > 8192) && (ResidentSetSize/1024 < RequestMemory/10)
I believe that thread gives me all the tools needed to
manage that one.
Experts here might want to confirm: I think that some job ClassAd attributes
(such as ResidentSetSize) are only updated every 15 minutes or so.
If that is true, this policy could put a job on hold now based on a value
measured up to 15 minutes earlier.
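(For anyone who wants to check the interval on their own pool: if I remember
correctly, the knob involved is SHADOW_QUEUE_UPDATE_INTERVAL, 900 seconds by
default, so something like this should show it.)
# Run on the submit/schedd host: how often the shadow pushes attributes
# such as ResidentSetSize back into the job ad in the queue.
condor_config_val SHADOW_QUEUE_UPDATE_INTERVAL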
A simple remedy would be to wait 2700 seconds instead of 1800.
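In configuration terms, a sketch (untested, assuming the plain
SYSTEM_PERIODIC_HOLD mechanism; it would have to be merged with whatever
MEMORY_EXCEEDED / TIME_EXCEEDED policy is already in place) could look like:
# 2700 s (45 minutes) leaves room for the update lag discussed above.
WastingMemory = (JobStatus == 2 && (time() - JobCurrentStartExecutingDate) > 2700) && \
                (RequestMemory > 8192) && (ResidentSetSize/1024 < RequestMemory/10)
SYSTEM_PERIODIC_HOLD = $(WastingMemory)
# A matching SYSTEM_PERIODIC_HOLD_REASON string makes the hold
# self-explaining in condor_q -hold output.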
When considering a hold policy, I use condor_q to check for candidate
jobs and verify that no "innocent" jobs are involved.
Running something like this, or a variant of it:
condor_q -glob -all -cons '(JobStatus == 2 && (time() - JobCurrentStartExecutingDate) > 1800)' \
    -af:j owner '(RequestMemory > 8192)' '(ResidentSetSize < RequestMemory * 102.4)'
could help confirm that the right jobs are affected before enforcing the rule.
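For a closer look, a variant along these lines (untested), which prints the
raw numbers rather than the two boolean tests, can make borderline jobs
easier to spot by eye:
# RequestMemory is in MB, ResidentSetSize in KB.
condor_q -global -allusers \
    -constraint 'JobStatus == 2 && (time() - JobCurrentStartExecutingDate) > 1800 && RequestMemory > 8192' \
    -af:j Owner RequestMemory ResidentSetSize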
Stefano
On 26/08/21 15:12, Stefano Dal Pra wrote:
> [SNIP]
>>
>> That works perfectly for MEMORY_EXCEEDED but totally ignored for
>> TIME_EXCEEDED.
[SNIP]
I stumbled on a job that had somehow survived and been running for 21 days,
so I forged a clause to get it held and to verify that it works:
TooMuchTime = (JobStatus == 2 && (time() - JobStartDate > 86400 * 7))
This clause works, but it only takes effect after a condor restart:
condor_reconfig is not enough.
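By the way, to check whether the running schedd has actually picked up a new
clause, and to restart only the schedd rather than the whole host, something
like this should do (if I recall the options correctly, and assuming the
clause ends up in SYSTEM_PERIODIC_HOLD):
# Ask the running schedd which expression it currently has loaded.
condor_config_val -schedd SYSTEM_PERIODIC_HOLD
# If the new clause is not there yet, restart just the schedd.
condor_restart -daemon schedd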
Stefano