Re: [HTCondor-users] jobs surviving periodic_hold condition
- Date: Mon, 22 Aug 2022 13:32:27 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] jobs surviving periodic_hold condition
Hi Stefano,
Hope all is well, and perhaps I will see you in Oct!
Re the below, I had an idea about what may be happening:
Currently, the rounding of DiskUsage (and ResidentSetSize) is performed only by the condor_schedd. When DiskUsage is updated in the schedd's copy of the job classad, DiskUsage is (by default) rounded up and the non-rounded value is copied into DiskUsage_RAW. So using DiskUsage_RAW in your SYSTEM_PERIODIC_HOLD expression is fine unless the job is actually in the Running state. The reason is that these job policy expressions are evaluated by the condor_shadow while the job is running, and only by the schedd while the job is idle. The job classad in the shadow will not have DiskUsage_RAW updated while the job is running, which likely explains the behavior you are observing below.
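As a quick sanity check (just a sketch; note that condor_q always reports the schedd's copy of the ad, not the one the shadow is evaluating), you can compare the rounded and raw values the schedd has for your running jobs with something like:

  condor_q -constraint 'JobStatus == 2' -af:j JobStatus DiskUsage DiskUsage_RAW

You should see DiskUsage_RAW moving in the schedd's copy even while the shadow's copy of the job ad stays stale.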
I am thinking it may be good for us to modify the condor_shadow source code so that it also updates the _RAW values, to avoid the problem below.
In the meantime, assuming my guess about the problem is correct, here are some ideas on how you could work around the issue:
1. If you control your execution points (EPs), starting with
HTCondor v9.11.0, you could have the condor_startd do the checking
for DiskUsage by adding the following into the config of your EPs:
use policy: HOLD_IF_DISK_EXCEEDED
2. Instead of using "DiskUsage_RAW" in your SYSTEM_PERIODIC_HOLD expression, use something like "jobstatus == 2 ? DiskUsage : DiskUsage_RAW" (see the sketch after this list).
3. You could just go with "DiskUsage" in your expression, which has the (small) downside that the job might go on hold if the request_disk is very close to the actual usage.
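For example (just a sketch, not tested, reusing the TooMuchDisk definition from your config below), option 2 could look like:

  TooMuchDisk = ((JobStatus == 2 ? DiskUsage : DiskUsage_RAW) > 35 * (CpusProvisioned ?: RequestCpus) * 1024000)

with the rest of your SYSTEM_PERIODIC_HOLD left as posted.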
Hope the above helps,
Todd
On 8/22/2022 4:57 AM, Stefano Dal Pra wrote:
Hello,
condor 9.0.13 here.
We observe running jobs that should have been put on hold by the
schedd for using too much disk space.
The SYSTEM_PERIODIC_HOLD clause is:

SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || $(SecondStart) || $(TooMuchDisk) || $(TooMuchRSS) || $(TooMuchTime)

And the conditions are:

SecondStart = (NumJobStarts == 1 && JobStatus == 1)
TooMuchDisk = (DiskUsage_raw > 35 * (CpusProvisioned ?: RequestCpus) * 1024000)
TooMuchRSS  = (ResidentSetSize_RAW > 40 * (CpusProvisioned ?: RequestCpus) * 1e6)
TooMuchTime = (jobstatus == 2 && (time() - JobStartDate > 86400 * 7))
This usually works, but at times there are jobs that survive after going over the TooMuchDisk condition:
[root@ce03-htc ~]# condor_q -all -cons 'jobstatus == 2 && DiskUsage_RAW/1e6 > 40 * CpusProvisioned' -af:j owner scheddhostname CpusProvisioned 'split(remotehost ?: lastremotehost,".@")[1]' 'DiskUsage_RAW/1e6' 'ImageSize_RAW/1e6' 'interval(time()-jobstartdate)' | sort -n -k 6
7968490.0 pilatlas003 ce03-htc 1 wn-204-13-09-06-a 40.236304 8.760128 1+16:42:51
7968484.0 pilatlas003 ce03-htc 1 wn-204-11-05-04-a 46.117129 8.7592 1+16:42:52
7968498.0 pilatlas003 ce03-htc 1 wn-205-13-01-07-a 47.892963 7.084336 1+16:42:51
7979553.0 pilatlas003 ce03-htc 1 cn-313-06-02 76.593148 8.441587999999999 2:15:05
7979600.0 pilatlas003 ce03-htc 1 cn-313-06-05 76.431658 6.094996 2:03:12
7979204.0 pilatlas003 ce03-htc 1 wn-200-11-11-01-a 98.61471400000001 4.884384 3:32:57
7979205.0 pilatlas003 ce03-htc 1 cn-313-06-08 150.625147 33.13694 3:32:19
Here job 7979205.0 is using 150.6 GB of disk space.
I verified that after a condor_restart the shadows for these jobs detect the condition and put them on hold:
[root@ce02-htc ~]# systemctl restart condor && tail -f /var/log/condor/SchedLog | egrep '8039630.0|8063116.0|8064391.0|8064828.0|8066107.0'
08/22/22 11:17:20 (pid:3385936) Starting add_shadow_birthdate(8039630.0)
[...]
08/22/22 11:18:22 (pid:3385936) Shadow pid 3391245 for job 8039630.0 exited with status 112
08/22/22 11:18:22 (pid:3385936) Putting job 8039630.0 on hold
[root@ce02-htc ~]# condor_q 8039630.0 -af holdreason
pilatlas002, TooMuchDisk: 35GB/core
This makes me think that the shadow process might somehow fail to detect a condition once, and then keep failing on all subsequent attempts.
I think TooMuchDisk is not the only check affected; this can also happen with others (TooMuchRSS, for example).
Is there some way to force a "PERIODIC_HOLD recheck" for a particular job, or any other suggested check?
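(For now, I suppose we could hold the offending jobs by hand from the schedd, e.g. something along the lines of the untested sketch below, which just reuses the TooMuchDisk limit above, but an automatic recheck would be preferable.)

  condor_hold -constraint 'jobstatus == 2 && DiskUsage_RAW > 35 * (CpusProvisioned ?: RequestCpus) * 1024000' -reason 'TooMuchDisk: 35GB/core (manual)'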
Thanks
Stefano