[HTCondor-users] jobs surviving periodic_hold condition
- Date: Mon, 22 Aug 2022 11:57:20 +0200
- From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
- Subject: [HTCondor-users] jobs surviving periodic_hold condition
Hello,
condor 9.0.13 here.
We observe running jobs that should have been put on hold by the
schedd for using too much disk space.
The SYSTEM_PERIODIC_HOLD clause is:
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || $(SecondStart) || $(TooMuchDisk) || $(TooMuchRSS) || $(TooMuchTime)
And the conditions are:
SecondStart = (NumJobStarts == 1 && JobStatus == 1)
TooMuchDisk = (DiskUsage_raw > 35 * (CpusProvisioned ?: RequestCpus) * 1024000)
TooMuchRSS = (ResidentSetSize_RAW > 40 * (CpusProvisioned ?: RequestCpus) * 1e6)
TooMuchTime = (jobstatus == 2 && (time() - JobStartDate > 86400 * 7))
This usually works, but at times some jobs keep running after
exceeding the TooMuchDisk condition:
[root@ce03-htc ~]# condor_q -all -cons 'jobstatus == 2 && DiskUsage_RAW/1e6 > 40 * CpusProvisioned' \
    -af:j owner scheddhostname CpusProvisioned \
    'split(remotehost ?: lastremotehost,".@")[1]' \
    'DiskUsage_RAW/1e6' 'ImageSize_RAW/1e6' \
    'interval(time()-jobstartdate)' | sort -n -k 6
7968490.0 pilatlas003 ce03-htc 1 wn-204-13-09-06-a 40.236304         8.760128          1+16:42:51
7968484.0 pilatlas003 ce03-htc 1 wn-204-11-05-04-a 46.117129         8.7592            1+16:42:52
7968498.0 pilatlas003 ce03-htc 1 wn-205-13-01-07-a 47.892963         7.084336          1+16:42:51
7979553.0 pilatlas003 ce03-htc 1 cn-313-06-02      76.593148         8.441587999999999 2:15:05
7979600.0 pilatlas003 ce03-htc 1 cn-313-06-05      76.431658         6.094996          2:03:12
7979204.0 pilatlas003 ce03-htc 1 wn-200-11-11-01-a 98.61471400000001 4.884384          3:32:57
7979205.0 pilatlas003 ce03-htc 1 cn-313-06-08      150.625147        33.13694          3:32:19
Here job 7979205.0 is taking 150.6 GB of disk space.
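To make the arithmetic explicit, here is a minimal Python sketch of the TooMuchDisk test applied to the jobs listed above. It assumes DiskUsage_RAW is reported in KiB and recovers the raw values by multiplying the DiskUsage_RAW/1e6 column by 1e6; the function name too_much_disk and the jobs dictionary are illustrative only, not part of HTCondor.

```python
# Per-core limit taken from the TooMuchDisk expression: 35 * 1024000,
# in the same units as DiskUsage_RAW (assumed to be KiB).
PER_CORE_LIMIT = 35 * 1024000

def too_much_disk(disk_usage_raw, cpus_provisioned=None, request_cpus=1):
    """Hypothetical helper mirroring the TooMuchDisk expression."""
    # Mirrors (CpusProvisioned ?: RequestCpus): use RequestCpus only
    # when CpusProvisioned is undefined.
    cpus = cpus_provisioned if cpus_provisioned is not None else request_cpus
    return disk_usage_raw > PER_CORE_LIMIT * cpus

# Raw values reconstructed from the DiskUsage_RAW/1e6 column of the listing.
jobs = {
    "7968490.0": 40236304,
    "7968484.0": 46117129,
    "7968498.0": 47892963,
    "7979553.0": 76593148,
    "7979600.0": 76431658,
    "7979204.0": 98614714,
    "7979205.0": 150625147,
}

for job_id, usage in jobs.items():
    # Every listed job has CpusProvisioned == 1.
    print(job_id, usage, too_much_disk(usage, cpus_provisioned=1))
```

Under these assumptions all seven jobs exceed the single-core limit of 35840000, so each of them should satisfy TooMuchDisk.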
I verified that after a condor restart the shadows for these jobs
detect the condition and the jobs are put on hold:
[root@ce02-htc ~]# systemctl restart condor && tail -f /var/log/condor/SchedLog | egrep '8039630.0|8063116.0|8064391.0|8064828.0|8066107.0'
08/22/22 11:17:20 (pid:3385936) Starting add_shadow_birthdate(8039630.0)
[...]
08/22/22 11:18:22 (pid:3385936) Shadow pid 3391245 for job 8039630.0 exited with status 112
08/22/22 11:18:22 (pid:3385936) Putting job 8039630.0 on hold
[root@ce02-htc ~]# condor_q 8039630.0 -af holdreason
pilatlas002, TooMuchDisk: 35GB/core
This makes me think that the shadow process may somehow fail to detect
a condition once and then keep missing it on every later attempt.
I think TooMuchDisk is not the only condition affected, and that this
can also happen with other checks (TooMuchRSS, for example).
Is there some way to force a "PERIODIC_HOLD recheck" for a particular
job, or any other suggested check?
Thanks
Stefano