[HTCondor-users] jobs surviving periodic_hold condition
Date: Mon, 22 Aug 2022 11:57:20 +0200
From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
Subject: [HTCondor-users] jobs surviving periodic_hold condition
Hello,
condor 9.0.13 here.
We observe running jobs that should have been put on hold by the schedd for using too much disk space.
The SYSTEM_PERIODIC_HOLD clause is:
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || $(SecondStart) || $(TooMuchDisk) || $(TooMuchRSS) || $(TooMuchTime)
And the conditions are:
SecondStart = (NumJobStarts == 1 && JobStatus == 1)
TooMuchDisk = (DiskUsage_raw > 35 * (CpusProvisioned ?: RequestCpus) * 1024000)
TooMuchRSS = (ResidentSetSize_RAW > 40 * (CpusProvisioned ?: RequestCpus) * 1e6 )
TooMuchTime = (jobstatus == 2 && (time() - JobStartDate > 86400 * 7))
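To make the TooMuchDisk arithmetic explicit, here is a minimal Python sketch of the same comparison. It assumes DiskUsage_RAW is reported in KiB, and the function names are only illustrative; it mirrors the expression above and is not how the schedd evaluates it.

```python
# Sketch of the TooMuchDisk test; assumes DiskUsage_RAW is expressed in KiB.
KIB_PER_CORE = 35 * 1024000  # per-core limit taken from the TooMuchDisk expression


def threshold_kib(cpus):
    """Disk limit in KiB for a job holding `cpus` cores."""
    return KIB_PER_CORE * cpus


def too_much_disk(disk_usage_raw_kib, cpus_provisioned=None, request_cpus=1):
    """Mirror of: DiskUsage_raw > 35 * (CpusProvisioned ?: RequestCpus) * 1024000."""
    # ClassAd `a ?: b` yields a when a is defined, otherwise b.
    cpus = cpus_provisioned if cpus_provisioned is not None else request_cpus
    return disk_usage_raw_kib > threshold_kib(cpus)


if __name__ == "__main__":
    print(threshold_kib(1))               # limit for a single-core job, in KiB
    print(too_much_disk(150625147, 1))    # a single-core job using about 150.6e6 KiB
```

Under that assumption, a single-core job crosses the limit at 35,840,000 KiB.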
This usually works, but at times some jobs keep running after exceeding the TooMuchDisk condition:
[root@ce03-htc ~]# condor_q -all -cons 'jobstatus == 2 && DiskUsage_RAW/1e6 > 40 * CpusProvisioned' -af:j owner scheddhostname CpusProvisioned 'split(remotehost ?: lastremotehost,".@")[1]' 'DiskUsage_RAW/1e6' 'ImageSize_RAW/1e6' 'interval(time()-jobstartdate)' | sort -n -k 6
7968490.0 pilatlas003 ce03-htc 1 wn-204-13-09-06-a 40.236304 8.760128 1+16:42:51
7968484.0 pilatlas003 ce03-htc 1 wn-204-11-05-04-a 46.117129 8.7592 1+16:42:52
7968498.0 pilatlas003 ce03-htc 1 wn-205-13-01-07-a 47.892963 7.084336 1+16:42:51
7979553.0 pilatlas003 ce03-htc 1 cn-313-06-02 76.593148 8.441587999999999 2:15:05
7979600.0 pilatlas003 ce03-htc 1 cn-313-06-05 76.431658 6.094996 2:03:12
7979204.0 pilatlas003 ce03-htc 1 wn-200-11-11-01-a 98.61471400000001 4.884384 3:32:57
7979205.0 pilatlas003 ce03-htc 1 cn-313-06-08 150.625147 33.13694 3:32:19
Here job 7979205.0 is taking 150.6GB disk space.
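For reference, a listing like the one above can be screened against the per-core limit with a short script. This is only a sketch: it assumes the eight-column layout produced by the condor_q command shown above (job id, owner, schedd, CPUs, host, DiskUsage_RAW/1e6, ImageSize_RAW/1e6, run time), and the helper name `over_limit` is made up for illustration.

```python
# Per-core limit in the same units as the DiskUsage_RAW/1e6 column.
LIMIT_PER_CORE = 35 * 1024000 / 1e6

# Two rows copied from the condor_q output above.
SAMPLE = [
    "7968490.0 pilatlas003 ce03-htc 1 wn-204-13-09-06-a 40.236304 8.760128 1+16:42:51",
    "7979205.0 pilatlas003 ce03-htc 1 cn-313-06-08 150.625147 33.13694 3:32:19",
]


def over_limit(lines, limit_per_core=LIMIT_PER_CORE):
    """Return (job_id, disk) pairs above the limit, largest disk usage first."""
    flagged = []
    for line in lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or truncated rows
        job_id, cpus, disk = fields[0], int(fields[3]), float(fields[5])
        if disk > limit_per_core * cpus:
            flagged.append((job_id, disk))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for job_id, disk in over_limit(SAMPLE):
        print(job_id, disk)
```

Sorting by disk usage puts the worst offender first, which matches the intent of the `sort -n -k 6` in the original command.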
I verified that after a condor restart the shadows for these jobs detect the condition and the jobs are put on hold:
[root@ce02-htc ~]# systemctl restart condor && tail -f /var/log/condor/SchedLog | egrep '8039630.0|8063116.0|8064391.0|8064828.0|8066107.0'
08/22/22 11:17:20 (pid:3385936) Starting add_shadow_birthdate(8039630.0)
[...]
08/22/22 11:18:22 (pid:3385936) Shadow pid 3391245 for job 8039630.0 exited with status 112
08/22/22 11:18:22 (pid:3385936) Putting job 8039630.0 on hold
[root@ce02-htc ~]# condor_q 8039630.0 -af holdreason
pilatlas002, TooMuchDisk: 35GB/core
This makes me think that the shadow process might somehow fail to detect the condition once, and then keep missing it on every later attempt.
I think that TooMuchDisk is not the only condition affected, and that this can also happen with other checks (TooMuchRSS, for example).
Is there some way to force a "PERIODIC_HOLD recheck" for a particular job, or any other suggested check?
Thanks
Stefano
Follow-Ups: Re: [HTCondor-users] jobs surviving periodic_hold condition (From: Todd Tannenbaum)