Re: [HTCondor-users] stdout/stderr for evicted jobs
- Date: Wed, 27 Jul 2022 16:43:18 +0200
- From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] stdout/stderr for evicted jobs
Hello Maarten,
On 27/07/22 15:42, Maarten.Litmaath@xxxxxxx wrote:
Hi all,
1. How can one make a grid job fail on its very first eviction?
I do not want HTCondor to try and "rescue" such jobs...
Here is what we set at CNAF:
- in the batch-side schedd config of the CE:
# a few boolean conditions for jobs to evict (put on hold)
SecondStart = (NumJobStarts == 1 && JobStatus == 1)
TooMuchDisk = (DiskUsage_raw > 35 * (CpusProvisioned ?: RequestCpus) * 1024000)
TooMuchRSS = (ResidentSetSize_RAW > 40 * (CpusProvisioned ?: RequestCpus) * 1e6)
TooMuchTime = (JobStatus == 2 && (time() - JobStartDate > 86400 * 7))
# put on hold any job that meets at least one of the above conditions
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || $(SecondStart) || $(TooMuchDisk) || $(TooMuchRSS) || $(TooMuchTime)
# cumbersome way to log which hold reason applied, plus job owner.
SYSTEM_PERIODIC_HOLD_REASON = strcat(Owner,{"",", Second start not allowed",", TooMuchDisk: 35GB/core", ", TooMuchRSS: 20GB/core","Job runtime > 20 days"}[max({int($(SecondStart)),int($(TooMuchDisk))*2,int($(TooMuchRSS))*3,int($(TooMuchTime))*4})])
#purge from the queue jobs on hold for more than 2 hours.
SYSTEM_PERIODIC_REMOVE = ( $(SYSTEM_PERIODIC_REMOVE:False) || (JobStatus == 5 && (CurrentTime - EnteredCurrentStatus > 3600 * 2)) )
SYSTEM_PERIODIC_REMOVE_REASON = strcat("local job removed by SYSTEM_PERIODIC_REMOVE due to ", ifThenElse((JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*2), "being in the hold state for 2 hours.","Unclear reason, see ce.conf"))
####
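To make the "cumbersome" hold-reason trick easier to follow, here is a rough Python model of how the SYSTEM_PERIODIC_HOLD_REASON expression above picks its message. This is only a sketch and not HTCondor code: the function name hold_reason and the owner "alice" are made up for illustration, and the strings are copied as they appear in the config. It assumes each condition evaluates to a plain true/false, so that int() yields 0 or 1.

```python
# Illustrative model only: plain Python, not HTCondor code.
# The list mirrors the string list in SYSTEM_PERIODIC_HOLD_REASON.
REASONS = [
    "",                            # index 0: no condition matched
    ", Second start not allowed",  # index 1: SecondStart
    ", TooMuchDisk: 35GB/core",    # index 2: TooMuchDisk
    ", TooMuchRSS: 20GB/core",     # index 3: TooMuchRSS
    "Job runtime > 20 days",       # index 4: TooMuchTime
]


def hold_reason(owner, second_start, too_much_disk, too_much_rss, too_much_time):
    """Mimic strcat(Owner, list[max({...})]) with hypothetical inputs."""
    # Each flag becomes 0 or 1 and is scaled by its position in the list;
    # max() then selects the highest-numbered condition that is true.
    index = max(
        int(second_start) * 1,
        int(too_much_disk) * 2,
        int(too_much_rss) * 3,
        int(too_much_time) * 4,
    )
    return owner + REASONS[index]


print(hold_reason("alice", False, True, False, False))
# alice, TooMuchDisk: 35GB/core
print(hold_reason("alice", True, True, True, False))
# alice, TooMuchRSS: 20GB/core
```

One consequence of using max() is that when several conditions are true at once, only the one with the highest index is reported.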
In the condor-ce jobrouter we also set:
JOB_ROUTER_USE_DEPRECATED_ROUTER_ENTRIES = False
# Jobs can only start once.
JOB_ROUTER_TRANSFORM_PeriodicHold @=jrt
 SET Periodic_Hold = (NumJobStarts >= 1 && JobStatus == 1) || NumJobStarts > 1
@jrt
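For reference, the Periodic_Hold expression in the transform above can be read as a small truth table. The following Python sketch (not HTCondor code; the function name periodic_hold is hypothetical) evaluates the same boolean logic, using the standard JobStatus codes 1 = Idle and 2 = Running. It assumes both attributes are defined; ClassAd handling of undefined values is not modelled here.

```python
# Illustrative model only: plain Python, not a ClassAd evaluator.
IDLE, RUNNING = 1, 2  # JobStatus codes used by HTCondor


def periodic_hold(num_job_starts, job_status):
    """Same boolean logic as the Periodic_Hold transform above."""
    return (num_job_starts >= 1 and job_status == IDLE) or num_job_starts > 1


# Walk through the typical lifecycle of a job.
for starts, status, label in [
    (0, IDLE, "never started, waiting in the queue"),
    (1, RUNNING, "running for the first time"),
    (1, IDLE, "back to idle after its first start"),
    (2, RUNNING, "started a second time"),
]:
    print(f"{label}: hold={periodic_hold(starts, status)}")
```

So a job is left alone during its first run, and is held as soon as it returns to idle after that run or is seen with more than one start.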
Stefano
[SNIP]