Hello all,
I have a test cluster of 3 VMs (CentOS Linux release 7.6.1810): one scheduler (ARC-CE), one manager and one worker.
$CondorVersion: 8.6.12 Jul 31 2018 BuildID: 446077
I have defined SYSTEM_PERIODIC_REMOVE and SYSTEM_PERIODIC_REMOVE_REASON, and set MAX_TRANSFER_OUTPUT_MB = 50 (see below).
When I run a very simple job (dd from /dev/zero, writing 100M), the job is put on hold as soon as the output transfer exceeds 50M.
I understand that once “JobStatus == 5 && ( CurrentTime - EnteredCurrentStatus ) > 10 * 60” becomes true (i.e. after 10 minutes on hold), the job is removed.
But the final error returned to the user is then usually the following (I say usually because I have sometimes seen something else for the same test):
State: Failed
Job Error: LRMS error: (-1) RemoveReason: The system macro SYSTEM_PERIODIC_REMOVE expression '( RemoteWallClockTime > 80 * 60 * 60 ) || ( RemoteSysCpu + RemoteUserCpu > 10 * 60 * 60 ) || ( ( JobStatus == 5 && ( CurrentTime - EnteredCurrentStatus ) > 10 * 60 ) ) || ( ResidentSetSize_RAW > 1000 * RequestMemory ) || ( JobRunCount > 10 )' evaluated to TRUE
I’d like Condor to return the actual SYSTEM_PERIODIC_REMOVE_REASON rather than “SYSTEM_PERIODIC_REMOVE evaluated to TRUE”.
How can I set that up?
Many thanks
Sophie
<<
## Time limits
RemoveDefaultJobWallTime = ( RemoteWallClockTime > <%= @max_walltime %> )
RemoveDefaultJobCpuTime = ( RemoteSysCpu + RemoteUserCpu > <%= @max_cputime %> )
## Memory usage limit
RemoveMemoryUsage = ( ResidentSetSize_RAW > <%= @memory_factor %>*RequestMemory )
## Held jobs - don't want them to stay in the system forever
RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 10 * 60) )
## Make sure jobs don't start running too many times
RemoveMultipleRunJobs = ( JobRunCount > 10 )
## Add test output file too big
MAX_TRANSFER_OUTPUT_MB = 50
# move this to condor_config.local, test
### Removal of jobs exceeding resource limits
SYSTEM_PERIODIC_REMOVE = $(RemoveDefaultJobWallTime) || \
$(RemoveDefaultJobCpuTime) || \
$(RemoveHeldJobs) || \
$(RemoveMemoryUsage) || \
$(RemoveMultipleRunJobs)
##
##
## Add To fix SYSTEM PERIODIC REMOVE
SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
ifThenElse(( $(RemoveDefaultJobWallTime) ) , "** RemoteWallClockTime over limit " ,\
ifThenElse(( $(RemoveDefaultJobCpuTime) ) , "** RemoveDefaultJobCpuTime over limit " ,\
ifThenElse(( $(RemoveHeldJobs) ) , "** Held job status over time limit" ,\
ifThenElse(( $(RemoveMemoryUsage) ) , "** Memory usage over limit" ,\
ifThenElse(( $(RemoveMultipleRunJobs) ) , "** Job run count over limit" ,\
ifThenElse((JobStatus == 5 && LastHoldReasonCode == 33 && HoldReasonSubCode ==0), "Output file size > MAX_TRANSFER_OUTPUT_MB" , \
"** other undefined reason") ))))), ".")
>>
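One thing to note about the expression above: ifThenElse returns the first branch whose test is true, and a job held for exceeding MAX_TRANSFER_OUTPUT_MB also satisfies $(RemoveHeldJobs) after 10 minutes, so the output-size branch can never be reached where it currently sits. Below is an untested sketch with that test moved to the front. It assumes that HoldReasonCode (rather than LastHoldReasonCode) is the attribute the job ad carries while the job is still held, and it uses =?= so that a missing attribute gives FALSE instead of UNDEFINED. Whether the 8.6.12 schedd then reports this string as RemoveReason is also an assumption.
<<
## Sketch only: same tests as above, output-size check evaluated first
SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
ifThenElse((JobStatus == 5 && HoldReasonCode =?= 33 && HoldReasonSubCode =?= 0), "Output file size > MAX_TRANSFER_OUTPUT_MB", \
ifThenElse(( $(RemoveDefaultJobWallTime) ), "** RemoteWallClockTime over limit", \
ifThenElse(( $(RemoveDefaultJobCpuTime) ), "** RemoveDefaultJobCpuTime over limit", \
ifThenElse(( $(RemoveHeldJobs) ), "** Held job status over time limit", \
ifThenElse(( $(RemoveMemoryUsage) ), "** Memory usage over limit", \
ifThenElse(( $(RemoveMultipleRunJobs) ), "** Job run count over limit", \
"** other undefined reason")))))), ".")
>>
After a condor_reconfig, condor_config_val SYSTEM_PERIODIC_REMOVE_REASON on the scheduler shows the expanded expression, which is a quick way to check that the quotes and parentheses are balanced.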
Sophie Ferry
CEA Saclay
91191 Gif-Sur-Yvette
DRF/IRFU/DEDIP/LIS
GRIF-IRFU
Bat.141, p.023B
+33(0)1 69 08 76 45