Re: [HTCondor-users] SYSTEM_PERIODIC_REMOVE /vs/ SYSTEM_PERIODIC_REMOVE_REASON

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Hi Sophie,

The catch here is that there is no system_periodic_remove_reason, only the hold_reason.

I would recommend changing your periodic remove to a system_periodic_hold, and then setting up the periodic remove to deal only with removing held jobs older than a certain age.
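
A minimal sketch of that arrangement, reusing the macro names from the configuration quoted further down in this message. It assumes SYSTEM_PERIODIC_HOLD and SYSTEM_PERIODIC_HOLD_REASON are available in your version; the message strings are only placeholders, and it has not been tested on your cluster:

```
## Sketch only: put jobs that exceed a limit on hold, with a readable reason
SYSTEM_PERIODIC_HOLD = $(RemoveDefaultJobWallTime) || \
                       $(RemoveDefaultJobCpuTime)  || \
                       $(RemoveMemoryUsage)        || \
                       $(RemoveMultipleRunJobs)

SYSTEM_PERIODIC_HOLD_REASON = strcat("Job held by SYSTEM_PERIODIC_HOLD due to ", \
    ifThenElse($(RemoveDefaultJobWallTime), "wall clock time over limit", \
    ifThenElse($(RemoveDefaultJobCpuTime),  "CPU time over limit", \
    ifThenElse($(RemoveMemoryUsage),        "memory usage over limit", \
    ifThenElse($(RemoveMultipleRunJobs),    "too many job starts", \
               "an undefined reason")))), ".")

## The periodic remove then only cleans up jobs held for more than 10 minutes
SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs)
```

With this split, the hold reason is what the user sees first, and the removal expression no longer needs to explain why the job failed.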

When a job is held the user will get a notification of a periodic hold if the notification attribute is set to Always or Error.

Everything else looks good – nested ifThenElse() statements should be demonstrated in the manual, they’re quite handy aren’t they?

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of FERRY Sophie
Sent: Thursday, December 20, 2018 8:07 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [External] [HTCondor-users] SYSTEM_PERIODIC_REMOVE /vs/ SYSTEM_PERIODIC_REMOVE_REASON

Hello all,

I have a test cluster of 3 VMs (CentOS Linux release 7.6.1810): one scheduler (ARC-CE), one manager and one worker.

$CondorVersion: 8.6.12 Jul 31 2018 BuildID: 446077

I have defined SYSTEM_PERIODIC_REMOVE and SYSTEM_PERIODIC_REMOVE_REASON, and MAX_TRANSFER_OUTPUT_MB = 50 (see below).

When I run a very simple job (dd from /dev/zero for 100M), the job is put on hold as soon as the output transfer exceeds 50 MB.

I understand that after “JobStatus == 5 && ( CurrentTime - EnteredCurrentStatus ) > 10 * 60” (ie 10 min) the job is removed.

But then the final error returned to the user is usually the following (it has occasionally been something else for the same test):

State: Failed

Job Error: LRMS error: (-1) RemoveReason: The system macro SYSTEM_PERIODIC_REMOVE expression '( RemoteWallClockTime > 80 * 60 * 60 ) || ( RemoteSysCpu + RemoteUserCpu > 10 * 60 * 60 ) || ( ( JobStatus == 5 && ( CurrentTime - EnteredCurrentStatus ) > 10 * 60 ) ) || ( ResidentSetSize_RAW > 1000 * RequestMemory ) || ( JobRunCount > 10 )' evaluated to TRUE

=> I’d like condor to return the actual SYSTEM_PERIODIC_REMOVE_REASON rather than “SYSTEM_PERIODIC_REMOVE evaluated to TRUE”.

How can I define that?

Many thanks

Sophie

## Time limits
RemoveDefaultJobWallTime = ( RemoteWallClockTime > <%= @max_walltime %> )
RemoveDefaultJobCpuTime = ( RemoteSysCpu + RemoteUserCpu > <%= @max_cputime %> )

## Memory usage limit
RemoveMemoryUsage = ( ResidentSetSize_RAW > <%= @memory_factor %>*RequestMemory )

## Held jobs - don't want them to stay in the system forever
RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 10 * 60) )

## Make sure jobs don't start running too many times
RemoveMultipleRunJobs = ( JobRunCount > 10 )

## Add test output file too big
MAX_TRANSFER_OUTPUT_MB = 50

# move this to condor_config.local, test
### Removal of jobs exceeding resource limits
SYSTEM_PERIODIC_REMOVE = $(RemoveDefaultJobWallTime) || \
                         $(RemoveDefaultJobCpuTime) || \
                         $(RemoveHeldJobs)           || \
                         $(RemoveMemoryUsage)        || \
                         $(RemoveMultipleRunJobs)
##
##
## Add To fix SYSTEM PERIODIC REMOVE
SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
    ifThenElse(( $(RemoveDefaultJobWallTime) ) , "** RemoteWallClockTime over limit " ,\
    ifThenElse(( $(RemoveDefaultJobCpuTime) ) , "** RemoveDefaultJobCpuTime over limit " ,\
    ifThenElse(( $(RemoveHeldJobs)           ) , "** Held job status over time limit" ,\
    ifThenElse(( $(RemoveMemoryUsage)        ) , "** Memory usage over limit" ,\
    ifThenElse(( $(RemoveMultipleRunJobs)    ) , "** Job run over limit number of time" ,\
    ifThenElse((JobStatus == 5 && LastHoldReasonCode == 33 && HoldReasonSubCode ==0), "Output file size > MAX_TRANSFER_OUTPUT_MB" , \
    "** other undefined reason") ))))), ".")

Sophie Ferry

CEA Saclay

91191 Gif-Sur-Yvette

DRF/IRFU/DEDIP/LIS

GRIF-IRFU

Bat.141, p.023B

+33(0)1 69 08 76 45
