
[HTCondor-users] Many jobs failure for Alice VO



In the last few weeks I have seen a lot of ALICE VO (and also LHCb) jobs failing with the following message:

The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0) || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)' evaluated to TRUE
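For reference, the expression in that message can be read as the following Python sketch. It is only an illustration of the logic, not HTCondor code: the function name is hypothetical, `None` stands in for an undefined `ResidentSetSize`, and `JobMemoryLimit` is assumed to be defined in the job ad. `JobStatus == 1` means the job is idle.

```python
from typing import Optional


def periodic_remove(job_status: int,
                    num_job_starts: int,
                    resident_set_size: Optional[int],
                    job_memory_limit: int) -> bool:
    """Hypothetical mirror of the PeriodicRemove expression in the message."""
    # First clause: the job is idle (JobStatus == 1) although it has
    # already been started at least once, i.e. it went back to the queue.
    restarted_and_idle = job_status == 1 and num_job_starts > 0

    # Second clause: an undefined ResidentSetSize is treated as 0,
    # then compared against the job's memory limit.
    rss = resident_set_size if resident_set_size is not None else 0
    over_memory = rss > job_memory_limit

    return restarted_and_idle or over_memory
```

Read this way, the expression fires either when a job that has already started returns to the idle state, or when its resident set size exceeds `JobMemoryLimit`.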

I have set the remove reason as shown in an older email on this list (a few months ago):

# SYSTEM_PERIODIC_REMOVE with reasons
########################

# remove jobs running longer than 7 days
RemoveReadyJobs = (( JobStatus == 2 ) && ( ( CurrentTime - EnteredCurrentStatus ) > 7 * 24 * 3600 ))

# remove jobs on hold for longer than 7 days
RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 7 * 24 * 3600) )

# remove jobs with too many job starts or shadow starts
RemoveMultipleRunJobs = ( NumJobStarts >= 10 )

# remove jobs idle for too long
MaxJobIdleTime = 7 * 24 * 3600
RemoveIdleJobs = (( JobStatus == 1 ) && ( ( CurrentTime - EnteredCurrentStatus ) > MaxJobIdleTime ))

# do it
SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs)           || \
                         $(RemoveMultipleRunJobs)    || \
                         $(RemoveIdleJobs)           || \
                         $(RemoveReadyJobs)

# set reason for remove
SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
 ifThenElse($(RemoveReadyJobs), "runtime longer than reserved", \
 ifThenElse($(RemoveHeldJobs), "being in hold state for 7 days", \
 ifThenElse($(RemoveMultipleRunJobs), "more than 10 failed jobstarts", \
 "being in idle state for 7 days"))),".")
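The nested ifThenElse above picks the first condition that matches, in the order written. A minimal Python sketch of that selection order follows; the function name and the `seconds_in_status` argument (standing in for `CurrentTime - EnteredCurrentStatus`) are hypothetical, and the return value is the name of the macro whose reason text would be used.

```python
WEEK = 7 * 24 * 3600  # same 7-day threshold as in the configuration


def remove_reason_branch(job_status: int,
                         seconds_in_status: int,
                         num_job_starts: int) -> str:
    """Hypothetical mirror of the nested ifThenElse reason selection."""
    # Checked first: running (JobStatus == 2) for longer than 7 days.
    if job_status == 2 and seconds_in_status > WEEK:
        return "RemoveReadyJobs"
    # Checked second: held (JobStatus == 5) for longer than 7 days.
    if job_status == 5 and seconds_in_status > WEEK:
        return "RemoveHeldJobs"
    # Checked third: 10 or more job starts, regardless of status.
    if num_job_starts >= 10:
        return "RemoveMultipleRunJobs"
    # Fallback branch: the idle reason is used when nothing above matched.
    return "RemoveIdleJobs"
```

Because the last branch is a plain fallback, the idle reason is reported whenever none of the first three conditions holds.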

I have reconfigured the master and schedd daemons, but the problem persists.

Do you have any idea how to fix this?

Thank you,
Mihai