On 28. Mar 2023, at 14:17, Mihai Ciubancan <ciubancan@xxxxxxxx> wrote:
In the last few weeks I have seen a lot of Alice VO (and also LHCb) jobs
failing with the following message:
The job attribute PeriodicRemove expression '(JobStatus == 1 &&
NumJobStarts > 0) || ((ResidentSetSize =!= undefined ? ResidentSetSize
: 0) > JobMemoryLimit)' evaluated to TRUE
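For reference, the expression in that message can be mirrored in a few lines of Python to see which clause fires. This is only a sketch: the function name and the use of None for an undefined ResidentSetSize are assumptions made for illustration, and ClassAd "undefined" handling for the other attributes is not modelled.

```python
from typing import Optional


def periodic_remove(job_status: int,
                    num_job_starts: int,
                    resident_set_size: Optional[int],
                    job_memory_limit: int) -> bool:
    """Hypothetical Python mirror of the PeriodicRemove expression above."""
    # First clause: JobStatus == 1 (Idle) && NumJobStarts > 0,
    # i.e. the job is idle again after having started at least once.
    idle_after_start = job_status == 1 and num_job_starts > 0

    # Second clause: (ResidentSetSize =!= undefined ? ResidentSetSize : 0)
    #                > JobMemoryLimit
    # None stands in for an undefined ResidentSetSize (an assumption).
    rss = resident_set_size if resident_set_size is not None else 0
    over_memory = rss > job_memory_limit

    return idle_after_start or over_memory


if __name__ == "__main__":
    # Idle job that has already started once: first clause is true.
    print(periodic_remove(1, 1, None, 1000))
    # Running job below its memory limit: neither clause is true.
    print(periodic_remove(2, 1, 500, 1000))
```

In this sketch a job that returns to the idle state after at least one start satisfies the first clause regardless of its memory use.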
I have set the remove reason as I saw in an older email on the list
(a few months ago):
# SYSTEM_PERIODIC_REMOVE with reasons
########################
# remove jobs running longer than 7 days
RemoveReadyJobs = (( JobStatus == 2 ) && \
  ( ( CurrentTime - EnteredCurrentStatus ) > 7 * 24 * 3600 ))
# remove jobs on hold for longer than 7 days
RemoveHeldJobs = ( (JobStatus==5 && \
  (CurrentTime - EnteredCurrentStatus) > 7 * 24 * 3600) )
# remove jobs with too many job starts or shadow starts
RemoveMultipleRunJobs = ( NumJobStarts >= 10 )
# remove jobs idle for too long
MaxJobIdleTime = 7 * 24 * 3600
RemoveIdleJobs = (( JobStatus == 1 ) && \
  ( ( CurrentTime - EnteredCurrentStatus ) > MaxJobIdleTime ))
# do it
SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs) || \
$(RemoveMultipleRunJobs) || \
$(RemoveIdleJobs) || \
$(RemoveReadyJobs)
# set reason for remove
SYSTEM_PERIODIC_REMOVE_REASON = strcat( \
"Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
ifThenElse($(RemoveReadyJobs), "runtime longer than reserved", \
ifThenElse($(RemoveHeldJobs), "being in hold state for 7 days", \
ifThenElse($(RemoveMultipleRunJobs), "more than 10 failed jobstarts", \
"being in idle state for 10 days"))),".")
I have reconfigured the master and schedd daemons, but the problem
persists.
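To make the logic of the SYSTEM_PERIODIC_REMOVE macros above easier to check, here is a rough Python mirror of them. It is a sketch under assumptions: the function name and its parameters are hypothetical, seconds_in_status stands in for CurrentTime - EnteredCurrentStatus, and the reason strings are copied verbatim from the configuration (including "10 days", although MaxJobIdleTime is 7 days).

```python
from typing import Optional

WEEK = 7 * 24 * 3600  # 7 days in seconds, as used in the configuration


def system_remove_reason(job_status: int,
                         seconds_in_status: int,
                         num_job_starts: int) -> Optional[str]:
    """Hypothetical mirror of SYSTEM_PERIODIC_REMOVE and its reason string.

    Returns None when the policy would not remove the job.
    """
    remove_ready = job_status == 2 and seconds_in_status > WEEK  # RemoveReadyJobs
    remove_held = job_status == 5 and seconds_in_status > WEEK   # RemoveHeldJobs
    remove_multi = num_job_starts >= 10                          # RemoveMultipleRunJobs
    remove_idle = job_status == 1 and seconds_in_status > WEEK   # RemoveIdleJobs

    if not (remove_held or remove_multi or remove_idle or remove_ready):
        return None

    # Same precedence as the nested ifThenElse() calls.
    if remove_ready:
        cause = "runtime longer than reserved"
    elif remove_held:
        cause = "being in hold state for 7 days"
    elif remove_multi:
        cause = "more than 10 failed jobstarts"
    else:
        cause = "being in idle state for 10 days"
    return "Job removed by SYSTEM_PERIODIC_REMOVE due to " + cause + "."


if __name__ == "__main__":
    # Job idle for one minute after a single start.
    print(system_remove_reason(1, 60, 1))
    # Job running for just over 7 days.
    print(system_remove_reason(2, WEEK + 1, 1))
```

In this sketch a job that has been idle for only a short time after a single start returns None, so these macros on their own would not remove such a job.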
Do you have any idea how to fix this?
Thank you,
Mihai