
Re: [HTCondor-users] Many jobs failure for Alice VO



Hi Mihai,

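There are two kinds of periodic remove expressions: one set system-wide that applies to all jobs (`SYSTEM_PERIODIC_REMOVE`), and one carried by *each* job. The error message names the job attribute PeriodicRemove, which suggests the per-job expression is the one triggering, so changing the system-wide expression will not affect it.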

I would recommend checking whether the VOs' jobs have a `PERIODIC_REMOVE` attribute or similar (the error message calls it PeriodicRemove), then trying to find out where it is set.
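
To see which half of the expression in your error message is firing, a rough model may help. This is only a sketch in plain Python, not HTCondor code: the dict stands in for a job ClassAd, the function name `periodic_remove_reasons` is made up for illustration, and a missing attribute is treated as "clause not true", which simplifies ClassAd undefined semantics.

```python
# Plain-Python model of the per-job expression quoted in the error message:
#   (JobStatus == 1 && NumJobStarts > 0)
#   || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)
# Assumption: a dict stands in for the job ClassAd; this is not an HTCondor API.

IDLE = 1  # JobStatus value used for idle jobs in the quoted config


def periodic_remove_reasons(ad):
    """Return the clauses of the quoted expression that are true for this ad."""
    reasons = []

    # Clause 1: the job is idle although it has already been started once,
    # i.e. it fell back to idle after a start.
    if ad.get("JobStatus") == IDLE and ad.get("NumJobStarts", 0) > 0:
        reasons.append("idle again after having started at least once")

    # Clause 2: memory usage above the limit stored in the job itself.
    # An undefined ResidentSetSize counts as 0, as in the original expression.
    rss = ad.get("ResidentSetSize")
    rss = 0 if rss is None else rss
    limit = ad.get("JobMemoryLimit")
    if limit is not None and rss > limit:
        reasons.append("ResidentSetSize above JobMemoryLimit")

    return reasons


# Hypothetical example values, not taken from a real job.
restarted_job = {"JobStatus": 1, "NumJobStarts": 1,
                 "ResidentSetSize": 100, "JobMemoryLimit": 2000}
print(periodic_remove_reasons(restarted_job))
```

Under this reading, any job that goes back to idle after its first start is removed right away, regardless of what the system-wide expression says.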

Cheers,
Max

> On 28. Mar 2023, at 14:17, Mihai Ciubancan <ciubancan@xxxxxxxx> wrote:
> 
> In the last few weeks I have seen a lot of Alice VO (and also LHCb) jobs failing with the following message:
> 
> The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0) || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)' evaluated to TRUE
> 
> I have set the remove reason as I saw in an older email on the list (a few months ago):
> 
> # SYSTEM_PERIODIC_REMOVE with reasons
> ########################
> 
> # remove jobs running longer than 7 days
> RemoveReadyJobs = (( JobStatus == 2 ) && ( ( CurrentTime - EnteredCurrentStatus ) > 7 * 24 * 3600 ))
> 
> # remove jobs on hold for longer than 7 days
> RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 7 * 24 * 3600) )
> 
> # remove jobs with too many job starts or shadow starts
> RemoveMultipleRunJobs = ( NumJobStarts >= 10 )
> 
> # remove jobs idle for too long
> MaxJobIdleTime = 7 * 24 * 3600
> RemoveIdleJobs = (( JobStatus == 1 ) && ( ( CurrentTime - EnteredCurrentStatus ) > MaxJobIdleTime ))
> 
> # do it
> SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs)           || \
>                         $(RemoveMultipleRunJobs)    || \
>                         $(RemoveIdleJobs)           || \
>                         $(RemoveReadyJobs)
> 
> # set reason for remove
> SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
> ifThenElse($(RemoveReadyJobs), "runtime longer than reserved", \
> ifThenElse($(RemoveHeldJobs), "being in hold state for 7 days", \
> ifThenElse($(RemoveMultipleRunJobs), "10 or more job starts", \
> "being in idle state for 7 days"))),".")
> 
> I have reconfigured the master and schedd daemons, but the problem persists.
> 
> Do you have any idea how to fix this?
> 
> Thank you,
> Mihai
