
Re: [HTCondor-users] Many jobs failure for Alice VO



Hi Mihai,

Were you able to determine whether any of the VO users was setting periodic_remove in their jobs?

Thanks

Jason Patton
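
For reference, the per-job expression quoted in the original report further down has two clauses, and it can help to see which one fires for a given job. Below is a minimal Python sketch of that logic, not HTCondor code. The function name is hypothetical, an undefined ResidentSetSize is modelled as None, and JobMemoryLimit is assumed to be a defined integer in the same units as ResidentSetSize (real ClassAd evaluation also has an UNDEFINED result, which this sketch ignores).

```python
from typing import Optional

IDLE = 1  # HTCondor JobStatus code for an idle job


def periodic_remove_clause(job_status: int,
                           num_job_starts: int,
                           resident_set_size: Optional[int],
                           job_memory_limit: int) -> Optional[str]:
    """Return a label for the clause that evaluates to true, or None.

    Hypothetical helper mirroring the quoted expression:
      (JobStatus == 1 && NumJobStarts > 0) ||
      ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)
    """
    # Clause 1: the job is idle although it has already started at least once.
    if job_status == IDLE and num_job_starts > 0:
        return "idle again after at least one start"
    # Clause 2: memory use above the limit; an undefined value counts as 0.
    rss = resident_set_size if resident_set_size is not None else 0
    if rss > job_memory_limit:
        return "ResidentSetSize above JobMemoryLimit"
    return None


if __name__ == "__main__":
    # A job that started once and is idle again (for example after an eviction).
    print(periodic_remove_clause(1, 1, None, 8_000_000))
    # A running job whose memory use exceeds the limit.
    print(periodic_remove_clause(2, 1, 9_000_000, 8_000_000))
    # A running job within its limit: neither clause applies.
    print(periodic_remove_clause(2, 1, 1_000_000, 8_000_000))
```

Read this way, the first clause removes any job that returns to the idle state after it has started once, so the job is removed rather than restarted. The second clause is a memory cap.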

On Thu, Mar 30, 2023 at 7:36 AM Mihai Ciubancan <ciubancan@xxxxxxxx> wrote:
Hi Max,

Thank you for the hint!

I will check it.

Cheers,
Mihai


On 2023-03-29 14:30, Fischer, Max (SCC) wrote:
> Hi Mihai,
>
> there are two kinds of periodic remove expressions: one for the system,
> affecting all jobs, and one for *each* job. The error message suggests
> this is the per-job remove expression triggering.
>
> I would recommend checking whether the VOs' jobs have a
> `PERIODIC_REMOVE` attribute or similar, then trying to find out where
> it is set.
>
> Cheers,
> Max
>
>> On 28. Mar 2023, at 14:17, Mihai Ciubancan <ciubancan@xxxxxxxx> wrote:
>>
>> In the last few weeks I have seen a lot of Alice VO (and also LHCb)
>> jobs failing with the following message:
>>
>> The job attribute PeriodicRemove expression '(JobStatus == 1 &&
>> NumJobStarts > 0) || ((ResidentSetSize =!= undefined ? ResidentSetSize
>> : 0) > JobMemoryLimit)' evaluated to TRUE
>>
>> I have set the remove reason as I saw in an older email on the list
>> (a few months ago):
>>
>> # SYSTEM_PERIODIC_REMOVE with reasons
>> ########################
>>
>> # remove jobs running longer than 7 days
>> RemoveReadyJobs = (( JobStatus == 2 ) && ( ( CurrentTime - EnteredCurrentStatus ) > 7 * 24 * 3600 ))
>>
>> # remove jobs on hold for longer than 7 days
>> RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 7 * 24 * 3600) )
>>
>> # remove jobs with too many job starts or shadow starts
>> RemoveMultipleRunJobs = ( NumJobStarts >= 10 )
>>
>> # remove jobs idle for too long
>> MaxJobIdleTime = 7 * 24 * 3600
>> RemoveIdleJobs = (( JobStatus == 1 ) && ( ( CurrentTime - EnteredCurrentStatus ) > MaxJobIdleTime ))
>>
>> # do it
>> SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs)        || \
>>                          $(RemoveMultipleRunJobs) || \
>>                          $(RemoveIdleJobs)        || \
>>                          $(RemoveReadyJobs)
>>
>> # set reason for remove
>> SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
>> ifThenElse($(RemoveReadyJobs), "runtime longer than reserved", \
>> ifThenElse($(RemoveHeldJobs), "being in hold state for 7 days", \
>> ifThenElse($(RemoveMultipleRunJobs), "more than 10 failed jobstarts", \
>> "being in idle state for 10 days"))),".")
>>
>> I have reconfigured the master and schedd daemons, but the problem
>> persists.
>>
>> Do you have any idea how to fix this?
>>
>> Thank you,
>> Mihai
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
>
>