Hi Mihai,
Could it simply have been bad jobs exceeding the JobMemoryLimit?
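For illustration, the memory clause of the per-job PeriodicRemove expression quoted later in this thread, `(ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit`, can be sketched in plain shell. The values below are hypothetical, not taken from the failed jobs:

```shell
# Sketch of the memory clause of the per-job PeriodicRemove expression:
#   (ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit
# Both attributes are sizes in KiB on real jobs; these values are made up.
rss_kib=4500000            # hypothetical ResidentSetSize (~4.3 GiB)
job_memory_limit=4194304   # hypothetical JobMemoryLimit (4 GiB)

# ${rss_kib:-0} mirrors the "undefined ? ... : 0" fallback in the ClassAd
if [ "${rss_kib:-0}" -gt "$job_memory_limit" ]; then
    echo "memory clause TRUE: job would be removed"
else
    echo "memory clause FALSE"
fi
```

A job whose resident set grows past its limit would trip this clause and be removed with the "evaluated to TRUE" message seen below.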
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Mihai Ciubancan <ciubancan@xxxxxxxx>
Sent: Monday, April 3, 2023 9:48 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Many jobs failure for Alice VO

Hi Jason,
I wasn't able to find this info. But in the meantime I have also modified the fairshare policy, and I saw that from that moment the number of failed jobs dropped:

[root@arc6atlas1 ~]# condor_history |grep alice01|head -7500|grep " X "|wc -l
1192
[root@arc6atlas1 ~]# condor_history |grep alice01|head -3500|grep " X "|wc -l
175

It also dropped for LHCb in the last 5 days. Anyway, I will keep an eye on the situation over the next days to see whether the number of failed jobs remains low.

Thank you,
Mihai

On 2023-03-31 23:29, Jason Patton via HTCondor-users wrote:
> Hi Mihai,
>
> Were you able to determine if one of the VO users were setting
> periodic_remove in their jobs?
>
> Thanks
>
> Jason Patton
>
> On Thu, Mar 30, 2023 at 7:36 AM Mihai Ciubancan <ciubancan@xxxxxxxx>
> wrote:
>
>> Hi Max,
>>
>> Thank you for the hint!
>>
>> I will check it.
>>
>> Cheers,
>> Mihai
>>
>> On 2023-03-29 14:30, Fischer, Max (SCC) wrote:
>>> Hi Mihai,
>>>
>>> there are two kinds of periodic remove expressions: one for the
>>> system, affecting all jobs, and one for *each* job. The error
>>> message suggests this is the per-job remove expression triggering.
>>>
>>> I would recommend checking the VOs' jobs for a `PERIODIC_REMOVE`
>>> attribute or similar, then trying to find out where it is set.
>>>
>>> Cheers,
>>> Max
>>>
>>>> On 28. Mar 2023, at 14:17, Mihai Ciubancan <ciubancan@xxxxxxxx>
>>>> wrote:
>>>>
>>>> In the last weeks I have seen a lot of Alice VO (and also LHCb)
>>>> jobs failing with the following message:
>>>>
>>>> The job attribute PeriodicRemove expression '(JobStatus == 1 &&
>>>> NumJobStarts > 0) || ((ResidentSetSize =!= undefined ?
>>>> ResidentSetSize : 0) > JobMemoryLimit)' evaluated to TRUE
>>>>
>>>> I have set the remove reason as I saw in an older email on the
>>>> list (a few months ago):
>>>>
>>>> # SYSTEM_PERIODIC_REMOVE with reasons
>>>> ########################
>>>>
>>>> # remove jobs running longer than 7 days
>>>> RemoveReadyJobs = (( JobStatus == 2 ) && ( ( CurrentTime -
>>>> EnteredCurrentStatus ) > 7 * 24 * 3600 ))
>>>>
>>>> # remove jobs on hold for longer than 7 days
>>>> RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime -
>>>> EnteredCurrentStatus) > 7 * 24 * 3600) )
>>>>
>>>> # remove jobs with too many job starts or shadow starts
>>>> RemoveMultipleRunJobs = ( NumJobStarts >= 10 )
>>>>
>>>> # remove jobs idle for too long
>>>> MaxJobIdleTime = 7 * 24 * 3600
>>>> RemoveIdleJobs = (( JobStatus == 1 ) && ( ( CurrentTime -
>>>> EnteredCurrentStatus ) > MaxJobIdleTime ))
>>>>
>>>> # do it
>>>> SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs) || \
>>>>    $(RemoveMultipleRunJobs) || \
>>>>    $(RemoveIdleJobs) || \
>>>>    $(RemoveReadyJobs)
>>>>
>>>> # set reason for remove
>>>> SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by
>>>> SYSTEM_PERIODIC_REMOVE due to ", \
>>>>    ifThenElse($(RemoveReadyJobs), "runtime longer than reserved", \
>>>>    ifThenElse($(RemoveHeldJobs), "being in hold state for 7 days", \
>>>>    ifThenElse($(RemoveMultipleRunJobs), "more than 10 failed jobstarts", \
>>>>    "being in idle state for 7 days"))), ".")
>>>>
>>>> I have reconfigured the master and schedd daemons, but the problem
>>>> persists.
>>>>
>>>> Do you have any idea how to fix this?
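If the per-job PeriodicRemove expression turns out to come from the VO's own submit machinery rather than from the pool configuration, one possible workaround is a schedd job transform that strips the attribute at submit time. This is only a sketch, assuming HTCondor 8.9+ new-style job transforms and a transform name chosen here for illustration; check it against your version's documentation before deploying:

```
# Hypothetical transform name; requires new-style job transforms (HTCondor 8.9+)
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) StripPeriodicRemove
JOB_TRANSFORM_StripPeriodicRemove @=end
   # only touch jobs that actually define a per-job remove expression
   REQUIREMENTS PeriodicRemove isnt undefined
   DELETE PeriodicRemove
@end
```

Before doing anything like this, it should first be confirmed which jobs actually carry the attribute, e.g. with something like `condor_q -allusers -constraint 'PeriodicRemove =!= undefined' -af:j Owner PeriodicRemove`. Note that removing a VO's own safety expression may conflict with their workload management, so coordinating with the VO is probably preferable.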
>>>> Thank you,
>>>> Mihai

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/