Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] HTCondor-CE not purging finished jobs

Date: Wed, 20 May 2020 11:25:12 -0500
From: Brian Lin <blin@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor-CE not purging finished jobs

Hi Stefano,

Thanks for running those queries! If EnteredCurrentStatus is constantlyupdating, the jobs must be cycling between job states frequently. Couldyou pick one of these held jobs and run `grep <JOB ID>/var/lib/condor-ce/spool/job_queue.log` as root?

Also, could you send me the entirety of your SchedLog and`condor_ce_config_val -dump` off list?


Thanks,
Brian

On 5/19/20 1:25 AM, Stefano Dal Pra wrote:

Hello Brian,
In my last mail i was considering a 16 hours time limit. Comparingaccording to that:
# x509UserProxyExpiration limit find 8 entries:
[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time()- x509UserProxyExpiration > 16 * 3600)'-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @05/19/20 08:02:10
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1      1 69.0
ops046 ID: 72       5/17 09:35      _      _      _      1      1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1      1 73.0
ops046 ID: 76       5/17 17:35      _      _      _      1      1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1      1 77.0
ops046 ID: 80       5/18 01:35      _      _      _      1      1 80.0
ops046 ID: 81       5/18 03:35      _      _      _      1      1 81.0
Total for query: 8 jobs; 0 completed, 0 removed, 0 idle, 0 running, 8held, 0 suspendedTotal for all users: 20 jobs; 6 completed, 0 removed, 0 idle, 0running, 14 held, 0 suspended
# EnteredCurrentStatus limit find None.
[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time()- EnteredCurrentStatus > 16 * 3600)'-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @05/19/20 08:02:30OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTALJOB_IDS
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0held, 0 suspendedTotal for all users: 20 jobs; 6 completed, 0 removed, 0 idle, 0running, 14 held, 0 suspended
It seems to me that EnteredCurrentStatus in this case is updatedregularly and never become old enough:
[root@ce01-lhcb-t2 ~]# date
mar 19 mag 2020, 08.20.54, CEST
[root@ce01-lhcb-t2 ~]# condor_ce_q 68 69 72 73 76 77 80 81 -af'abstime(EnteredCurrentStatus)'
absTime("2020-05-19T05:32:24+02:00")
absTime("2020-05-19T05:32:24+02:00")
absTime("2020-05-19T05:30:24+02:00")
absTime("2020-05-19T05:30:24+02:00")
absTime("2020-05-19T05:27:23+02:00")
absTime("2020-05-19T05:27:23+02:00")
absTime("2020-05-19T05:26:22+02:00")
absTime("2020-05-19T05:26:22+02:00")

Cheers,
Stefano


Il 18/05/20 23:49, Brian Lin ha scritto:
Hi Stefano,
In addition to the time difference in the system periodic remove vsyour queries, you're also comparing two different attributes againstthe current time: EnteredCurrentStatus and x509UserProxyExpiration,respectively. I'm curious what you see from the following query:
# condor_ce_q -cons '(JobStatus == 5 ) && (time() -EnteredCurrentStatus > 2 * 3600)'
Could you use that list of jobs for your analysis of the SchedLog?

Thanks,
Brian

On 5/18/20 4:27 PM, Stefano Dal Pra wrote:
Hi Brian,
sorry for the 8 hours vs 4 hours confusion. Jobs stay there muchlonger anyway.
I have set the debug level as you said (on a brand new CE workingwith "ops" jobs only until now).I also reduced the remove policy to 2 hours (to be sure there issomething to purge).Before reconfiguring i selected jobid on hold for more than 16 hoursand found 12 such jobs:
[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) &&(time() - x509UserProxyExpiration > 16 * 3600)'-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:28493> @05/18/20 22:27:32OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTALJOB_IDS
ops046 ID: 66       5/16 21:35      _      _      _      1 1 66.0
ops046 ID: 67       5/16 23:35      _      _      _      1 1 67.0
ops046 ID: 68       5/17 01:35      _      _      _      1 1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1 1 69.0
ops046 ID: 70       5/17 05:35      _      _      _      1 1 70.0
ops046 ID: 71       5/17 07:35      _      _      _      1 1 71.0
ops046 ID: 72       5/17 09:35      _      _      _      1 1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1 1 73.0
ops046 ID: 74       5/17 13:35      _      _      _      1 1 74.0
ops046 ID: 75       5/17 15:35      _      _      _      1 1 75.0
ops046 ID: 76       5/17 17:35      _      _      _      1 1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1 1 77.0
After a condor_ce_reconfig (i actually did a restart too) a few ofthem are gone, and a few are still there:
[root@ce01-lhcb-t2 ~]# condor_ce_q 66.0 67.0 68.0 69.0 70.0 71.072.0 73.0 74.0 75.0 76.0 77.0-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @05/18/20 22:58:16OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTALJOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1 1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1 1 69.0
ops046 ID: 72       5/17 09:35      _      _      _      1 1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1 1 73.0
ops046 ID: 76       5/17 17:35      _      _      _      1 1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1 1 77.0
Total for query: 6 jobs; 0 completed, 0 removed, 0 idle, 0 running,6 held, 0 suspendedTotal for all users: 15 jobs; 5 completed, 0 removed, 0 idle, 0running, 10 held, 0 suspended
The SchedLog after reconfig has, for job 71.0 (this has been removed):
05/18/20 22:33:05 (D_ALWAYS:2) abort_job_myself: 71.0 action:Removelog_hold:true
05/18/20 22:33:05 (D_ALWAYS:2) Cleared dirty attributes for job 71.0
05/18/20 22:33:05 (D_ALWAYS:2) Writing record to userlogfile=/var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.logowner=ops04605/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::initialize: opened/var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.logsuccessfully
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 22:33:05 (D_ALWAYS) Job 71.0 aborted: CE job removed bySYSTEM_PERIODIC_REMOVE due to being in the hold state for 2 hours.
Looking for job 68.0 (not removed) however, there is nothing afterreconfiguration time (22:30):
05/18/20 21:30:34 (D_ALWAYS:2) abort_job_myself: 68.0 action:Holdlog_hold:true
05/18/20 21:30:34 (D_ALWAYS:2) Cleared dirty attributes for job 68.0
05/18/20 21:30:34 (D_ALWAYS:2) Writing record to userlogfile=/var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.logowner=ops04605/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::initialize: opened/var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.logsuccessfully
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 21:30:34 (D_ALWAYS:2) SelfDrainingQueueact_on_job_myself_queue is empty, not resetting timer
And the job is still there:
[root@ce01-lhcb-t2 ~]# condor_ce_q 68.0
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @05/18/20 23:23:49OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTALJOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1 1 68.0

Cheers
Stefano



Il 18/05/20 22:07, Brian Lin ha scritto:
Hi Stefano,
I'm a little confused, your system periodic remove expressions seemto remove jobs that have been held for more than 8 hours whereasyour queries are looking for held jobs whose proxies have beenexpired for more than 4 hours. I imagine there's some overlap butthey seem like fairly different queries.
Though having the RemoveReason set that ways is pretty strange. Ifyou set "SCHEDD_DEBUG = D_CAT D_ALWAYS:2", you may see some hintsin the SchedLog as to why the Schedd is failing to remove these jobs.
Thanks,
Brian

On 5/16/20 11:05 AM, Stefano Dal Pra wrote:
Hello,
htcondor-ce-3.4.0-1.el7.noarch here.

We have a problem common to all of our CEs:
[root@ce02-htc ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time()- x509UserProxyExpiration > 4 * 3600)' -af Owner | sort | uniq -c
   9592 user1
      4 user2
   1114 user3
    575 user4
     44 user5

I have set up REMOVE  and REMOVE REASON rule:
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5 && CurrentTime -EnteredCurrentStatus > 3600*8)SYSTEM_PERIODIC_REMOVE_REASON = strcat("CE job removed bySYSTEM_PERIODIC_REMOVE due to ", ifThenElse((JobStatus == 5 &&CurrentTime - EnteredCurrentStatus > 3600*8), "being in the holdstate for 8 hours.", ifThenElse((JobStatus == 5 &&isUndefined(RoutedToJobId)), "non-existent route or entry inJOB_ROUTER_ENTRIES.", "input files missing." ) ) )
Inspecting these "non purged jobs", they have a RemoveReason set,but they are not gone nevertheless:
[root@ce02-htc ~]# condor_ce_q 1679707.0 -af JobStatus RemoveReason
5 CE job removed by SYSTEM_PERIODIC_REMOVE due to being in thehold state for 8 hours.
Until now i have no better way than removing these jobs manuallyusing somethin like:condor_ce_q -cons '(JobStatus == 5 ) && (time() -x509UserProxyExpiration > 4 * 3600)' -af'strcat(ClusterId,".",ProcId)' | xargs condor_ce_rm
Do i miss something obvious?
Cheers,
Stefano
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message tohtcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

Follow-Ups:
- Re: [HTCondor-users] HTCondor-CE not purging finished jobs
  - From: Brian Lin

References:
- [HTCondor-users] HTCondor-CE not purging finished jobs
  - From: Stefano Dal Pra
- Re: [HTCondor-users] HTCondor-CE not purging finished jobs
  - From: Brian Lin
- Re: [HTCondor-users] HTCondor-CE not purging finished jobs
  - From: Stefano Dal Pra
- Re: [HTCondor-users] HTCondor-CE not purging finished jobs
  - From: Brian Lin
- Re: [HTCondor-users] HTCondor-CE not purging finished jobs
  - From: Stefano Dal Pra

Prev by Date: Re: [HTCondor-users] max_materialize not working with HTC 8.9.3 (failed to create ClassAd for Job)
Next by Date: [HTCondor-users] HTCondor 8.9.7 Released
Previous by thread: Re: [HTCondor-users] HTCondor-CE not purging finished jobs
Next by thread: Re: [HTCondor-users] HTCondor-CE not purging finished jobs
Index(es):
- Date
- Thread

Mailing List Archives

Authenticated access

Re: [HTCondor-users] HTCondor-CE not purging finished jobs