Re: [HTCondor-users] HTCondor-CE not purging finished jobs

Date: Mon, 18 May 2020 23:27:40 +0200
From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor-CE not purging finished jobs

Hi Brian,

Sorry for the 8 hours vs 4 hours confusion. Jobs stay there much longer anyway.

I have set the debug level as you said (on a brand new CE that has only handled "ops" jobs until now). I also reduced the remove policy to 2 hours (to be sure there is something to purge).

Before reconfiguring, I selected the job IDs on hold for more than 16 hours and found 12 such jobs:

[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 16 * 3600)'

-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:28493> @05/18/20 22:27:32

OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 66       5/16 21:35      _      _      _      1      1 66.0
ops046 ID: 67       5/16 23:35      _      _      _      1      1 67.0
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1      1 69.0
ops046 ID: 70       5/17 05:35      _      _      _      1      1 70.0
ops046 ID: 71       5/17 07:35      _      _      _      1      1 71.0
ops046 ID: 72       5/17 09:35      _      _      _      1      1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1      1 73.0
ops046 ID: 74       5/17 13:35      _      _      _      1      1 74.0
ops046 ID: 75       5/17 15:35      _      _      _      1      1 75.0
ops046 ID: 76       5/17 17:35      _      _      _      1      1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1      1 77.0

After a condor_ce_reconfig (I actually did a restart too), a few of them are gone and a few are still there:

[root@ce01-lhcb-t2 ~]# condor_ce_q 66.0 67.0 68.0 69.0 70.0 71.0 72.0 73.0 74.0 75.0 76.0 77.0

-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @05/18/20 22:58:16

OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1      1 69.0
ops046 ID: 72       5/17 09:35      _      _      _      1      1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1      1 73.0
ops046 ID: 76       5/17 17:35      _      _      _      1      1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1      1 77.0

Total for query: 6 jobs; 0 completed, 0 removed, 0 idle, 0 running, 6 held, 0 suspended
Total for all users: 15 jobs; 5 completed, 0 removed, 0 idle, 0 running, 10 held, 0 suspended
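To make the before/after comparison explicit, here is a small Python sketch that diffs the two listings above. The job IDs are copied from the condor_ce_q outputs; nothing here talks to the schedd, and the variable names are only illustrative.

```python
# Job IDs held before the reconfig (first listing: clusters 66 to 77).
before = [f"{cluster}.0" for cluster in range(66, 78)]

# Job IDs still reported by condor_ce_q after the reconfig (second listing).
after = ["68.0", "69.0", "72.0", "73.0", "76.0", "77.0"]

# Jobs that disappeared from the queue between the two queries.
still_there = set(after)
removed = [job for job in before if job not in still_there]

print("removed:   ", " ".join(removed))
print("still held:", " ".join(after))
```

So half of the 12 jobs were purged and half were not, even though all of them matched the same query before the reconfig.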


The SchedLog after reconfig has, for job 71.0 (which has been removed):

05/18/20 22:33:05 (D_ALWAYS:2) abort_job_myself: 71.0 action:Remove log_hold:true
05/18/20 22:33:05 (D_ALWAYS:2) Cleared dirty attributes for job 71.0
05/18/20 22:33:05 (D_ALWAYS:2) Writing record to user logfile=/var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log owner=ops046
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::initialize: opened /var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log successfully
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 22:33:05 (D_ALWAYS) Job 71.0 aborted: CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 2 hours.

Looking for job 68.0 (not removed), however, there is nothing after the reconfiguration time (22:30):

05/18/20 21:30:34 (D_ALWAYS:2) abort_job_myself: 68.0 action:Hold log_hold:true
05/18/20 21:30:34 (D_ALWAYS:2) Cleared dirty attributes for job 68.0
05/18/20 21:30:34 (D_ALWAYS:2) Writing record to user logfile=/var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log owner=ops046
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::initialize: opened /var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log successfully
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 21:30:34 (D_ALWAYS:2) SelfDrainingQueue act_on_job_myself_queue is empty, not resetting timer


And the job is still there:
[root@ce01-lhcb-t2 ~]# condor_ce_q 68.0

-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @05/18/20 23:23:49

OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0
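
The per-job SchedLog check above can be repeated for the other jobs with a small script. This is only a sketch: it assumes log lines of the form "MM/DD/YY HH:MM:SS (CATEGORY) message" as in the excerpts, and the helper name job_events is made up for illustration. The sample lines are copied from the excerpts in this message.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpts: timestamp, (category), message.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(([^)]*)\) (.*)$")


def job_events(lines, job_id):
    """Return (timestamp, message) pairs for log lines mentioning job_id."""
    # Match the id as a whole token so "68.0" does not match "168.0" or "68.01".
    token = re.compile(r"(?<![\d.])" + re.escape(job_id) + r"(?!\d)")
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and token.search(m.group(3)):
            ts = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
            events.append((ts, m.group(3)))
    return events


sample = [
    "05/18/20 21:30:34 (D_ALWAYS:2) abort_job_myself: 68.0 action:Hold log_hold:true",
    "05/18/20 21:30:34 (D_ALWAYS:2) Cleared dirty attributes for job 68.0",
    "05/18/20 22:33:05 (D_ALWAYS:2) abort_job_myself: 71.0 action:Remove log_hold:true",
    "05/18/20 22:33:05 (D_ALWAYS) Job 71.0 aborted: CE job removed by "
    "SYSTEM_PERIODIC_REMOVE due to being in the hold state for 2 hours.",
]

reconfig = datetime(2020, 5, 18, 22, 30)  # reconfiguration time quoted above
for job in ("68.0", "71.0"):
    late = [msg for ts, msg in job_events(sample, job) if ts >= reconfig]
    print(job, "events after reconfig:", len(late))
```

On a real CE one would pass the lines of the actual SchedLog file instead of the sample list; a job with zero events after the reconfig time is one the schedd never tried to remove.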

Cheers
Stefano



On 18/05/20 22:07, Brian Lin wrote:

Hi Stefano,
I'm a little confused: your system periodic remove expressions seem to remove jobs that have been held for more than 8 hours, whereas your queries are looking for held jobs whose proxies have been expired for more than 4 hours. I imagine there's some overlap, but they seem like fairly different queries.
Though having the RemoveReason set that way is pretty strange. If you set "SCHEDD_DEBUG = D_CAT D_ALWAYS:2", you may see some hints in the SchedLog as to why the Schedd is failing to remove these jobs.
Thanks,
Brian

On 5/16/20 11:05 AM, Stefano Dal Pra wrote:
Hello,
htcondor-ce-3.4.0-1.el7.noarch here.

We have a problem common to all of our CEs:
[root@ce02-htc ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 4 * 3600)' -af Owner | sort | uniq -c
   9592 user1
      4 user2
   1114 user3
    575 user4
     44 user5

I have set up the REMOVE and REMOVE_REASON rules:
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8)
SYSTEM_PERIODIC_REMOVE_REASON = strcat("CE job removed by SYSTEM_PERIODIC_REMOVE due to ", ifThenElse((JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8), "being in the hold state for 8 hours.", ifThenElse((JobStatus == 5 && isUndefined(RoutedToJobId)), "non-existent route or entry in JOB_ROUTER_ENTRIES.", "input files missing." ) ) )
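
As a reading aid, the two expressions above can be paraphrased in Python. This is a simplified sketch, not how the schedd evaluates ClassAds: job ads are modelled as plain dicts, `now` stands in for CurrentTime, a missing RoutedToJobId key stands in for an undefined attribute, and the function names are made up.

```python
HELD = 5           # JobStatus value tested in the expressions above
LIMIT = 3600 * 8   # eight hours, in seconds


def periodic_remove(ad, now):
    """Paraphrase of SYSTEM_PERIODIC_REMOVE for a job ad given as a dict."""
    return ad["JobStatus"] == HELD and now - ad["EnteredCurrentStatus"] > LIMIT


def remove_reason(ad, now):
    """Paraphrase of SYSTEM_PERIODIC_REMOVE_REASON (nested ifThenElse)."""
    prefix = "CE job removed by SYSTEM_PERIODIC_REMOVE due to "
    if ad["JobStatus"] == HELD and now - ad["EnteredCurrentStatus"] > LIMIT:
        return prefix + "being in the hold state for 8 hours."
    if ad["JobStatus"] == HELD and "RoutedToJobId" not in ad:
        return prefix + "non-existent route or entry in JOB_ROUTER_ENTRIES."
    return prefix + "input files missing."


now = 1_589_800_000  # arbitrary illustrative timestamp
old_hold = {"JobStatus": HELD, "EnteredCurrentStatus": now - 9 * 3600}
fresh_hold = {"JobStatus": HELD, "EnteredCurrentStatus": now - 3600}

print(periodic_remove(old_hold, now), remove_reason(old_hold, now))
print(periodic_remove(fresh_hold, now))
```

Note that with only this one remove condition configured, the first branch of the reason always matches whenever the remove expression itself is true, so the other two reason strings are never reached.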
Inspecting these "non-purged jobs", they have a RemoveReason set, but they are nevertheless not gone:
[root@ce02-htc ~]# condor_ce_q 1679707.0 -af JobStatus RemoveReason
5 CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 8 hours.
Until now I have found no better way than removing these jobs manually, using something like:
condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 4 * 3600)' -af 'strcat(ClusterId,".",ProcId)' | xargs condor_ce_rm
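
The selection done by that pipeline can be sketched in Python to show what ends up being passed to condor_ce_rm. The job ads below are invented sample data (plain dicts, not real queue contents), and stale_held_job_ids is a hypothetical helper; jobs without x509UserProxyExpiration are skipped, on the assumption that an undefined attribute makes the constraint not match.

```python
def stale_held_job_ids(ads, now, hours=4):
    """Return "ClusterId.ProcId" for held jobs whose proxy expired > hours ago."""
    ids = []
    for ad in ads:
        expiry = ad.get("x509UserProxyExpiration")
        if ad.get("JobStatus") == 5 and expiry is not None and now - expiry > hours * 3600:
            # Same formatting as strcat(ClusterId, ".", ProcId) in the command above.
            ids.append(f'{ad["ClusterId"]}.{ad["ProcId"]}')
    return ids


now = 1_589_800_000  # arbitrary illustrative timestamp
ads = [  # hypothetical job ads
    {"ClusterId": 101, "ProcId": 0, "JobStatus": 5, "x509UserProxyExpiration": now - 5 * 3600},
    {"ClusterId": 102, "ProcId": 0, "JobStatus": 5, "x509UserProxyExpiration": now - 1 * 3600},
    {"ClusterId": 103, "ProcId": 0, "JobStatus": 2, "x509UserProxyExpiration": now - 9 * 3600},
    {"ClusterId": 104, "ProcId": 1, "JobStatus": 5},
]

print(" ".join(stale_held_job_ids(ads, now)))
```

Only the first sample job is selected: the second has a proxy that expired too recently, the third is not held, and the fourth has no proxy expiration attribute.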
Am I missing something obvious?
Cheers,
Stefano