Re: [HTCondor-users] HTCondor-CE not purging finished jobs
- Date: Mon, 18 May 2020 23:27:40 +0200
 
- From: Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx>
 
- Subject: Re: [HTCondor-users] HTCondor-CE not purging finished jobs
 
Hi Brian,
Sorry for the 8 hours vs. 4 hours confusion. Jobs stay there much longer anyway.
I have set the debug level as you suggested (on a brand new CE that has only handled "ops" jobs so far).
I also reduced the remove policy to 2 hours (to be sure there is something to purge).
Before reconfiguring, I selected the held jobs whose proxy had been expired for more than 16 hours and found 12 such jobs:
[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 16 * 3600)'
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:28493> @ 05/18/20 22:27:32
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 66       5/16 21:35      _      _      _      1      1 66.0
ops046 ID: 67       5/16 23:35      _      _      _      1      1 67.0
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1      1 69.0
ops046 ID: 70       5/17 05:35      _      _      _      1      1 70.0
ops046 ID: 71       5/17 07:35      _      _      _      1      1 71.0
ops046 ID: 72       5/17 09:35      _      _      _      1      1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1      1 73.0
ops046 ID: 74       5/17 13:35      _      _      _      1      1 74.0
ops046 ID: 75       5/17 15:35      _      _      _      1      1 75.0
ops046 ID: 76       5/17 17:35      _      _      _      1      1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1      1 77.0
After a condor_ce_reconfig (I actually did a restart too), some of them are gone and some are still there:
[root@ce01-lhcb-t2 ~]# condor_ce_q 66.0 67.0 68.0 69.0 70.0 71.0 72.0 73.0 74.0 75.0 76.0 77.0
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/18/20 22:58:16
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0
ops046 ID: 69       5/17 03:35      _      _      _      1      1 69.0
ops046 ID: 72       5/17 09:35      _      _      _      1      1 72.0
ops046 ID: 73       5/17 11:35      _      _      _      1      1 73.0
ops046 ID: 76       5/17 17:35      _      _      _      1      1 76.0
ops046 ID: 77       5/17 19:35      _      _      _      1      1 77.0
Total for query: 6 jobs; 0 completed, 0 removed, 0 idle, 0 running, 6 held, 0 suspended
Total for all users: 15 jobs; 5 completed, 0 removed, 0 idle, 0 running, 10 held, 0 suspended
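The two listings above can be compared mechanically. Here is a minimal Python sketch, with the job IDs copied by hand from the two condor_ce_q outputs; the function and variable names are illustrative only:

```python
# Job IDs copied from the two condor_ce_q listings above.
before = [f"{cluster}.0" for cluster in range(66, 78)]  # 66.0 .. 77.0
after = ["68.0", "69.0", "72.0", "73.0", "76.0", "77.0"]


def split_jobs(before, after):
    """Return (removed, still_held), preserving the order of `before`."""
    remaining = set(after)
    removed = [job for job in before if job not in remaining]
    still_held = [job for job in before if job in remaining]
    return removed, still_held


removed, still_held = split_jobs(before, after)
print("removed:   ", " ".join(removed))
print("still held:", " ".join(still_held))
```

With the IDs above, this reports 66.0, 67.0, 70.0, 71.0, 74.0 and 75.0 as removed and the other six as still held.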
After the reconfig, the SchedLog shows the following for job 71.0 (which was removed):
05/18/20 22:33:05 (D_ALWAYS:2) abort_job_myself: 71.0 action:Remove log_hold:true
05/18/20 22:33:05 (D_ALWAYS:2) Cleared dirty attributes for job 71.0
05/18/20 22:33:05 (D_ALWAYS:2) Writing record to user logfile=/var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log owner=ops046
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::initialize: opened /var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log successfully
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 22:33:05 (D_ALWAYS) Job 71.0 aborted: CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 2 hours.
For job 68.0 (not removed), however, there is nothing after the reconfiguration time (22:30):
05/18/20 21:30:34 (D_ALWAYS:2) abort_job_myself: 68.0 action:Hold log_hold:true
05/18/20 21:30:34 (D_ALWAYS:2) Cleared dirty attributes for job 68.0
05/18/20 21:30:34 (D_ALWAYS:2) Writing record to user logfile=/var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log owner=ops046
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::initialize: opened /var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log successfully
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 21:30:34 (D_ALWAYS:2) SelfDrainingQueue act_on_job_myself_queue is empty, not resetting timer
And the job is still there:
[root@ce01-lhcb-t2 ~]# condor_ce_q 68.0
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @ 05/18/20 23:23:49
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
ops046 ID: 68       5/17 01:35      _      _      _      1      1 68.0
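To check whether the Schedd logged anything for a given job after the reconfiguration, the SchedLog lines can be filtered by job ID and timestamp. This is a minimal Python sketch, assuming the timestamp layout seen in the excerpts above; the function name and the sample list are illustrative only, and the sample lines are copied from those excerpts:

```python
from datetime import datetime

TS_FORMAT = "%m/%d/%y %H:%M:%S"  # e.g. "05/18/20 22:33:05"


def entries_since(log_lines, job_id, since):
    """Return the log lines that mention job_id and are stamped at or after `since`."""
    hits = []
    for line in log_lines:
        line = line.strip()
        try:
            stamp = datetime.strptime(line[:17], TS_FORMAT)
        except ValueError:
            continue  # wrapped continuation line or unrecognized format
        # Match the job ID as a whole whitespace-separated token only.
        if stamp >= since and job_id in line.split():
            hits.append(line)
    return hits


sample = [
    "05/18/20 21:30:34 (D_ALWAYS:2) abort_job_myself: 68.0 action:Hold log_hold:true",
    "05/18/20 22:33:05 (D_ALWAYS:2) abort_job_myself: 71.0 action:Remove log_hold:true",
    "05/18/20 22:33:05 (D_ALWAYS) Job 71.0 aborted: CE job removed by "
    "SYSTEM_PERIODIC_REMOVE due to being in the hold state for 2 hours.",
]
reconfig_time = datetime(2020, 5, 18, 22, 30)

print(len(entries_since(sample, "71.0", reconfig_time)))
print(len(entries_since(sample, "68.0", reconfig_time)))
```

On the three sample lines, job 71.0 has two entries after 22:30 and job 68.0 has none, which matches what I see in the real log.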
Cheers
Stefano
On 18/05/20 22:07, Brian Lin wrote:
Hi Stefano,
I'm a little confused: your system periodic remove expressions seem to remove jobs that have been held for more than 8 hours, whereas your queries are looking for held jobs whose proxies have been expired for more than 4 hours. I imagine there's some overlap, but they seem like fairly different queries.
Still, having the RemoveReason set that way is pretty strange. If you set "SCHEDD_DEBUG = D_CAT D_ALWAYS:2", you may see some hints in the SchedLog as to why the Schedd is failing to remove these jobs.
Thanks,
Brian
On 5/16/20 11:05 AM, Stefano Dal Pra wrote:
Hello,
htcondor-ce-3.4.0-1.el7.noarch here.
We have a problem common to all of our CEs:
[root@ce02-htc ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 4 * 3600)' -af Owner | sort | uniq -c
   9592 user1
      4 user2
   1114 user3
    575 user4
     44 user5
I have set up the REMOVE and REMOVE REASON rules:
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8)
SYSTEM_PERIODIC_REMOVE_REASON = strcat("CE job removed by SYSTEM_PERIODIC_REMOVE due to ", ifThenElse((JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8), "being in the hold state for 8 hours.", ifThenElse((JobStatus == 5 && isUndefined(RoutedToJobId)), "non-existent route or entry in JOB_ROUTER_ENTRIES.", "input files missing." ) ) )
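To make the intent of the two expressions explicit, here is a rough Python rendering of their logic. It is only a sketch: job ads are modeled as plain dicts, a missing RoutedToJobId key stands in for isUndefined(), and ClassAd three-valued (UNDEFINED) evaluation is not modeled. The function names are illustrative and not part of HTCondor:

```python
HELD = 5                # JobStatus value tested in the expressions above
MAX_HOLD = 3600 * 8     # same threshold as in SYSTEM_PERIODIC_REMOVE


def should_remove(job, now):
    """Mirror of SYSTEM_PERIODIC_REMOVE: held for longer than MAX_HOLD seconds."""
    return job["JobStatus"] == HELD and now - job["EnteredCurrentStatus"] > MAX_HOLD


def remove_reason(job, now):
    """Mirror of the nested ifThenElse in SYSTEM_PERIODIC_REMOVE_REASON."""
    prefix = "CE job removed by SYSTEM_PERIODIC_REMOVE due to "
    if should_remove(job, now):
        detail = "being in the hold state for 8 hours."
    elif job["JobStatus"] == HELD and "RoutedToJobId" not in job:
        detail = "non-existent route or entry in JOB_ROUTER_ENTRIES."
    else:
        detail = "input files missing."
    return prefix + detail


# Hypothetical ad: held 9 hours ago (times in seconds since an arbitrary origin).
old_job = {"JobStatus": HELD, "EnteredCurrentStatus": 0}
print(should_remove(old_job, now=9 * 3600))
print(remove_reason(old_job, now=9 * 3600))
```

Note that the reason expression always yields some string, whether or not the remove condition itself is true.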
Inspecting these non-purged jobs, they have a RemoveReason set, but they are nevertheless not gone:
[root@ce02-htc ~]# condor_ce_q 1679707.0 -af JobStatus RemoveReason
5 CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the hold state for 8 hours.
So far I have found no better way than removing these jobs manually, using something like:
condor_ce_q -cons '(JobStatus == 5 ) && (time() - x509UserProxyExpiration > 4 * 3600)' -af 'strcat(ClusterId,".",ProcId)' | xargs condor_ce_rm
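The selection step of that pipeline can be sketched in Python as follows. This assumes the job ads are already available as dicts (how they are fetched is left out), and that ads without x509UserProxyExpiration do not match the constraint. The function name and the sample ads are made up for illustration:

```python
def expired_held_job_ids(ads, now, grace_seconds=4 * 3600):
    """Return "ClusterId.ProcId" for held jobs whose proxy expired more than
    grace_seconds ago, mirroring the constraint and strcat() used above."""
    ids = []
    for ad in ads:
        expiry = ad.get("x509UserProxyExpiration")
        if expiry is None:
            continue  # assumption: undefined attribute means the ad does not match
        if ad.get("JobStatus") == 5 and now - expiry > grace_seconds:
            ids.append(f"{ad['ClusterId']}.{ad['ProcId']}")
    return ids


# Hypothetical ads; times are seconds since an arbitrary origin.
ads = [
    {"ClusterId": 10, "ProcId": 0, "JobStatus": 5, "x509UserProxyExpiration": 0},
    {"ClusterId": 11, "ProcId": 0, "JobStatus": 2, "x509UserProxyExpiration": 0},
    {"ClusterId": 12, "ProcId": 1, "JobStatus": 5, "x509UserProxyExpiration": 17000},
    {"ClusterId": 13, "ProcId": 0, "JobStatus": 5},
]
print(expired_held_job_ids(ads, now=5 * 3600))
```

The resulting list is what the shell pipeline passes to condor_ce_rm through xargs.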
Am I missing something obvious?
Cheers,
Stefano