Hi Brian,
sorry for the 8-hour vs 4-hour confusion; the jobs stay there much
longer than that anyway.
I have set the debug level as you suggested (on a brand new CE that
has only been running "ops" jobs so far).
I also reduced the remove policy to 2 hours (to be sure there is
something to purge).
Before reconfiguring I selected the held jobs whose proxy had expired
more than 16 hours earlier and found 12 of them:
[root@ce01-lhcb-t2 ~]# condor_ce_q -cons '(JobStatus == 5 ) &&
(time() - x509UserProxyExpiration > 16 * 3600)'
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:28493> @
05/18/20 22:27:32
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL
JOB_IDS
ops046 ID: 66 5/16 21:35 _ _ _ 1 1 66.0
ops046 ID: 67 5/16 23:35 _ _ _ 1 1 67.0
ops046 ID: 68 5/17 01:35 _ _ _ 1 1 68.0
ops046 ID: 69 5/17 03:35 _ _ _ 1 1 69.0
ops046 ID: 70 5/17 05:35 _ _ _ 1 1 70.0
ops046 ID: 71 5/17 07:35 _ _ _ 1 1 71.0
ops046 ID: 72 5/17 09:35 _ _ _ 1 1 72.0
ops046 ID: 73 5/17 11:35 _ _ _ 1 1 73.0
ops046 ID: 74 5/17 13:35 _ _ _ 1 1 74.0
ops046 ID: 75 5/17 15:35 _ _ _ 1 1 75.0
ops046 ID: 76 5/17 17:35 _ _ _ 1 1 76.0
ops046 ID: 77 5/17 19:35 _ _ _ 1 1 77.0
After a condor_ce_reconfig (I actually did a restart too) a few of
them are gone, and a few are still there:
[root@ce01-lhcb-t2 ~]# condor_ce_q 66.0 67.0 68.0 69.0 70.0 71.0
72.0 73.0 74.0 75.0 76.0 77.0
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @
05/18/20 22:58:16
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL
JOB_IDS
ops046 ID: 68 5/17 01:35 _ _ _ 1 1 68.0
ops046 ID: 69 5/17 03:35 _ _ _ 1 1 69.0
ops046 ID: 72 5/17 09:35 _ _ _ 1 1 72.0
ops046 ID: 73 5/17 11:35 _ _ _ 1 1 73.0
ops046 ID: 76 5/17 17:35 _ _ _ 1 1 76.0
ops046 ID: 77 5/17 19:35 _ _ _ 1 1 77.0
Total for query: 6 jobs; 0 completed, 0 removed, 0 idle, 0 running,
6 held, 0 suspended
Total for all users: 15 jobs; 5 completed, 0 removed, 0 idle, 0
running, 10 held, 0 suspended
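In case it matters, this is how I double-checked that the schedd really
picked up the new settings (I assume PERIODIC_EXPR_INTERVAL, default 60
seconds, is the knob that controls how often the remove expression is
evaluated):
[root@ce01-lhcb-t2 ~]# condor_ce_config_val SYSTEM_PERIODIC_REMOVE
[root@ce01-lhcb-t2 ~]# condor_ce_config_val SYSTEM_PERIODIC_REMOVE_REASON
[root@ce01-lhcb-t2 ~]# condor_ce_config_val PERIODIC_EXPR_INTERVAL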
After the reconfig, the SchedLog shows this for job 71.0 (which has been removed):
05/18/20 22:33:05 (D_ALWAYS:2) abort_job_myself: 71.0 action:Remove
log_hold:true
05/18/20 22:33:05 (D_ALWAYS:2) Cleared dirty attributes for job 71.0
05/18/20 22:33:05 (D_ALWAYS:2) Writing record to user
logfile=/var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log
owner=ops046
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::initialize: opened
/var/lib/condor-ce/spool/71/0/cluster71.proc0.subproc0/gridjob.log
successfully
05/18/20 22:33:05 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 22:33:05 (D_ALWAYS) Job 71.0 aborted: CE job removed by
SYSTEM_PERIODIC_REMOVE due to being in the hold state for 2 hours.
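(For reference, I am pulling these excerpts with a plain grep on the
schedd log, which on this host is at the default
/var/log/condor-ce/SchedLog:)
[root@ce01-lhcb-t2 ~]# grep '71\.0' /var/log/condor-ce/SchedLog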
For job 68.0 (not removed), however, there is nothing after the
reconfiguration time (22:30); the last entries are:
05/18/20 21:30:34 (D_ALWAYS:2) abort_job_myself: 68.0 action:Hold
log_hold:true
05/18/20 21:30:34 (D_ALWAYS:2) Cleared dirty attributes for job 68.0
05/18/20 21:30:34 (D_ALWAYS:2) Writing record to user
logfile=/var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log
owner=ops046
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::initialize: opened
/var/lib/condor-ce/spool/68/0/cluster68.proc0.subproc0/gridjob.log
successfully
05/18/20 21:30:34 (D_ALWAYS:2) WriteUserLog::user_priv_flag (~) is 0
05/18/20 21:30:34 (D_ALWAYS:2) SelfDrainingQueue
act_on_job_myself_queue is empty, not resetting timer
And the job is still there:
[root@ce01-lhcb-t2 ~]# condor_ce_q 68.0
-- Schedd: ce01-lhcb-t2.cr.cnaf.infn.it : <131.154.192.120:30149> @
05/18/20 23:23:49
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL
JOB_IDS
ops046 ID: 68 5/17 01:35 _ _ _ 1 1 68.0
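If it helps, the removal condition can also be evaluated directly
against the stuck job ad with an autoformat expression (here with the
2-hour value I set), to check whether the expression itself matches:
[root@ce01-lhcb-t2 ~]# condor_ce_q 68.0 -af JobStatus EnteredCurrentStatus '(time() - EnteredCurrentStatus) > 3600*2'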
Cheers
Stefano
On 18/05/20 22:07, Brian Lin wrote:
Hi Stefano,
I'm a little confused: your SYSTEM_PERIODIC_REMOVE expression removes
jobs that have been held for more than 8 hours, whereas your queries
look for held jobs whose proxies have been expired for more than 4
hours. I imagine there's some overlap, but they are fairly different
queries.
That said, having the RemoveReason set that way is pretty strange. If
you set "SCHEDD_DEBUG = D_CAT D_ALWAYS:2", you may see some hints
in the SchedLog as to why the Schedd is failing to remove these jobs.
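Something along these lines in a config.d snippet, followed by a
condor_ce_reconfig, should be enough (the file name below is just an
example):
# /etc/condor-ce/config.d/99-schedd-debug.conf (example name)
SCHEDD_DEBUG = D_CAT D_ALWAYS:2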
Thanks,
Brian
On 5/16/20 11:05 AM, Stefano Dal Pra wrote:
Hello,
htcondor-ce-3.4.0-1.el7.noarch here.
We have a problem common to all of our CEs:
[root@ce02-htc ~]# condor_ce_q -cons '(JobStatus == 5 ) && (time()
- x509UserProxyExpiration > 4 * 3600)' -af Owner | sort | uniq -c
9592 user1
4 user2
1114 user3
575 user4
44 user5
I have set up SYSTEM_PERIODIC_REMOVE and SYSTEM_PERIODIC_REMOVE_REASON rules:
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8)
SYSTEM_PERIODIC_REMOVE_REASON = strcat("CE job removed by SYSTEM_PERIODIC_REMOVE due to ", \
    ifThenElse((JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*8), \
        "being in the hold state for 8 hours.", \
        ifThenElse((JobStatus == 5 && isUndefined(RoutedToJobId)), \
            "non-existent route or entry in JOB_ROUTER_ENTRIES.", \
            "input files missing.")))
Inspecting these non-purged jobs, I see that they have a RemoveReason
set, but they are still not gone:
[root@ce02-htc ~]# condor_ce_q 1679707.0 -af JobStatus RemoveReason
5 CE job removed by SYSTEM_PERIODIC_REMOVE due to being in the
hold state for 8 hours.
Until now I have found no better way than removing these jobs manually,
using something like:
condor_ce_q -cons '(JobStatus == 5 ) && (time() -
x509UserProxyExpiration > 4 * 3600)' -af
'strcat(ClusterId,".",ProcId)' | xargs condor_ce_rm
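As a side note, the same cleanup can presumably be done in a single
step with a constraint, assuming condor_ce_rm accepts -constraint the
same way condor_rm does:
[root@ce02-htc ~]# condor_ce_rm -constraint '(JobStatus == 5) && (time() - x509UserProxyExpiration > 4 * 3600)'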
Am I missing something obvious?
Cheers,
Stefano