Re: [HTCondor-users] centrally force removal after some time even if leave_in_queue is true?
- Date: Wed, 07 Nov 2018 17:38:28 +0100
- From: Andrea Sartirana <sartiran@xxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] centrally force removal after some time even if leave_in_queue is true?
Hi Todd,
your solution seems to work: the LeaveJobInQueue ClassAd attribute is
changed [1] and correctly evaluates to false once the expiration time
has passed [2]. But indeed, as Michael said, it does not really fix my
problem, since the jobs are not removed from the queue (in the sense
that they still appear in the condor_q output).
Is this because something is misconfigured on our schedd?
If not, I guess only a cron job running "condor_rm -forcex ..." can fix
the issue...
(Anyway, job transforms do indeed seem very powerful.)
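
The cron-based fallback mentioned above could be sketched roughly as follows. This is a minimal sketch, not a tested procedure: the script, the function names, and the MAX_AGE and DRY_RUN variables are hypothetical, and the 500-second threshold is simply taken from the expression in [1]. It relies only on condor_rm's documented -forcex and -constraint options (the manual spells the flag -forcex), and it defaults to printing the command rather than running it.

```shell
#!/bin/sh
# Hypothetical cleanup script; names and defaults are illustrative only.
MAX_AGE=${MAX_AGE:-500}   # seconds a job may stay in the X (removed) state
DRY_RUN=${DRY_RUN:-1}     # 1 = only print the command, 0 = really run it

build_constraint() {
    # JobStatus == 3 selects removed jobs; EnteredCurrentStatus is the
    # time at which the job entered its current state.
    printf 'JobStatus == 3 && (time() - EnteredCurrentStatus) > %s' "$1"
}

purge_removed_jobs() {
    constraint=$(build_constraint "$MAX_AGE")
    if [ "$DRY_RUN" = "1" ]; then
        echo "condor_rm -forcex -constraint '$constraint'"
    else
        condor_rm -forcex -constraint "$constraint"
    fi
}

purge_removed_jobs
```

Saved under some path of your choosing and run with DRY_RUN=0 from root's crontab on the schedd host (for instance every ten minutes), this would force out X-state jobs older than MAX_AGE seconds regardless of their LeaveJobInQueue value.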
Regards,
Andrea
[1]
[root@llrmpicream ~]# condor_q -long 217401.0|grep InQueue
LeaveJobInQueue = ( JobStatus == 3 && ( time() - EnteredCurrentStatus ) > 500 ) ? false : SubmitterLeaveJobInQueue
SubmitterLeaveJobInQueue = ( CompletionDate =?= undefined || CompletionDate == 0 || ( ( CurrentTime - CompletionDate ) < 1800 ) )
[2]
[root@llrmpicream ~]# condor_q -constraint 'JobStatus ==3 && !LeaveJobInQueue'
-- Schedd: llrmpicream.in2p3.fr : <134.158.132.244:9125> @ 11/07/18 17:31:33
OWNER    BATCH_NAME                         SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
cmspilot CMD: CREAM467257994_jobWrapper.sh  11/7 16:21     _     _       _      1 217399.0
cmspilot CMD: CREAM078494348_jobWrapper.sh  11/7 16:21     _     _       _      1 217400.0
ops001   CMD: CREAM155506574_jobWrapper.sh  11/7 16:24     _     _       _      1 217401.0
ops000   CMD: CREAM056514266_jobWrapper.sh  11/7 16:42     _     _       _      1 217405.0
4 jobs; 0 completed, 4 removed, 0 idle, 0 running, 0 held, 0 suspended
[root@llrmpicream ~]#
On 31/10/2018 16:32, Todd Tannenbaum wrote:
On 10/31/2018 5:49 AM, Andrea Sartirana wrote:
Hi,
much is in the title.
I was wondering if there is a way to force removal from the queue of the
X state jobs after some centrally defined time even if the
leave_in_queue expression given by the user at submission still
evaluates to true. I'm running 8.6.0, vanilla universe, direct submission.
I've tried to include garbage collection of the removed jobs in
SYSTEM_PERIODIC_REMOVE, but this does not seem to have the desired effect.
Regards
Andrea
Hi Andrea,
There may be an easier way, but a quick thought is you could use Job Transforms to accomplish the above. Job Transforms allow you, the administrator, to edit job classads upon submission --- see this section of the v8.6 manual:
http://htcondor.org/manual/v8.6/3_7Policy_Configuration.html#38930
So the idea here is to configure your schedd to edit the user's leave_in_queue expression (which ends up in the job classad as attribute LeaveJobInQueue) so that it will always evaluate to False for X state jobs after a specified amount of time, else fall back to whatever the user wanted.
Try appending the below to the HTCondor configuration (it will be used by your submit machines, and ignored on machines not running a schedd) to allow jobs in X state to leave the queue after 120 seconds regardless of what the user's submit file says:
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) LeaveInQueue
JOB_TRANSFORM_LeaveInQueue @=end
[
copy_LeaveJobInQueue = "SubmitterLeaveJobInQueue";
set_LeaveJobInQueue = (JobStatus == 3 && (time() - EnteredCurrentStatus) > 120) ? False : SubmitterLeaveJobInQueue
]
@end
Warning - the above is off the top of my head, I did not test it.
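
To make the intent of the set_LeaveJobInQueue expression concrete, here is a toy shell model of its decision logic. It is not HTCondor code and does not talk to a schedd; the function name and the THRESHOLD variable are made up for illustration, and the third argument stands in for whatever the submitter's original expression (copied to SubmitterLeaveJobInQueue) evaluates to.

```shell
#!/bin/sh
# Toy model of the transformed LeaveJobInQueue expression; illustrative only.
THRESHOLD=${THRESHOLD:-120}   # seconds, matching the 120 in the transform

effective_leave_in_queue() {
    job_status=$1        # JobStatus (3 = removed, i.e. the X state)
    secs_in_status=$2    # time() - EnteredCurrentStatus
    submitter_value=$3   # value of SubmitterLeaveJobInQueue: true or false
    if [ "$job_status" -eq 3 ] && [ "$secs_in_status" -gt "$THRESHOLD" ]; then
        echo false       # removed for long enough: override the submitter
    else
        echo "$submitter_value"   # otherwise keep the submitter's choice
    fi
}

effective_leave_in_queue 3 60 true    # prints true  (still within 120 s)
effective_leave_in_queue 3 300 true   # prints false (removed for over 120 s)
effective_leave_in_queue 4 300 true   # prints true  (completed, not removed)
```

The override only applies to removed jobs; completed jobs (JobStatus 4) keep whatever the submit file requested.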
Seems like HTCondor would benefit from a SYSTEM_LEAVE_IN_QUEUE knob to make doing the above simpler. But Job Transforms are a pretty powerful generic tool.
Hope the above helps.
regards,
Todd