Mailing List Archives
Re: [HTCondor-users] Getting DAG node to fail on file transfer error
- Date: Tue, 04 Nov 2014 19:55:28 +0000
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: Re: [HTCondor-users] Getting DAG node to fail on file transfer error
On 03/11/2014 16:14, R. Kent Wenger wrote:
Yes, getting the job to fail instead of going on hold is totally
independent of whether the job is part of a DAG or not. So, you need
to add an appropriate periodic_remove expression to your submit file(s).
Thank you. DAGMan will then presumably see the job as failed.
What attribute should I look for if I want to remove *all* held jobs?
That is, what is the right ClassAd attribute for identifying a job as
being held?
By experiment, a manually held job has:
HoldReason = "via condor_hold (by user brian)"
HoldReasonCode = 1
PeriodicHold = false
NumSystemHolds = 0
HoldReasonSubCode = 0
OnExitHold = false
and when subsequently released:
OnExitHold = false
LastHoldReasonSubCode = 0
LastHoldReasonCode = 1
NumSystemHolds = 0
PeriodicHold = false
LastHoldReason = "via condor_hold (by user brian)"
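A quick way to see which of these attributes actually signal "currently held" is to diff the two sets. The sketch below is plain Python and purely illustrative. It copies only the hold-related attributes quoted above, not the full job ad. It shows that HoldReason, HoldReasonCode and HoldReasonSubCode exist only while the job is held, and reappear with a Last prefix after release.

```python
# Hold-related attributes observed on a manually held job, and on the same
# job after it was released (values copied from the experiment above; the
# real job ad contains many more attributes).
held = {
    "HoldReason": "via condor_hold (by user brian)",
    "HoldReasonCode": 1,
    "PeriodicHold": False,
    "NumSystemHolds": 0,
    "HoldReasonSubCode": 0,
    "OnExitHold": False,
}
released = {
    "OnExitHold": False,
    "LastHoldReasonSubCode": 0,
    "LastHoldReasonCode": 1,
    "NumSystemHolds": 0,
    "PeriodicHold": False,
    "LastHoldReason": "via condor_hold (by user brian)",
}

# Set differences of the attribute names show what changes on release.
only_while_held = sorted(held.keys() - released.keys())
only_after_release = sorted(released.keys() - held.keys())

print(only_while_held)     # ['HoldReason', 'HoldReasonCode', 'HoldReasonSubCode']
print(only_after_release)  # ['LastHoldReason', 'LastHoldReasonCode', 'LastHoldReasonSubCode']
```

This reflects a single manual hold/release cycle, so it is an observation rather than a documented guarantee.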
So maybe this?
periodic_remove = HoldReasonCode =!= UNDEFINED
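The choice of =!= rather than != matters here. In a ClassAd expression, a reference to an attribute that is not present evaluates to UNDEFINED. Ordinary comparisons such as != propagate UNDEFINED, so "HoldReasonCode != UNDEFINED" can never be True. The meta-operator =!= always yields True or False. Since a job is removed only when periodic_remove evaluates to True, only the =!= form ever fires. The following is a minimal Python model of just that piece of three-valued logic. The helper names (lookup, not_equal, meta_not_equal, triggers_remove) are hypothetical and do not describe HTCondor's implementation.

```python
# Illustrative model of ClassAd UNDEFINED handling (not HTCondor's code).
UNDEFINED = object()  # sentinel standing in for the ClassAd UNDEFINED value


def lookup(ad, name):
    """Attribute reference: a missing attribute evaluates to UNDEFINED."""
    return ad.get(name, UNDEFINED)


def not_equal(a, b):
    """ClassAd '!=': an UNDEFINED operand makes the result UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a != b


def meta_not_equal(a, b):
    """ClassAd '=!=': always True or False, even with UNDEFINED operands.

    For two defined operands this sketch just uses Python's !=; real
    ClassAds apply further rules that are ignored here.
    """
    if a is UNDEFINED or b is UNDEFINED:
        return not (a is UNDEFINED and b is UNDEFINED)
    return a != b


def triggers_remove(result):
    """periodic_remove acts only when the expression is exactly True."""
    return result is True


# Minimal ads: HoldReasonCode exists only while the job is held.
held = {"HoldReasonCode": 1}
released = {"LastHoldReasonCode": 1}

for label, ad in (("held", held), ("released", released)):
    code = lookup(ad, "HoldReasonCode")
    strict = triggers_remove(not_equal(code, UNDEFINED))
    meta = triggers_remove(meta_not_equal(code, UNDEFINED))
    print(f"{label}: != removes={strict}, =!= removes={meta}")
# held: != removes=False, =!= removes=True
# released: != removes=False, =!= removes=False
```

HTCondor also records the job state directly in the JobStatus attribute, where 5 means Held. An expression such as periodic_remove = (JobStatus == 5) is therefore another way to say "remove whenever held".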
It seems to work. The downside is that the hold reason is lost: all I
get in dagman.out is ULOG_JOB_ABORTED.
Regards,
Brian.