Re: [HTCondor-users] Inconsistencies: hold versus abort
- Date: Tue, 2 Dec 2014 16:05:33 -0600 (CST)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Inconsistencies: hold versus abort
On Tue, 2 Dec 2014, Brian Candler wrote:
> Case (1): missing local file
> ...
> If you set a NODE_STATUS_FILE it won't help you: it shows
>   DagStatus = 3; /* "STATUS_SUBMITTED ()" */
> ...
>   NodeStatus = 1; /* "STATUS_READY" */
> It seems odd that the NODE_STATUS_FILE is not updated when dagman terminates
> - I'd expected the DagStatus to show STATUS_ERROR, and probably also the node
> which couldn't be submitted.

What version of DAGMan are you running? In 8.2.3 we fixed a bug that
could cause the node status file to not get updated when DAGMan exits.
When I try this, I get the following for the node status:
  [
    Type = "NodeStatus";
    Node = "NodeA";
    NodeStatus = 6; /* "STATUS_ERROR" */
    StatusDetails = "Job submit failed";
    RetryCount = 0;
    JobProcsQueued = 0;
    JobProcsHeld = 0;
  ]

Hopefully this is what you would want.

> Case (2): missing or temporarily unavailable remote file
> ...
> - on a personal condor this silently succeeds, in the sense that it makes no
>   attempt to transfer a file :-(

This will have to be dealt with at a level below DAGMan, because if
HTCondor claims the job succeeded, DAGMan doesn't have a way to know
otherwise (unless you add a POST script that checks the output).
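
For example, a minimal DAG fragment along those lines (the node name,
submit file, and expected output file here are just placeholders) would be
something like:

  # The POST script fails if the expected output file is missing or
  # empty, which makes DAGMan treat the node as failed.
  JOB NodeA nodeA.sub
  SCRIPT POST NodeA /bin/test -s output_from_remote.dat
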
However, if you set "should_transfer_files = true", then you get the same
behaviour as the following.

> - on a proper cluster with separate submit and execution nodes and different
>   filesystem domains, the job goes into "held" status.
> You can find this from the NODE_STATUS_FILE by looking for
>   JobProcsHeld = 1;
> and JOBSTATE_LOG shows
>   1417548893 t2 JOB_HELD 4486.0 - - 1
> But in both cases you don't get any indication of *why* it was held, and not
> in the <dag>.dagman.out file either. You have to use condor_q -analyze <pid>
> and parse its output:

You could find the hold reason in the DAGMan nodes.log file, or the log
file specified in the submit file (if you specify one).
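
While the held job is still in the queue, condor_q can also show the reason
directly; for example, with the job id from the JOBSTATE_LOG line above:

  condor_q -hold 4486.0
  condor_q -long 4486.0 | grep HoldReason
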
> Also, as far as I can tell there are no automatic retries (those would have
> to be done by condor_startd, presumably?)

As far as DAGMan is concerned, a job that's on hold is still possibly
going to succeed. So if you want the job to fail, you need to put a
periodic_remove expression into your submit file that removes the job
after it's been on hold for a certain amount of time. Then you could add
retries to your DAG node.
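
For example (the one-hour threshold, node name, and retry count here are
arbitrary placeholders):

  # In the node's submit file: give up on a job that has been held
  # (JobStatus == 5) for more than an hour.
  periodic_remove = (JobStatus == 5) && ((time() - EnteredCurrentStatus) > 3600)

  # In the DAG file: let DAGMan retry the node a few times.
  RETRY NodeA 3

The removed job then shows up to DAGMan as a node failure, and the RETRY
takes effect.
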
> Case (3): invalid scheme
> ...
> In this case the job hangs in "Idle" state, with an unmatched requirements
> expression.

Again, DAGMan doesn't know why the job is idle, so it will just wait
around for the job to finish.
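
If you'd rather have that node fail than wait forever, the same
periodic_remove trick can be adapted (again, the timeout is an arbitrary
placeholder):

  # Remove the job if it has stayed idle (JobStatus == 1) for more than
  # an hour; combined with RETRY this turns "stuck idle" into a node failure.
  periodic_remove = (JobStatus == 1) && ((time() - EnteredCurrentStatus) > 3600)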
Kent Wenger
CHTC Team