
[HTCondor-users] Is $DAG_STATUS available elsewhere?



The documentation at
http://research.cs.wisc.edu/htcondor/manual/v8.0/2_10DAGMan_Applications.html#sec:DAGFinalNode
defines $DAG_STATUS and the meanings of its values 0 to 6.

Question: is this value available anywhere else, e.g. in the dagman log or the jobstate_log? If so, I cannot find it. Does this mean I have to add a dummy FINAL node to a DAG purely to capture $DAG_STATUS?

I have done a few experiments.

* The exit code of condor_dagman for any DAG failure appears to always be 1, unless you use ABORT-DAG-ON, in which case it is the exit status of the failed node or the replacement value from the RETURN clause.

* Here is a simple example where $DAG_STATUS is 2, but I cannot find this value of 2 anywhere in the log files.

==> dag1.dag <==
JOB J1 nonexistent.sub
JOBSTATE_LOG jobstate1.log

$ condor_submit_dag -f dag1.dag
$ condor_wait dag1.dag.dagman.log
$ cat jobstate1.log
1401275749 INTERNAL *** DAGMAN_STARTED 3655962.0 ***
1401275762 J1 SUBMIT_FAILURE - - - 1
1401275767 J1 SUBMIT_FAILURE - - - 1
1401275772 J1 SUBMIT_FAILURE - - - 1
1401275777 J1 SUBMIT_FAILURE - - - 1
1401275788 J1 SUBMIT_FAILURE - - - 1
1401275805 J1 SUBMIT_FAILURE - - - 1
1401275805 INTERNAL *** DAGMAN_FINISHED 1 ***
$ cat dag1.dag.dagman.out
... can't find anything relevant ...
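
For anyone repeating this experiment, here is a rough sketch of how the jobstate log can be summarised so it is easier to see what it does and does not contain. The function name summarise_jobstate is made up for illustration, and the field positions are assumed purely from the sample lines above (timestamp, node name, event name, and the value following DAGMAN_FINISHED on INTERNAL lines). It does not recover $DAG_STATUS; it only reports what the log records.

```shell
#!/bin/sh
# Hypothetical helper: summarise a JOBSTATE_LOG file.
# Assumes the layout seen in the sample above:
#   <timestamp> <node> <event> ...
#   <timestamp> INTERNAL *** DAGMAN_FINISHED <value> ***
summarise_jobstate() {
    awk '
        # Remember the value reported on the DAGMAN_FINISHED line.
        $2 == "INTERNAL" && $4 == "DAGMAN_FINISHED" { finished = $5; next }
        # Count every (node, event) pair on ordinary node lines.
        $2 != "INTERNAL" { count[$2 " " $3]++ }
        END {
            for (k in count) print k, count[k]
            if (finished != "") print "DAGMAN_FINISHED", finished
        }
    ' "$1"
}
```

Called as `summarise_jobstate jobstate1.log` on the log above, this should report six SUBMIT_FAILURE events for J1 and a DAGMAN_FINISHED value of 1, which is the DAGMan exit code rather than the $DAG_STATUS of 2.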

Adding a FINAL node shows that $DAG_STATUS is indeed 2:

==> dag2.dag <==
JOB J1 nonexistent.sub
JOBSTATE_LOG jobstate2.log
FINAL final /dev/null NOOP
SCRIPT POST final final.sh $DAG_STATUS $FAILED_COUNT

==> final.sh <==
#!/bin/sh
echo "exiting with code $1" >/tmp/final.log
exit $1

$ condor_submit_dag dag2.dag
$ condor_wait dag2.dag.dagman.log
$ cat /tmp/final.log
exiting with code 2
$ cat jobstate2.log
1401275842 INTERNAL *** DAGMAN_STARTED 3655965.0 ***
1401275855 J1 SUBMIT_FAILURE - - - 1
1401275860 J1 SUBMIT_FAILURE - - - 1
1401275865 J1 SUBMIT_FAILURE - - - 1
1401275870 J1 SUBMIT_FAILURE - - - 1
1401275881 J1 SUBMIT_FAILURE - - - 1
1401275898 J1 SUBMIT_FAILURE - - - 1
1401275903 final SUBMIT 0.2147483647 - - 2
1401275903 final JOB_TERMINATED 0.2147483647 - - 2
1401275903 final JOB_SUCCESS 0 - - 2
1401275903 final POST_SCRIPT_STARTED 0.2147483647 - - 2
1401275903 final POST_SCRIPT_TERMINATED 0.2147483647 - - 2
1401275903 final POST_SCRIPT_FAILURE 0.2147483647 - - 2
1401275908 INTERNAL *** DAGMAN_FINISHED 1 ***

I believe the value '2' here is the sequence number, not $DAG_STATUS or the exit code from the POST script.
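
If a FINAL node turns out to be the only route, the workaround could at least record something more readable than a bare number. Below is a minimal sketch of a variant of final.sh. The function names dag_status_label and record_status are invented for illustration, and the labels are my paraphrase of the status table in the manual section linked above, so they should be checked against the manual for the version in use.

```shell
#!/bin/sh
# Hypothetical variant of final.sh: log $DAG_STATUS with a readable label.
# Labels are paraphrased from the manual's DAG_STATUS table (an assumption;
# verify against your HTCondor version).
dag_status_label() {
    case "$1" in
        0) echo "ok" ;;
        1) echo "error" ;;
        2) echo "one or more nodes failed" ;;
        3) echo "aborted by ABORT-DAG-ON" ;;
        4) echo "removed by condor_rm" ;;
        5) echo "cycle found in DAG" ;;
        6) echo "halted" ;;
        *) echo "unknown" ;;
    esac
}

# Append one line to the given log file: status, label and failed-node count.
record_status() {
    echo "DAG_STATUS=$1 ($(dag_status_label "$1")) FAILED_COUNT=$2" >> "$3"
}

# When run as the POST script with "$DAG_STATUS $FAILED_COUNT" as arguments,
# record them and exit with the status, as the original final.sh does.
if [ "$#" -ge 2 ]; then
    record_status "$1" "$2" /tmp/final.log
    exit "$1"
fi
```

The SCRIPT POST line in dag2.dag stays the same, since the script still takes $DAG_STATUS and $FAILED_COUNT as its two arguments.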

I'd like to retrieve the $DAG_STATUS value to report why a DAG failed. Is there any way to get it other than from a FINAL node's script?

Thanks,

Brian.