Hi experts (Cole),
With reference to the special DAGMan macros described in
https://htcondor.readthedocs.io/en/latest/automated-workflows/dagman-advance-functionality.html#referencing-macros-within-a-definition
we have a problem with how $RETRY behaves after a rescue DAG is
run.
We pass the node retry value $RETRY to our POST script so that the
script can tell whether this is its first invocation for a given
node execution or whether it is being rerun after a deferral.
Here is the relevant part of a typical DAGMan file:
JOB Job1 Job.1.submit
SCRIPT DEFER 4 1800 POST Job1 dag_bootstrap.sh POSTJOB $JOBID $RETURN $RETRY $MAX_RETRIES ... [other arguments]
RETRY Job1 2 UNLESS-EXIT 2
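
To make the argument order concrete, the POST script receives the
macros as positional arguments, roughly as in the sketch below. This
is only an illustration in Python, not our actual dag_bootstrap.sh
(which is a shell script); the names PostArgs and parse_post_args are
hypothetical, and the elided trailing arguments are simply ignored.

```python
from typing import NamedTuple, Sequence


class PostArgs(NamedTuple):
    """Hypothetical container for the macros passed on the SCRIPT line."""
    job_id: str       # $JOBID
    job_return: int   # $RETURN
    retry: int        # $RETRY
    max_retries: int  # $MAX_RETRIES


def parse_post_args(argv: Sequence[str]) -> PostArgs:
    """Parse the arguments that follow the script name on the SCRIPT line.

    argv[0] is expected to be the literal keyword POSTJOB; anything after
    the four macros (the elided "other arguments") is not interpreted here.
    """
    if len(argv) < 5 or argv[0] != "POSTJOB":
        raise ValueError("expected: POSTJOB JOBID RETURN RETRY MAX_RETRIES ...")
    return PostArgs(
        job_id=argv[1],
        job_return=int(argv[2]),
        retry=int(argv[3]),
        max_retries=int(argv[4]),
    )
```

With this layout, the value we are asking about is the `retry` field.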
Once the node has used up its retries, DAGMan exits and writes the
rescue DAG (let's assume for simplicity that there is only one
node).
We then run the rescue DAG by starting DAGMan again, making sure
that DAGMAN_RESET_RETRIES_UPON_RESCUE is set to True.
Nodes are then tried again up to 2 more times, so the reset works as
desired in that respect, but the value of the $RETRY macro is not
reset: in our case it stays at 2, apparently forever.
Is this a bug, or are we misinterpreting how $RETRY is meant to work?
Of course we could track in some other way whether a specific
invocation of a POST script is a rerun after a deferral, but we
think that getting this information from DAGMan itself would be
more robust.
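
For reference, the kind of workaround we have in mind is sketched
below: a marker file keyed on the node name and $JOBID, so that the
first POST invocation for a given job creates the marker and any
rerun after a deferral finds it already there. This is a minimal
Python sketch, not what we run today; the function name, the marker
naming scheme and the default state directory are all hypothetical,
and it assumes that $JOBID differs between retries of the same node
while staying the same across deferred reruns of the POST script.

```python
import os
import tempfile

# Hypothetical default location for the marker files.
DEFAULT_STATE_DIR = os.path.join(tempfile.gettempdir(), "dag_post_state")


def is_first_invocation(node: str, job_id: str,
                        state_dir: str = DEFAULT_STATE_DIR) -> bool:
    """Return True the first time this (node, job_id) pair is seen.

    Later calls with the same pair, e.g. a POST script rerun after a
    deferral, return False because the marker file already exists.
    """
    os.makedirs(state_dir, exist_ok=True)
    marker = os.path.join(state_dir, f"{node}.{job_id}.seen")
    try:
        # O_CREAT | O_EXCL creates the file atomically and fails if it exists.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

This does not depend on $RETRY at all, which is why it would be
unaffected by the behaviour described above, at the cost of keeping
extra state outside DAGMan.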
Thanks
Stefano