
[HTCondor-users] DAGMAN $RETRY macro appears stuck



Hi experts (Cole),

I have a question about the Special DAGMan Macros described in

https://htcondor.readthedocs.io/en/latest/automated-workflows/dagman-advance-functionality.html#referencing-macros-within-a-definition

We have a problem with how the $RETRY macro behaves after a rescue DAG is triggered.

We pass the node retry value $RETRY to our POST script so that it can tell whether this
is the first invocation for a given node execution or a rerun in deferral mode.

Here's the relevant part of a typical DAGMan file:

JOB Job1 Job.1.submit
SCRIPT DEFER 4 1800 POST Job1 dag_bootstrap.sh POSTJOB $JOBID $RETURN $RETRY $MAX_RETRIES ... [other arguments]
RETRY Job1 2 UNLESS-EXIT 2
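
For illustration only, the leading positional arguments of the SCRIPT line above could be read as in the sketch below. Our real script is the shell script dag_bootstrap.sh; this Python stand-in, the PostArgs type and the parse_post_args helper are hypothetical, and everything after $MAX_RETRIES is kept as an opaque list because those arguments are not shown here.

```python
from typing import List, NamedTuple, Sequence


class PostArgs(NamedTuple):
    """Leading arguments of the POST script (hypothetical container)."""
    stage: str          # literal "POSTJOB" from the SCRIPT line
    job_id: str         # value substituted for $JOBID
    job_return: int     # value substituted for $RETURN
    retry: int          # value substituted for $RETRY
    max_retries: int    # value substituted for $MAX_RETRIES
    rest: List[str]     # remaining arguments, not interpreted here


def parse_post_args(argv: Sequence[str]) -> PostArgs:
    """Split the argument vector (without the script name) into fields."""
    stage, job_id, job_return, retry, max_retries, *rest = argv
    return PostArgs(stage, job_id, int(job_return), int(retry),
                    int(max_retries), list(rest))
```

With such a parser, the behaviour we see after the rescue DAG is that the retry field keeps the same value on every invocation.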

When the node fails 2 times, DAGMan exits and writes the Rescue DAG (let's assume for
simplicity that there is only one node).

We then run it by starting DAGMan again, making sure that
DAGMAN_RESET_RETRIES_UPON_RESCUE is set to True.

Nodes are tried again up to 2 more times, i.e. the reset works as desired in that respect,
but the $RETRY macro value is not reset and stays (in our case) at 2, apparently forever.

Is this a bug? Or an incorrect interpretation on our side?

Of course we could track in some other way whether a specific iteration of the POST script has asked to be rerun after deferral, but we think that getting this information from DAGMan itself would be more robust.
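
As a rough sketch of what such independent tracking might look like, one option is a marker file keyed on the node name and $JOBID. This assumes that $JOBID stays the same across deferred reruns of the POST script and changes when the node job is resubmitted; the helper name, the marker directory and the file naming are hypothetical and not part of DAGMan.

```python
import os


def is_first_invocation(marker_dir: str, node: str, job_id: str) -> bool:
    """Return True the first time this (node, job_id) pair is seen.

    Later calls with the same pair return False, which the POST script
    could treat as "this is a rerun after deferral".
    """
    os.makedirs(marker_dir, exist_ok=True)
    marker = os.path.join(marker_dir, f"{node}.{job_id}.post_seen")
    try:
        # O_EXCL makes creation atomic: it fails if the marker exists.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

This does not depend on $RETRY at all, but it adds state on disk that has to be cleaned up, which is why we would prefer a value supplied by DAGMan.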

Thanks

Stefano