
Re: [HTCondor-users] DAGMAN $RETRY macro appears stuck



Hi Stefano,

The issue is the difference in the amount of state restored between rescue and recovery. In the recovery case the full state of the DAG is restored, which is what allowed your POST script to check for previous executions. For rescue, only the state of successfully completed nodes, and potentially the number of remaining retries for partially executed nodes, is restored. Otherwise, this iteration of the DAG is a fresh invocation (thus leading to the number of retries being zero). Currently there is no first-class way in DAGMan to discern whether a node already executed during a previous DAG execution.

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Stefano Belforte via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Monday, November 3, 2025 10:04 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Stefano Belforte <stefano.belforte@xxxxxxx>
Subject: [HTCondor-users] DAGMAN $RETRY macro appears stuck
 

Hi experts (Cole),

I am writing with reference to the Special DAGMan Macros described in

https://htcondor.readthedocs.io/en/latest/automated-workflows/dagman-advance-functionality.html#referencing-macros-within-a-definition

We have a problem with how $RETRY works after a rescue DAG is triggered.

We use the node retry value $RETRY in our POST script so that it can tell if this
was the first invocation for a given node execution or it is being run in deferral mode.

Here's the relevant part of a typical DAGMan file:

JOB Job1 Job.1.submit
SCRIPT DEFER 4 1800 POST Job1 dag_bootstrap.sh POSTJOB $JOBID $RETURN $RETRY $MAX_RETRIES ... [other arguments]
RETRY Job1 2 UNLESS-EXIT 2

When the node fails 2 times, DAGMan exits and writes the Rescue DAG (let's assume for
simplicity that there is only one node).

We then trigger it into execution by starting DAGMan again while ensuring that
DAGMAN_RESET_RETRIES_UPON_RESCUE is set to True.

Nodes are tried again up to 2 more times, i.e. the reset works as desired in that respect,
but the $RETRY macro value is not reset and sticks (in our case) at 2 apparently forever.

Is this a bug? Or an incorrect interpretation on our side?

Of course we could track in some other way whether a specific iteration of a POST script has called for being rerun after deferral or not, but we think that if we could get this information from DAGMan itself, it would be more solid.
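
To make "some other way" concrete, here is a minimal sketch of tracking it outside DAGMan, written in Python for illustration. It assumes the POST script can write to a per-workflow state directory and that $JOBID stays the same across deferred reruns of the POST script for one job execution. The function names, the state directory and the file layout are all hypothetical; none of this is a DAGMan feature.

```python
import json
import os
import tempfile


def count_post_invocation(state_dir, node, job_id):
    """Record one POST-script run for (node, job_id) and return the running total.

    Hypothetical helper: the count lives in our own JSON file, so it does not
    depend on $RETRY and survives a restart from a Rescue DAG.
    """
    os.makedirs(state_dir, exist_ok=True)
    path = os.path.join(state_dir, f"{node}.post_state.json")
    try:
        with open(path) as fh:
            state = json.load(fh)
    except FileNotFoundError:
        state = {}  # first POST run ever seen for this node
    state[job_id] = state.get(job_id, 0) + 1
    # Write to a temporary file and rename it, so a crash cannot leave
    # a half-written state file behind.
    fd, tmp_path = tempfile.mkstemp(dir=state_dir)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp_path, path)
    return state[job_id]


def is_deferred_rerun(state_dir, node, job_id):
    """True if the POST script already ran at least once for this job execution."""
    return count_post_invocation(state_dir, node, job_id) > 1
```

The POST script would call is_deferred_rerun() once per invocation, passing the node name and the $JOBID value it already receives as an argument. A new retry of the node submits a new job and so starts from a fresh count, whether or not a Rescue DAG was involved.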

Thanks

Stefano