Re: [HTCondor-users] Queries regarding reset retries in rescue dag

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Date: Wed, 22 Oct 2025 13:52:22 +0000

From: Cole Bollig <cabollig@xxxxxxxx>

Subject: Re: [HTCondor-users] Queries regarding reset retries in rescue dag

Hi Vijay,

DAGMAN_RESET_RETRIES_UPON_RESCUE should default to true. That said, it is purely a configuration option, so there is no command-line equivalent. If this is the only value you are setting (and thus you don't need a full config file), you can set it via an environment variable using the ENV command in the DAG file: ENV SET _CONDOR_DAGMAN_RESET_RETRIES_UPON_RESCUE=True.
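
To make the ENV approach concrete, here is a minimal Python sketch that assembles the text of a small DAG file containing that line. The node name A, the submit file a.sub, the retry count, and the helper name build_dag_text are hypothetical and used only for illustration; the ENV SET line itself is the one given above.

```python
def build_dag_text(node="A", submit_file="a.sub", retries=5, reset_retries=True):
    """Return the text of a tiny DAG file (hypothetical node and submit file).

    The ENV SET line passes the configuration option to DAGMan through the
    _CONDOR_ prefixed environment variable, so no CONFIG file is needed.
    """
    lines = [
        f"JOB {node} {submit_file}",
        f"RETRY {node} {retries}",
        f"ENV SET _CONDOR_DAGMAN_RESET_RETRIES_UPON_RESCUE={reset_retries}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the DAG text; redirect it into e.g. my.dag before submitting.
    print(build_dag_text(), end="")
```

The printed text can be saved as the DAG file and submitted with condor_submit_dag as usual.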

As for checking that the value is set, unfortunately the only way to verify it is to look for the option in the big dump of configuration settings at the top of the *.dagman.out file. If you use the CONFIG command, the line 'Using DAGMan config file: <provided config file>' should appear before the dump of all configuration settings.
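
If searching the log by eye is tedious, a small Python sketch like the following can pull out the relevant lines. It assumes nothing about the exact layout of the configuration dump: it only looks for the 'Using DAGMan config file:' text quoted above and for any line that mentions the option name. The function name find_dagman_settings and the file name my.dag.dagman.out are hypothetical.

```python
def find_dagman_settings(log_text, option="DAGMAN_RESET_RETRIES_UPON_RESCUE"):
    """Scan the text of a *.dagman.out file.

    Returns a pair: (config files reported by DAGMan,
                     lines that mention the given option name).
    """
    marker = "Using DAGMan config file:"
    config_files = []
    option_lines = []
    for line in log_text.splitlines():
        if marker in line:
            # Keep whatever follows the marker as the reported file name.
            config_files.append(line.split(marker, 1)[1].strip())
        if option in line:
            option_lines.append(line.strip())
    return config_files, option_lines


if __name__ == "__main__":
    import sys

    # Hypothetical default file name; pass the real path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "my.dag.dagman.out"
    with open(path, encoding="utf-8", errors="replace") as handle:
        files, matches = find_dagman_settings(handle.read())
    print("Config files:", files)
    print("Option lines:", matches)
```

An empty list of config files suggests the CONFIG command was not picked up, and an empty list of option lines means the option does not appear anywhere in the log.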

I should update this line of documentation, as the RETRY line is always added to the rescue file regardless of when the rescue file is generated, i.e. whether or not all retries have been exhausted.

Yes, the Rescue DAG file only has retry entries for nodes that declared so in the original DAG file.
As I mentioned in B), the RETRY line for a node is always written to the rescue file, whether or not all retries have been exhausted (I may change this after this discussion). When the configuration option is true, the RETRY statement is written into the rescue file with the same value declared in the original DAG file. If the option is false, the RETRY entry has the number of remaining retries. For example, if node A is declared to retry five times and has already used three retries, the rescue file will say RETRY A 2, since two retries remain.
The PRE/POST scripts are only written when partial rescue files are disabled. This is because full and partial rescue files work differently. The full rescue file is supposed to replace the original DAG file during resubmission, i.e. condor_submit_dag my.dag -> condor_submit_dag my.dag.rescue001, while the partial rescue file works as just a state file to be digested after the original DAG is parsed. Thus, with partial rescues, we get the node scripts from the original DAG file and restore state from the rescue file (which nodes have completed successfully and what retries exist).
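
The rule for the RETRY value written to the rescue file can be summarized in a few lines of Python. This is only an illustration of the behavior described above, not DAGMan's actual code; the function name rescue_retry_value and the clamp at zero are assumptions made for the sketch.

```python
def rescue_retry_value(declared, used, reset_upon_rescue=True):
    """Value written after RETRY <node> in the rescue file (illustrative only).

    declared: retries declared for the node in the original DAG file
    used: retries the node has already exhausted
    reset_upon_rescue: value of DAGMAN_RESET_RETRIES_UPON_RESCUE
    """
    if reset_upon_rescue:
        # Option true: the declared value is written back unchanged.
        return declared
    # Option false: only the remaining retries are written.
    # Clamping at zero is an assumption of this sketch.
    return max(declared - used, 0)


if __name__ == "__main__":
    # Node A declared with five retries, three already used.
    print("RETRY A", rescue_retry_value(5, 3, reset_upon_rescue=False))
    print("RETRY A", rescue_retry_value(5, 3, reset_upon_rescue=True))
```

With the option true, the rescue file therefore looks identical to the original RETRY line, which is why the reset can appear not to have happened when inspecting the file.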

As an aside, full rescue file support has been deprecated because DAGMan does not rewrite all the commands from the original DAG file, and thus the DAG will change between executions. Plus, the partial rescue is cleaner and handled automatically for the user.

Sorry for the massive wall of text. Hope this helps and let me know if you have any follow-up questions.

Cheers,

Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vijay Chakravarty via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, October 22, 2025 6:52 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Vijay Chakravarty <vijay.chakravarty@xxxxxxx>
Subject: [HTCondor-users] Queries regarding reset retries in rescue dag

Dear HTCondor experts,

We have been trying to make use of the configuration variable DAGMAN_RESET_RETRIES_UPON_RESCUE by setting it to True in a config file that we refer to with the CONFIG command in the DAG file. However, it doesn't seem to make any difference to the retries. We have a simple RETRY Job <no. of retries> command, along with some PRE and POST scripts, and we want the retries to reset once they are exhausted and the failure persists, leading to a rescue file being written. We also don't find the variable set in the dagman.out output.

In the backdrop of this problem, I have the following questions:

A.) Is there a conclusive way to make sure our config file is being read and the config variable being set to True besides checking for the variable in the output file? Or an alternative way to set this variable?

B.) The manual mentions the line "If the Rescue DAG file is generated before all retries of a node are completed, then the Rescue DAG file will also contain RETRY entries."

I want to confirm that this "RETRY entries" is just copying the RETRY lines from the dagman script. For instance, if my DAG file didn't have a RETRY command to begin with, would I still get a RETRY entry in the rescue?
Secondly, the word "before": our use case envisions using the latest rescue DAG file generated after all the retries are completed for the first time. Once the retries are completed, we do a resubmission using the latest rescue file, for which we want the retries reset, so that the full number of retries effectively runs again. Is that not possible? What we currently find is that rescue files have only the RETRY line (the same as the one we pass in the actual DAG). It seems that it is copying the RETRY entry from the original DAG and the reset never happens.
Also, it's my understanding that the rescue file should copy the original DAG PRE Job and POST lines verbatim in the rescue file from the dag file. However, we find that the rescue file has only the RETRY line after the header comments about failed DAG execution. I suspect since the Rescue file has no other line it is not working as expected. Maybe this can be resolved by setting DAGMAN_WRITE_PARTIAL_RESCUE to False.

If we could understand how to set the Configuration variables, it would be very helpful.

Any insight will be highly appreciated.

Thanks!

Cheers,

Vijay Chakravarty
