Hey Nicolas,
DAGMan's RETRY is not very tunable. It has just two features: retry a node up to n times, and don't retry if the node exits with the exit code specified by the optional UNLESS-EXIT keyword. But to elaborate on your questions:
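As a minimal sketch of the two RETRY features mentioned above, a DAG file could look roughly like this. The node name A, the submit file name a.sub, the retry count 3, and the exit code 2 are placeholders chosen for illustration, not values from this thread:

```
# Hypothetical DAG file: node name and submit file are placeholders.
JOB A a.sub

# Retry node A up to 3 times if it fails,
# but do not retry at all if it exits with code 2.
RETRY A 3 UNLESS-EXIT 2
```

Without the UNLESS-EXIT part, the node would simply be retried up to 3 times on any failure.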
Best of luck,
Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Nicolas Arnaud <nicolas.arnaud@xxxxxxxxxxxxxxx>
Sent: Friday, August 19, 2022 9:45 AM
To: HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] 2 questions about job retry

Hello,

I have a couple questions about how to tune the retry of a failed DAG job.

1) What's the best way to wait some seconds before attempting a retry? I've thought of using a POST script that would have $RETURN among its arguments and call sleep if $RETURN is not equal to 0, but I wonder whether that would work and whether there is a simpler way to do something similar.

2) When a job retries, I would like it *not* to run where the failed job has run. Searching on the web led me to adding the line

> requirements = Machine =!= LastRemoteHost

to the submit file that is called by the JOB command in the DAG file, but that doesn't seem to work. More often than not, the job reruns in the same place (same machine and same slot) as the failed try.

The Condor version I am using is

> condor_version
> $CondorVersion: 9.0.11 Mar 12 2022 BuildID: 578027 PackageID: 9.0.11-1 $
> $CondorPlatform: x86_64_CentOS7 $

Thanks in advance,
Nicolas