Re: [HTCondor-users] 2 questions about job retry
- Date: Mon, 22 Aug 2022 17:04:47 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
On 8/19/22 09:45, Nicolas Arnaud wrote:
> Hello,
> I have a couple of questions about how to tune the retry of a failed DAG
> job.
> 1) What's the best way to wait a few seconds before attempting a retry?
> I've thought of using a POST script that takes $RETURN among its
> arguments and calls sleep if $RETURN is not equal to 0, but I wonder
> whether that would work and whether there is a simpler way to do
> something similar.
The POST script approach is not bad, and would be my first recommendation.
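As a rough sketch of that POST script idea: the script name, the node name A, the retry count, and the 60-second default delay below are all made up for illustration. One thing to keep in mind is that when a node has a POST script, DAGMan judges the node by the POST script's exit status, so the script should hand the job's failure back after sleeping, or RETRY will never trigger.

```shell
#!/bin/sh
# post_retry_delay.sh: hypothetical POST script for a DAG node.
# Assumed DAG file lines (node name A and the retry count are illustrative):
#   SCRIPT POST A post_retry_delay.sh $RETURN
#   RETRY A 3

# Seconds to wait after a failure (assumed default; override via the environment).
RETRY_DELAY="${RETRY_DELAY:-60}"

# Sleep only when the node job failed, then report the failure back to DAGMan.
delay_on_failure() {
    job_status="$1"
    if [ "$job_status" -eq 0 ]; then
        return 0
    fi
    sleep "$RETRY_DELAY"
    # $RETURN may be negative (for example, when the job died on a signal);
    # map anything outside 1..255 to a plain failure status of 1.
    if [ "$job_status" -ge 1 ] && [ "$job_status" -le 255 ]; then
        return "$job_status"
    fi
    return 1
}

# The script's exit status is the status of this last command.
delay_on_failure "${1:-0}"
```

With this in place a successful job passes straight through, while a failed one delays the node's failure (and therefore the retry) by RETRY_DELAY seconds.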
> 2) When a job retries, I would like it *not* to run where the failed
> job ran. Searching the web led me to add the line
>     requirements = Machine =!= LastRemoteHost
> to the submit file referenced by the JOB command in the DAG file,
> but that doesn't seem to work. More often than not, the job reruns in
> the same place (same machine and same slot) as the failed attempt.
There's a somewhat over-engineered recipe for this, which also supports
hold and release, at
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAutoRetryElsewhere
But the basic mechanism of using the Requirements expression should be
what you need.
-greg