Re: [HTCondor-users] 2 questions about job retry

Date: Mon, 22 Aug 2022 17:04:47 -0500
From: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] 2 questions about job retry

On 8/19/22 09:45, Nicolas Arnaud wrote:

Hello,
I have a couple of questions about how to tune the retry of a failed DAG job.
1) What's the best way to wait some seconds before attempting a retry?
I've thought of using a POST script that would have $RETURN among its arguments and call |sleep| if $RETURN is not equal to 0, but I wonder whether that would work and whether there is a simpler way to do something similar.



This is not bad, and would be my first recommendation.
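
A minimal sketch of such a POST script, in case it helps. The script name (post_retry_delay.sh), the node name (NodeA), the retry count, and the RETRY_DELAY environment variable are all made up for illustration; only $RETURN, SCRIPT POST and RETRY are DAGMan features. It assumes the DAG file contains something like "SCRIPT POST NodeA post_retry_delay.sh $RETURN" together with "RETRY NodeA 3".

```shell
#!/bin/sh
# Hypothetical POST script: post_retry_delay.sh
# Assumed DAG file lines (node name and retry count are placeholders):
#   SCRIPT POST NodeA post_retry_delay.sh $RETURN
#   RETRY NodeA 3

# post_delay STATUS
# Sleeps when the job's exit status is non-zero, then reports failure so
# that DAGMan marks the node as failed and applies RETRY.
# RETRY_DELAY (seconds) is an assumed knob, not an HTCondor setting.
post_delay() {
    job_status="$1"
    delay="${RETRY_DELAY:-60}"

    if [ "$job_status" -ne 0 ]; then
        # Job failed: wait before the node is retried.
        sleep "$delay"
        # Any non-zero exit from the POST script fails the node.
        return 1
    fi

    # Job succeeded: no delay, node succeeds.
    return 0
}

# Only act when called with an argument, as DAGMan would call it.
if [ "$#" -gt 0 ]; then
    post_delay "$1"
    exit $?
fi
```

Note that the script collapses every failure to exit code 1, so it would need adjusting if the DAG relies on the exact exit value (for example with RETRY ... UNLESS-EXIT).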

2) When a job retries, I would like it *not* to run where the failed job has run. Searching on the web led me to adding the line
requirements = Machine =!= LastRemoteHost
to the submit file that is called by the JOB command in the DAG file, but that doesn't seem to work. More often than not, the job reruns in the same place (same machine and same slot) as the failed try.

There's a somewhat over-engineered version of this that supports hold and release at


https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAutoRetryElsewhere

But the basic mechanisms with the Requirements should be what you need.


-greg
