Re: [HTCondor-users] Resubmitting failed job to a different site



> On Feb 8, 2023, at 4:52 AM, Jeff Templon <templon@xxxxxxxxx> wrote:
> 
> A user just asked me this question:
> 
> Using OSG sites, some jobs fail, if I just resubmit them they usually are retried at the same site and fail again.  Is there a way to ask to retry a job and explicitly avoid it being resubmitted to the same location?
> 
> PS: Not sure if it's relevant, but they're using PEGASUS.  He showed me a pegasus config file which was full of GLIDEIN directives.


This is hard to do automatically if the job leaves the queue and is then resubmitted. There's no good way within HTCondor to tie the two job instances together and have the execution history of the first instance influence how the second instance matches and runs.
It can be done by manually editing the submit file before resubmission, but that's hard to automate.

If the failure is easy to detect via the job exit code, you can use the retry directives (max_retries, retry_until, success_exit_code). That leaves the job in the queue after a failure, so the job can execute again. Then there are ways to record in the job ad where it ran before and avoid those places for future runs of the job.
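
As a rough sketch of that approach, a submit file along the following lines combines the retry directives with the job_machine_attrs and job_machine_attrs_history_length submit commands, which copy selected machine attributes into the job ad for each run attempt. It assumes the execution slots advertise their site in an attribute named GLIDEIN_Site (a common convention for glideins, but worth checking with condor_status -long at your pool), and the executable name is a placeholder. Treat it as a starting point to adapt, not a tested recipe.

```
# Hypothetical job; the executable name is a placeholder.
executable = run_analysis.sh

# Treat only exit code 0 as success and retry other exits up to 3 times.
# The job stays in the queue between attempts.
success_exit_code = 0
max_retries = 3

# Record the site of each run attempt in the job ad. The values appear as
# MachineAttrGLIDEIN_Site0 (most recent), MachineAttrGLIDEIN_Site1, and so on.
# GLIDEIN_Site is an assumed attribute name; use whatever your slots advertise.
job_machine_attrs = GLIDEIN_Site
job_machine_attrs_history_length = 3

# Only match slots whose site differs from the recorded previous sites.
# Before the first run the MachineAttr* values are undefined, so any slot
# that defines GLIDEIN_Site passes the =!= comparisons.
requirements = (TARGET.GLIDEIN_Site =!= MY.MachineAttrGLIDEIN_Site0) && \
               (TARGET.GLIDEIN_Site =!= MY.MachineAttrGLIDEIN_Site1) && \
               (TARGET.GLIDEIN_Site =!= MY.MachineAttrGLIDEIN_Site2)

queue
```

The history length is set to match max_retries so that every earlier attempt is still visible to the requirements expression. To avoid individual machines rather than whole sites, the same pattern should work with Machine in place of GLIDEIN_Site.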

The use of Pegasus may complicate this, if it's deciding where to steer the jobs.

 - Jaime