Todd,
I probably need to go over my sub files.
Thank you very much.
Romain
I think it will be great.
FYI, I don't remember the exact configuration.
The logic would be:
1. Put jobs on hold when exit code != 0
2. Autorelease jobs when exit code != 0 and the number of
starts is less than 3
This should work.
I suggest you create a single submit file and run a
script that exits with a given status code. You will be
able to check it quickly.
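A minimal sketch of such a test submit file follows. The file
names and the script fail_test.sh are hypothetical, and the
expressions are an assumption about how the two rules above
could be written, so check them against the condor_submit
documentation for your version:

```
# test_retry.sub (hypothetical file name)
# fail_test.sh is a hypothetical script that ends with "exit 1"
executable = fail_test.sh
output     = test.$(Cluster).out
error      = test.$(Cluster).err
log        = test.$(Cluster).log

# 1. Put the job on hold when it exits with a nonzero code
on_exit_hold = (ExitBySignal == False) && (ExitCode != 0)

# 2. Release the job again while it has started fewer than 3 times.
#    HoldReasonCode 3 restricts this to holds caused by a job policy
#    expression, so jobs held by hand are left alone.
periodic_release = (HoldReasonCode == 3) && (NumJobStarts < 3)

queue
```

Periodic expressions are evaluated at intervals, so the release
is not immediate. With this policy a job that fails three times
stays in the hold state, where it can be inspected.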
Have a look at the documentation.
Drop me a note if you need an example.
Thanks
David
Hi folks,
Just chiming in here:
While David's suggestion above would work, there is no
need to place jobs on hold and autorelease jobs...
current versions of HTCondor have an easier to
use/understand mechanism to simply retry jobs that exit
without a successful exit code. In the condor_submit man
page at
https://htcondor.readthedocs.io/en/feature/man-pages/condor_submit.html
take a look at the definitions for max_retries,
success_exit_code, and retry_until. Also take a look at
the following section in the manual:
https://htcondor.readthedocs.io/en/feature/users-manual/automatic-job-management.html?highlight=max_retries#automatically-rerunning-a-failed-job
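As a rough sketch of what this could look like in a submit
file (my_job.sh and the log file name are hypothetical, and the
values are only illustrative):

```
# Hypothetical script and log file names
executable = my_job.sh
log        = my_job.$(Cluster).log

# Exit code that counts as success
success_exit_code = 0

# Retry a failed job up to 2 times, i.e. about 3 starts in total
max_retries = 2

# Optional: stop retrying early if the job exits with this code
# retry_until = 2

queue
```

Here max_retries = 2 roughly corresponds to the "fewer than 3
starts" rule from the hold/release approach, without the job
ever entering the hold state.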
The decision to place retry policy directly into your job
submission, or alternatively to use DAGMan to manage
retries, largely depends on if your job requires the PRE
and POST script functionality that DAGMan brings to the
table. For example, if you can determine if your job
succeeded or failed based on just the exit status or other
attributes reflected in the job classad (like runtime, for
instance), then there is likely no need to involve DAGMan to handle
retries and you can simply specify max_retries and/or
retry_until in your job submit file. On the other hand,
if determining if a job succeeded requires running some
procedural code (e.g. a script that does some sanity and
completeness checking on the output files), then using
DAGMan's retry functionality in concert with POST scripts
is what I would recommend.
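A minimal sketch of the DAGMan variant, assuming a hypothetical
submit file my_job.sub and a hypothetical checking script
check_output.sh that exits nonzero when the output files look
incomplete:

```
# retry.dag (hypothetical file name)
JOB A my_job.sub

# Run the check after the job finishes; $RETURN passes the job's exit code
SCRIPT POST A check_output.sh $RETURN

# Rerun node A up to 2 more times if the POST script reports failure
RETRY A 2
```

When a node has a POST script, the exit status of that script
decides whether the node succeeded, so the retry is driven by
the output check rather than by the job's exit code alone.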
Hope the above helps
Todd