Dear all,
I am sending you this email because I would like to know whether it is possible to retry failed jobs with Condor DAG (the condor_submit_dag command) before the rescue file is created.
Basically, I am submitting around 500 jobs plus 1 final job that is in charge of combining the results from the previous jobs.
I use the DAG feature for that and have set
 RETRY ALL_NODES 2
in the DAG input file, which retries each node up to 2 times in case a transient failure occurs.
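For reference, the structure of my DAG input file is roughly the following (names are simplified placeholders; in reality there are about 500 JOB lines and the PARENT line lists all of them):

 JOB job001 job001.sub
 JOB job002 job002.sub
 JOB job003 job003.sub
 JOB merge merge.sub
 PARENT job001 job002 job003 CHILD merge
 RETRY ALL_NODES 2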
Lately the machines I am running on have been quite unstable, so some of the 500 jobs can crash even with 2 retries (this happens when opening a text file, for instance, which is obviously a transient error). The crashes happen at the beginning, and the 500 jobs run for quite a long time, so I can easily spot the ones that failed.
I would prefer not to increase the RETRY value, because I would first like to check whether there is something wrong in my code before re-submitting the failed jobs.
So I would like to be able to resubmit the jobs that failed before all 500 jobs are "done" (either failed or finished successfully).
The problem is that the rescue file, which enables the resubmission of failed jobs, is only created at the end, once all jobs have finished (either failed or completed successfully).
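To make it concrete: if the DAG file is called, say, workflow.dag, a file like workflow.dag.rescue001 only appears once DAGMan itself has exited, and as far as I understand I would then resubmit with

 condor_submit_dag workflow.dag

which automatically picks up the most recent rescue file and re-runs only the failed nodes. I would like to trigger something similar while the rest of the DAG is still running.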