$CondorVersion: 8.2.1 Jun 27 2014 BuildID: 256063 $
$CondorPlatform: x86_64_Windows8 $
I don't understand the RETRY keyword in DAGMan.
We have a task that runs 500-2000 jobs in our Windows 7 pool. All of them must succeed for the task to be successful (it is a calibration of a numerical model), so I want to retry any jobs that fail. After reading the manual, I put, for instance,
JOB 0 dsm2.sub
VARS 0 JOBNO="$(JOB)"
RETRY 0 3
.....
and so forth for all jobs in the .dagman file, with the intention that any job that failed would be retried up to 3 times.
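For reference, the pattern above is what a small generator script would produce. This is only a sketch of how such a file could be written, not the script actually used here: the function name dag_lines and its parameters are made up for illustration, and the nodes are assumed to be independent and numbered from 0.

```python
def dag_lines(n_jobs, submit_file="dsm2.sub", retries=3):
    """Return DAG file lines for n_jobs independent nodes named 0..n_jobs-1.

    Hypothetical helper: each node gets the same JOB/VARS/RETRY triple
    shown in the excerpt above.
    """
    lines = []
    for node in range(n_jobs):
        lines.append(f"JOB {node} {submit_file}")
        # $(JOB) is left as literal text for DAGMan to expand.
        lines.append(f'VARS {node} JOBNO="$(JOB)"')
        lines.append(f"RETRY {node} {retries}")
    return lines


if __name__ == "__main__":
    # Print the lines for the first two nodes only.
    print("\n".join(dag_lines(2)))
```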
Well, two jobs did fail (from the rescue file):
# Total number of Nodes: 532
# Nodes premarked DONE: 530
# Nodes that failed: 2
# 164,280,<ENDLIST>
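To pull the failed node names out of that header without reading it by eye, something like the following could be used. It is a minimal sketch that assumes the failed-node list sits on a single comment line ending in <ENDLIST>, as in the excerpt above; the function name failed_nodes is made up, and a longer list that wraps onto several lines would need more handling.

```python
def failed_nodes(rescue_text):
    """Return the node names listed on the '# ...,<ENDLIST>' comment line.

    Assumption: the list is on one line, comma-separated, and terminated
    by the literal marker <ENDLIST>, as in the rescue file excerpt.
    """
    for line in rescue_text.splitlines():
        if line.startswith("#") and "<ENDLIST>" in line:
            body = line.lstrip("#").replace("<ENDLIST>", "")
            return [name.strip() for name in body.split(",") if name.strip()]
    return []


if __name__ == "__main__":
    header = (
        "# Total number of Nodes: 532\n"
        "# Nodes premarked DONE: 530\n"
        "# Nodes that failed: 2\n"
        "# 164,280,<ENDLIST>\n"
    )
    print(failed_nodes(header))
```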
But when I re-submitted the .dagman file, it re-ran all jobs. Is this because all of them were marked to retry? (From the same rescue file):
DONE 0
RETRY 0 3
DONE 1
RETRY 1 3
DONE 2
RETRY 2 3
DONE 3
.....
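To double-check which nodes the rescue file actually marks as done, a short script along these lines could be run against it. This is a sketch under the assumption that each finished node appears on its own line as "DONE <name>", as in the excerpt above; the function name done_nodes is made up for illustration.

```python
def done_nodes(rescue_text):
    """Return the set of node names that appear on 'DONE <name>' lines.

    Assumption: one DONE entry per line, keyword first, node name second.
    Comment lines and other keywords (for example RETRY) are ignored.
    """
    done = set()
    for line in rescue_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "DONE":
            done.add(fields[1])
    return done


if __name__ == "__main__":
    body = "DONE 0\nRETRY 0 3\nDONE 1\nRETRY 1 3\nDONE 2\nRETRY 2 3\nDONE 3\n"
    marked = done_nodes(body)
    print(len(marked), "nodes marked DONE")
    # The two failed nodes from the header should not be in this set.
    print("164" in marked, "280" in marked)
```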