The DAG submissions and the condor master are on the same Windows 2000 Server machine. When the DAGs are submitted from a different Windows XP machine to the same condor master, even at the same time as the ones giving us problems, things seem to be OK (at least we can't recall seeing this problem with this scenario). Resubmitting the same .dag and .sub files in the same way at another time works just fine. All files are local to the submitting machine.
We have a master DAG with a couple thousand dag JOBS, that is, master.dag contains:
JOB dir1 dir1/testcase.dag.condor.sub
JOB dir2 dir2/testcase.dag.condor.sub
JOB dir3 dir3/testcase.dag.condor.sub
... and so on ...

The testcase.dag files contain:

JOB rmt_dir1 dir1/testcase.sub
SCRIPT PRE rmt_dir1 prepare.bat <args....>
SCRIPT POST rmt_dir1 process.bat <args...>

Questions:
- Has anyone else experienced this and found a solution?
- Is there something inherently wrong with submitting DAGMan jobs on the condor master?
- Is there a way to catch the failure and have the testcase.dag restarted or resubmitted?
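
In case it helps anyone reproduce the setup: the master.dag layout described above (one JOB line per test-case directory, each pointing at that directory's testcase.dag.condor.sub) can be generated with a small script. This is only a sketch of the layout, not the script we actually use; the function names and the directory list are made up for illustration.

```python
from pathlib import Path


def master_dag_lines(dirs):
    """Return one JOB line per test-case directory, in the master.dag layout."""
    return [f"JOB {d} {d}/testcase.dag.condor.sub" for d in dirs]


def write_master_dag(dirs, path="master.dag"):
    """Write the JOB lines to a DAG file (the default path is an assumption)."""
    Path(path).write_text("\n".join(master_dag_lines(dirs)) + "\n")


if __name__ == "__main__":
    # Hypothetical directory names; the real run has a couple thousand.
    print("\n".join(master_dag_lines(["dir1", "dir2", "dir3"])))
```
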
For anyone interested in delving further into this I've attached examples of testcase.sub, testcase.dag and all the outputs (including testcase.dag.dagman.out) for a run that failed (I've not included the master.dag).
Thanks, Bob Mortensen
Attachments:
- testcase.dag.lib.stdout
- testcase.dag
- testcase.sub
- testcase.dag.condor.sub
- testcase.dag.dagman.out
- testcase.dag.lib.stderr