[Condor-users] Rescue DAG and clusters
- Date: Wed, 11 Jul 2012 22:15:25 +0100
- From: Brian Candler <B.Candler@xxxxxxxxx>
- Subject: [Condor-users] Rescue DAG and clusters
The documentation for DAGMan says:
"The failure of a single job within a cluster of multiple jobs (within a
single node) causes the entire cluster of jobs to fail. Any other jobs
within the failed cluster of jobs are immediately removed."
A simple test confirms this to be the case:
==> A.submit <==
cmd = /bin/sleep
args = 20
queue 10
==> B.submit <==
cmd = /bin/sleep
args = 30
queue 10
==> test.dag <==
JOB A A.submit
JOB B B.submit
PARENT A CHILD B
Killing any one of the 'sleep' condor_exec processes causes the others to be
killed, and restarting the DAG causes every job in that cluster to be rerun
from scratch.
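For reference, this is roughly how I ran the test (PID elided; with
auto-rescue enabled, resubmitting the same file picks up the rescue DAG):

condor_submit_dag test.dag
# once the A jobs are running, pick one condor_exec/sleep process...
ps ax | grep condor_exec
# ...and kill it by hand:
kill <pid>
# DAGMan removes the rest of cluster A, the DAG fails, and a rescue
# DAG is written; resubmitting then reruns all of cluster A:
condor_submit_dag test.dag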
So suppose job A and job B are doing useful work (e.g. a cluster processing
N files in parallel), and I need all of the A jobs to complete before the B
jobs start, but I want to be able to retry individual failed jobs from A or
B. What's the best way to do this?
As far as I can see, I need to write out an explicit set of nodes and the
dependencies between them.
# A.submit
...
queue 1
# B.submit
...
queue 1
# A.dag
JOB A0 A.submit
VARS A0 runnumber="0"
JOB A1 A.submit
VARS A1 runnumber="1"
...
JOB A9 A.submit
VARS A9 runnumber="9"
# B.dag
JOB B0 B.submit
VARS B0 runnumber="0"
JOB B1 B.submit
VARS B1 runnumber="1"
...
JOB B9 B.submit
VARS B9 runnumber="9"
# test2.dag
SUBDAG EXTERNAL A A.dag
SUBDAG EXTERNAL B B.dag
PARENT A CHILD B
I've tested this and it works - but I had to enumerate all 20 jobs
explicitly, instead of just having 2 clusters of 10 jobs. Is there a neat
way to avoid this, similar to the "queue N" parameter in a cluster?
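For what it's worth, I generated the repetitive entries with a throwaway
shell loop rather than typing them out, something like this (file names as
above):

# emit one JOB/VARS pair per run number into A.dag and B.dag
for n in A B; do
    : > $n.dag
    for i in 0 1 2 3 4 5 6 7 8 9; do
        echo "JOB $n$i $n.submit" >> $n.dag
        echo "VARS $n$i runnumber=\"$i\"" >> $n.dag
    done
done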
Also, it's a bit slow to start. The first condor_dagman sits around for
about 10-15 seconds before it starts the inner condor_dagman. That one also
sits around for 10-15 seconds before it starts submitting the 'A' jobs.
When those have completed, it takes a while to spawn the second inner
condor_dagman, and then some more time passes before the 'B' jobs start.
Replacing "SUBDAG EXTERNAL" with "SPLICE" seems to help by getting rid of
the second layer of condor_dagman.
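i.e. test2.dag becomes something like:

# test2.dag, spliced version
SPLICE A A.dag
SPLICE B B.dag
PARENT A CHILD B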
Is there any other parameter I can tweak to speed up the launching of jobs?
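In case it's relevant: the only knobs I've spotted in the manual so far are
the DAGMan polling/throttle ones, but I haven't confirmed these are the
right ones to change (values below are guesses):

# condor_config additions -- untested guesses
DAGMAN_USER_LOG_SCAN_INTERVAL = 1
# how often (seconds) DAGMan polls the job log; default is 5
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 20
# cap on job submissions per DAGMan event-loop pass
NEGOTIATOR_INTERVAL = 20
# how often (seconds) the negotiator starts a matchmaking cycle; default 60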
Thanks,
Brian.