Re: [Condor-users] Rescue DAG and clusters
- Date: Thu, 12 Jul 2012 10:47:39 -0500
- From: Nathan Panike <nwp@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Rescue DAG and clusters
On Wed, Jul 11, 2012 at 10:15:25PM +0100, Brian Candler wrote:
> The documentation for DAGman says:
>
> "The failure of a single job within a cluster of multiple jobs (within a
> single node) causes the entire cluster of jobs to fail. Any other jobs
> within the failed cluster of jobs are immediately removed."
>
> A simple test confirms this to be the case:
>
> ==> A.submit <==
> cmd = /bin/sleep
> args = 20
> queue 10
>
> ==> B.submit <==
> cmd = /bin/sleep
> args = 30
> queue 10
>
> ==> test.dag <==
> JOB A A.submit
> JOB B B.submit
> PARENT A CHILD B
>
> Killing any one of the 'sleep' condor_exec processes causes the others to be
> killed, and a restart of the dag causes all the processes in that cluster to
> be restarted from scratch.
>
> So suppose job A and job B are doing useful work (e.g. a cluster processing
> N files in parallel), and I need all the job A's to complete before the job
> B's start, but I want to retry individual failed jobs from A or B.
> What's the best way to do this?
>
> As far as I can see, I need to write out an explicit set of nodes and the
> dependencies between them.
>
> # A.submit
> ...
> queue 1
>
> # B.submit
> ...
> queue 1
>
> # A.dag
> JOB A0 A.submit
> VARS A0 runnumber="0"
> JOB A1 A.submit
> VARS A1 runnumber="1"
> ...
> JOB A9 A.submit
> VARS A9 runnumber="9"
>
> # B.dag
> JOB B0 B.submit
> VARS B0 runnumber="0"
> JOB B1 B.submit
> VARS B1 runnumber="1"
> ...
> JOB B9 B.submit
> VARS B9 runnumber="9"
>
> # test2.dag
> SUBDAG EXTERNAL A A.dag
> SUBDAG EXTERNAL B B.dag
> PARENT A CHILD B
This is the best way to do it.
>
> I've tested this and it works - but I have had to enumerate all 20 jobs
> explicitly, instead of just having 2 clusters of 10 jobs. Is there any neat
> way to avoid this, similar to the "queue N" parameter in a cluster?
>
The format above is not too bad, as it is easy to write a script to generate
such a DAG.
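For example, something along these lines would write out the two DAG files
(an untested sketch; it assumes the same A.submit/B.submit and ten runs as
in your example, so adjust the names and count to taste):

#!/usr/bin/env python
# Untested sketch: write one JOB/VARS pair per run, producing DAG files
# shaped like the A.dag/B.dag examples above.

def write_dag(dag_name, submit_file, prefix, count):
    with open(dag_name, "w") as f:
        for i in range(count):
            # One node per run, passing the run number through VARS.
            f.write('JOB %s%d %s\n' % (prefix, i, submit_file))
            f.write('VARS %s%d runnumber="%d"\n' % (prefix, i, i))

if __name__ == "__main__":
    write_dag("A.dag", "A.submit", "A", 10)
    write_dag("B.dag", "B.submit", "B", 10)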
> Also, it's a bit slow to start. The first condor_dagman sits around for
> about 10-15 seconds, and then starts the inner condor_dagman. That also
> sits around for 10-15 seconds, before it starts submitting the 'A' jobs.
> When those have completed, it takes a while to spawn the second inner
> condor_dagman, and then some more time before the 'B' jobs.
>
Yes, that is because if you use SUBDAG EXTERNAL, it has to spawn another
condor_dagman, which waits again before submitting. Splicing does not incur
the cost of another condor_dagman.
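With splices, the top-level DAG would look something like this (same A.dag
and B.dag as above; the splice contents are parsed into the parent DAG
rather than run under their own condor_dagman):

# test2.dag, spliced version
SPLICE A A.dag
SPLICE B B.dag
PARENT A CHILD B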
> Replacing "SUBDAG EXTERNAL" with "SPLICE" seems to help by getting rid of
> the second layer of condor_dagman.
>
> Is there any other parameter I can tweak to speed up the launching of jobs?
>
> Thanks,
>
> Brian.