Mailing List Archives
Re: [Condor-users] Keeping Parallel Universe job alive even node0 is done
- Date: Fri, 30 Jan 2009 15:19:32 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Keeping Parallel Universe job alive even node0 is done
Natarajan, Senthil wrote:
Hi,
I am trying to test a simple MPICH2 example program (using Condor 7.0.5 and
MPICH2 1.0.8): MPI code that calculates the value of pi.
I am testing this with 3 nodes. As soon as node 0 is done, Condor shuts
down node 1 and node 2, even though the jobs on them have not finished.
I know this is the way Condor is supposed to work, but is there any
workaround to keep node 0 alive until all the nodes are done?
Yes.
In your job submit file that you give to condor_submit, add the
following line:
+ParallelShutdownPolicy = "WAIT_FOR_ALL"
(yes, it needs to start with a plus sign; the plus tells condor_submit to
put the attribute directly into the job ClassAd)
If the job attribute ParallelShutdownPolicy is set to the string
"WAIT_FOR_ALL", then Condor will wait until every node in the parallel
job has completed before it considers the job finished. If this attribute
is not set, or is set to any other string, the default policy applies:
when the first node exits, the whole job is considered done, and Condor
kills all other running nodes in that parallel job.
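For context, here is a rough sketch of where that line could sit in a complete parallel universe submit file. Everything except the ParallelShutdownPolicy line is an assumption for illustration: the wrapper script name (mp2script, modeled on the MPICH2 example script shipped with Condor), the MPI binary name (cpi), and the output, error, and log file names are placeholders to adapt to your own setup.

```
# Sketch only: file and program names below are placeholders.
universe = parallel
# Wrapper script that starts MPICH2 and launches the MPI program (assumed name)
executable = mp2script
# The MPI binary, passed to the wrapper as its argument (assumed name)
arguments = cpi
# Three nodes, as in the original question
machine_count = 3
# One output and error file per node
output = cpi.out.$(NODE)
error = cpi.err.$(NODE)
log = cpi.log
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = cpi
# Wait for every node to finish, not just the first one to exit
+ParallelShutdownPolicy = "WAIT_FOR_ALL"
queue
```

The attribute line has to appear before the queue statement so that it is part of the job being submitted.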
Hope this helps,
Todd