Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Keeping Parallel Universe job alive even node0 is done

Date: Fri, 30 Jan 2009 15:19:32 -0600
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [Condor-users] Keeping Parallel Universe job alive even node0 is done

Natarajan, Senthil wrote:

Hi,
I am trying to test a simple MPICH2 example program (using Condor 7.0.5 and MPICH2 1.0.8): MPI code that calculates the value of pi.
I am testing this with 3 nodes. As soon as node 0 is done, Condor shuts down node 1 and node 2, even though the jobs on them have not finished.
I know this is how Condor is supposed to work, but is there any workaround to keep node 0 alive until all the nodes are done?


Yes.

In the job submit file that you give to condor_submit, add the following line:


+ParallelShutdownPolicy = "WAIT_FOR_ALL"

(yes, it needs to start with a plus sign)

If the job attribute ParallelShutdownPolicy is set to the string "WAIT_FOR_ALL", then Condor will wait until every node in the parallel job has completed before it considers the job finished. If this attribute is not set, or is set to any other string, the default policy is in effect: when the first node exits, the whole job is considered done, and Condor kills all other running nodes in that parallel job.
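
For context, a complete submit file using this attribute might look roughly like the sketch below. Only the +ParallelShutdownPolicy line comes from the advice above. The wrapper script run_mpi.sh, the binary name cpi, and the output file names are hypothetical placeholders, and the remaining lines are ordinary parallel universe settings that will need adapting to your own pool.

```
# Sketch of a parallel universe submit file for a 3-node MPICH2 job.
# run_mpi.sh and cpi are hypothetical names; substitute your own
# MPI wrapper script and MPI binary.
universe = parallel
executable = run_mpi.sh
arguments = cpi
machine_count = 3

# Per-node output files; $(NODE) expands to the node number.
output = cpi.out.$(NODE)
error = cpi.err.$(NODE)
log = cpi.log

should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = cpi

# Wait for every node to exit before the job is considered finished.
+ParallelShutdownPolicy = "WAIT_FOR_ALL"

queue
```

The attribute line has to appear before the queue statement so that it is applied to the job being queued.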


Hope this helps,
Todd

References:
- [Condor-users] Keeping Parallel Universe job alive even node0 is done
  - From: Natarajan, Senthil
