Hi Erik,
Thanks for your reply.
My questions are:
1. How do I check which MPICH Condor is using to run the job?
2. What does Condor need from MPICH to run the MPI job?
3. Is there a way to monitor or observe how Condor runs the MPI job?
4. How does Condor know which machines to use?
Thanks,
Nigel
Erik Paulson wrote:
On Tue, Feb 15, 2005 at 05:04:45PM +0800, Nigel Teow wrote:
Hi,
I have installed Condor (version 6.6.8) on a cluster.
I am able to use condor_submit to run the MPI job on a single node, but when I try to run it on 2 nodes, it fails. The output files are as follows:
outfile.0
-----------
p0_28434: p4_error: Child process exited while making connection to remote process on compute-0-1.local: 0
p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32
outfile.1
-----------
rm_28438: (-) net_recv failed for fd = 3
rm_28438: p4_error: net_recv read, errno = : 104
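The errno values in the two output files can be decoded to see what kind of socket failure each side reported. The following is only a minimal sketch: the helper names `extract_errnos` and `describe` are hypothetical, and the log lines are copied from the output above. On Linux, 32 and 104 correspond to EPIPE and ECONNRESET, which would be consistent with the remote process exiting before the connection was set up, though that alone does not identify the cause.

```python
import errno
import os
import re

# Log lines copied from outfile.0 and outfile.1 above.
LOG_LINES = [
    "p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32",
    "rm_28438: p4_error: net_recv read, errno = : 104",
]

# Matches "errno = 32" as well as the "errno = : 104" variant.
ERRNO_PATTERN = re.compile(r"errno\s*=\s*:?\s*(\d+)")


def extract_errnos(lines):
    """Return the errno values found in the given log lines, in order."""
    codes = []
    for line in lines:
        match = ERRNO_PATTERN.search(line)
        if match:
            codes.append(int(match.group(1)))
    return codes


def describe(code):
    """Map an errno number to its symbolic name and message on this platform."""
    name = errno.errorcode.get(code, "unknown")
    return f"{code}: {name} ({os.strerror(code)})"


if __name__ == "__main__":
    for code in extract_errnos(LOG_LINES):
        print(describe(code))
```

The symbolic names depend on the platform where the script runs, so it should be run on the same kind of system as the compute nodes.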
That looks like an error with MPICH 1.2.5 or later. Use MPICH 1.2.4 instead.
-Erik
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users