Re: [Condor-users] How to troubleshoot MPI job


Date: Tue, 15 Feb 2005 13:14:37 -0600
From: Erik Paulson <epaulson@xxxxxxxxxxx>
Subject: Re: [Condor-users] How to troubleshoot MPI job
On Tue, Feb 15, 2005 at 05:04:45PM +0800, Nigel Teow wrote:
> Hi,
> 
> I have installed Condor (version 6.6.8) on a cluster.
> 
> I am able to use condor_submit to run the MPI job on a single node, but 
> when I try to run it on 2 nodes, it fails. The output files follow:
> 
> outfile.0
> -----------
> p0_28434:  p4_error: Child process exited while making connection to 
> remote process on compute-0-1.local: 0
> p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32
> 
> outfile.1
> -----------
> rm_28438: (-) net_recv failed for fd = 3
> rm_28438:  p4_error: net_recv read, errno = : 104
> 

That looks like the error you get with MPICH 1.2.5 or later. Use MPICH 1.2.4 instead.
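
The errno values in the quoted output are ordinary socket errors, not anything Condor-specific: on Linux, 32 is EPIPE and 104 is ECONNRESET, which fits one side of the p4 connection going away during startup. Below is a minimal Python sketch for pulling those numbers out of the output files. It assumes the nodes run Linux (other systems number these errors differently), and the function name and lookup table are made up for illustration; they are not part of Condor or MPICH.

```python
import re

# Assumption: the cluster nodes run Linux, where these errno numbers have
# the meanings below. Other operating systems number them differently.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
}

# Matches "errno = 32" as well as the "errno = : 104" form seen in outfile.1.
ERRNO_PATTERN = re.compile(r"errno\s*=\s*:?\s*(\d+)")


def explain_errnos(log_text):
    """Return (number, name, description) for each errno found in log_text."""
    found = []
    for match in ERRNO_PATTERN.finditer(log_text):
        number = int(match.group(1))
        # Numbers missing from the small table are reported as unknown.
        name, description = LINUX_ERRNO.get(number, ("unknown", "not in table"))
        found.append((number, name, description))
    return found


if __name__ == "__main__":
    # Two lines copied from the output files quoted above.
    sample = (
        "p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32\n"
        "rm_28438:  p4_error: net_recv read, errno = : 104\n"
    )
    for number, name, description in explain_errnos(sample):
        print(f"errno {number}: {name} ({description})")
```

Neither error says why the remote process exited; they only show that the connection between the two nodes was torn down.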

-Erik
