Re: [Condor-users] How to troubleshoot MPI job
- Date: Tue, 15 Feb 2005 13:14:37 -0600
- From: Erik Paulson <epaulson@xxxxxxxxxxx>
- Subject: Re: [Condor-users] How to troubleshoot MPI job
On Tue, Feb 15, 2005 at 05:04:45PM +0800, Nigel Teow wrote:
> Hi,
>
> I have installed Condor (version 6.6.8) on a cluster.
>
> I am able to use condor_submit to run the MPI job on a single node, but
> when I try to run it on 2 nodes, it fails. The output files follow:
>
> outfile.0
> -----------
> p0_28434: p4_error: Child process exited while making connection to
> remote process on compute-0-1.local: 0
> p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32
>
> outfile.1
> -----------
> rm_28438: (-) net_recv failed for fd = 3
> rm_28438: p4_error: net_recv read, errno = : 104
>
That looks like an error with MPICH 1.2.5 or later. Use 1.2.4.

-Erik
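
For anyone reading the quoted output, the two errno values can be decoded with a small script. This is only a sketch, assuming the cluster nodes run Linux (errno numbers differ on other systems); `LINUX_ERRNO`, `ERRNO_RE`, and `explain_errno` are names made up for this example, not part of Condor or MPICH.

```python
import re

# Linux errno values. Assumption: the nodes run Linux; the numbers
# are different on other operating systems.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
}

# Matches both "errno = 32" and "errno = : 104" as printed in the p4 output.
ERRNO_RE = re.compile(r"errno\s*=\s*:?\s*(\d+)")


def explain_errno(line):
    """Return (number, name, description) for a log line, or None if it has no errno."""
    match = ERRNO_RE.search(line)
    if match is None:
        return None
    number = int(match.group(1))
    name, text = LINUX_ERRNO.get(number, ("UNKNOWN", "not in this table"))
    return number, name, text


lines = [
    "p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32",
    "rm_28438: p4_error: net_recv read, errno = : 104",
]
for line in lines:
    print(explain_errno(line))
```

Both codes mean the other end of the socket went away, which fits the "Child process exited while making connection" message in outfile.0. They do not by themselves say why the remote process exited.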