On Tue, Feb 15, 2005 at 05:04:45PM +0800, Nigel Teow wrote:
> Hi,
>
> I have installed Condor (version 6.6.8) on a cluster.
>
> I am able to use condor_submit to run the MPI job on a single node, but
> when I try to run it on 2 nodes, it fails. The output files follow:
>
> outfile.0
> -----------
> p0_28434: p4_error: Child process exited while making connection to
> remote process on compute-0-1.local: 0
> p0_28434: (2.007812) net_send: could not write to fd=4, errno = 32
>
> outfile.1
> -----------
> rm_28438: (-) net_recv failed for fd = 3
> rm_28438: p4_error: net_recv read, errno = : 104
>
That looks like the error you get with MPICH 1.2.5 or later. Try building
and running the job with MPICH 1.2.4 instead.
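
For reference, the two errno values in the quoted output can be decoded with a short lookup. This is only a sketch: the table is hard-coded with the Linux numbers for these two codes (an assumption that the cluster runs Linux), and the helper name describe_errno is made up for illustration.

```python
# Linux errno numbers that appear in the quoted p4 output.
# Assumption: the nodes run Linux; other platforms number these differently.
LINUX_ERRNO = {
    32: ("EPIPE", "broken pipe: write to a socket whose peer has gone away"),
    104: ("ECONNRESET", "connection reset by peer"),
}


def describe_errno(code):
    """Return a one-line description of an errno value from the table above."""
    name, meaning = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return f"errno {code} = {name} ({meaning})"


for code in (32, 104):
    print(describe_errno(code))
```

Both codes are consistent with the remote process on compute-0-1.local exiting during startup, so node 0 ends up writing to a dead connection.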
-Erik