I'm trying to get a test MPI-universe job to run but have run into the
following problem. Firstly, I'm using mpich v1.2.4 built with the
following options:
Note that this means we're using ssh for communication. I've set up the
dedicated VM1_USER on each machine to have passwordless ssh access
between all MPI-enabled nodes (by copying the ssh keys), and I can
indeed run jobs as that user using mpirun successfully. The problem
arises when I submit the same job as a Condor job. I've set up a
dedicated submit node as mentioned in the manual, and tried an initial
job which uses just two nodes. The first execute node goes into a
Claimed/Busy state while the second goes into a Claimed/Idle state,
and then the job finally fails with the following sent to the stdout of
the first execute node:
p0_7062: p4_error: Timeout in making connection to remote process on
<2nd execute nodename>: 0
p0_7062: (302.027937) net_send: could not write to fd=4, errno = 32
Now my guess is that the execute nodes are using
$MPI_CONDOR_RSH_PATH/rsh to communicate, whereas our mpich is built to
use ssh, right? Hence, my question is: is there any way we can cajole
Condor into using ssh, or do we have to rebuild our mpich to use rsh? We're
really keen to use ssh if at all possible.