Hi,

I am submitting MPI jobs, using MPICH 1.2.4, through Condor. I have set up a separate user (something like
“condor-user”) to run Condor jobs on all the dedicated nodes. I
created the certificates and copied them to all the nodes, so the user (“condor-user”) can ssh without a
password between all the nodes and to its own node. But the job fails, complaining about a connection
refused to the same machine (i.e., the job runs on Machine A but cannot
connect to Machine A). Here is the error from one of the nodes:

    connect to address xxx.xx.xxx.xx: Connection refused
    connect to address xxx.xx.xxx.xx: Connection refused
    trying normal rsh (/usr/bin/rsh)
    MachineA: Connection refused
    p0_20339: p4_error: Timeout in making connection to remote process on MachineA: 0
    p0_20339: (301.989178) net_send: could not write to fd=4, errno = 32

On which ports do MPI jobs (MPICH 1.2.4) run by default, so that I
can set up firewall rules? I even tried setting the port range like this:

    MPICH_PORT_RANGE=50001:59999

and allowed those ports in the firewall rules, but I am still
getting the connection refused problem. Could you please let me know what might be the problem?

Thanks,
Senthil
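
P.S. Since the log shows a fallback to /usr/bin/rsh, one check that might narrow this down is whether the remote-shell ports accept connections at all. Below is a minimal sketch in Python, assuming the standard ssh (22) and rsh (514) ports. The helper names `port_open` and `check_host` are made up for illustration and are not part of MPICH or Condor, and "MachineA" is just the placeholder host name from the log above.

```python
import socket

# Assumed defaults: 22 is the standard ssh port, 514 the standard rsh ("shell") port.
PORTS_TO_CHECK = {"ssh": 22, "rsh": 514}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


def check_host(host, ports=PORTS_TO_CHECK):
    """Map each service name to whether its port accepts connections on host."""
    return {name: port_open(host, port) for name, port in ports.items()}


# Example (placeholder host name): print(check_host("MachineA"))
```

If the rsh entry comes back False while ssh is True, that would at least be consistent with the "MachineA: Connection refused" line after "trying normal rsh", though it would not by itself say which ports MPICH uses once the processes are started.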