Re: [Condor-users] Not managing to get the parallel universe example from manual section "2.11.2 Parallel Job Submission" to run
- Date: Sat, 18 Feb 2006 17:49:48 +0000
- From: Jean-Alain Grunchec <jgrunche@xxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Not managing to get the parallel universe example from manual section "2.11.2 Parallel Job Submission" to run
Hi,
I noticed that the mpi universe works fine, or at least sort of: it
works as long as I turn off iptables. If iptables is on, I get error
messages in the output files:
outfile.0
p0_1136: p4_error: Timeout in making connection to remote process on
pirineu.cap.ed.ac.uk: 0
p0_1136: (302.006268) net_send: could not write to fd=4, errno = 32
outfile.1
rm_4994: p4_error: rm_start: net_conn_to_listener failed: 33192
I don't know whether there is a way to restrict the range of ports that
(I assume) MPI uses, so that they could be opened in iptables.
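
One possible workaround, if the port range cannot be restricted, would
be to accept all TCP traffic from the other machines in the pool. The
sketch below only prints the iptables commands rather than running
them (running them needs root). print_peer_rules is a hypothetical
helper name, and whether accepting all TCP from a peer is acceptable
depends on the site's security policy, so treat it as an assumption.

```shell
#!/bin/sh
# Sketch: print (do not run) iptables rules that accept TCP from trusted peers.
# ASSUMPTION: opening all TCP ports to each listed peer is acceptable locally.
# print_peer_rules is a hypothetical helper, not part of Condor or MPICH.
print_peer_rules() {
    for peer in "$@"; do
        # -I INPUT inserts the rule at the top of the INPUT chain.
        echo "iptables -I INPUT -s $peer -p tcp -j ACCEPT"
    done
}

# Print the rules for the two machines mentioned in this message.
print_peer_rules ys.cap.ed.ac.uk pirineu.cap.ed.ac.uk
```

Each machine would need the rule for the other one, and the printed
commands would have to be run as root to take effect.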
I also tried to run MPI through the parallel universe, but this does
not work. I used the example mp1script, and set MPDIR to the path of
the bin directory of my MPI distribution.
I get some errors in errfile.0:
connect to address 129.215.191.107 port 544: Connection refused
connect to address 129.215.191.107 port 544: Connection refused
trying normal rsh (/usr/bin/rsh)
pirineu.cap.ed.ac.uk: Connection refused
This puzzled me for a while, since the CONDOR_SSH and P4_RSHCOMMAND
environment variables are defined in mp1script, so I assumed that
condor_ssh would be called instead of rsh. Those variables seem to be
set correctly.
I altered the line in mp1script :
PATH=$MPDIR:.:$PATH
to
PATH=$MPDIR:`condor_config_val libexec`:.:$PATH
so that if rsh is called, it would be found in the Condor libexec
directory.
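
To check which rsh a given PATH would pick up, something like the
following could be used. It is only a sketch: a throwaway directory
stands in for the Condor libexec directory, the stand-in rsh script is
made up for the demonstration, and resolve_rsh is a hypothetical helper
name.

```shell
#!/bin/sh
# Sketch: a directory placed earlier in PATH wins the lookup for rsh.
# ASSUMPTION: demo_dir stands in for the output of `condor_config_val libexec`.
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho "stand-in rsh"\n' > "$demo_dir/rsh"
chmod +x "$demo_dir/rsh"

# resolve_rsh (hypothetical helper): print the rsh that the given PATH finds.
resolve_rsh() {
    # The subshell keeps the PATH change from leaking into the caller.
    ( PATH=$1; command -v rsh )
}

# With demo_dir first, the stand-in is found before /usr/bin/rsh.
resolve_rsh "$demo_dir:$PATH"
```

Running the same check inside mp1script, with the real libexec
directory, would show whether the script's PATH resolves rsh to the
Condor copy or to /usr/bin/rsh.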
After this change there were no more error messages in errfile.0.
But there were still some error messages in outfile.0:
/usr/local/condor/libexec/condor_ssh
running /home/condor/execute/dir_717/simplempi on 2 LINUX ch_p4 processors
Created /home/condor/execute/dir_717/PI760
Starting
p0_844: p4_error: Timeout in making connection to remote process on
pirineu.cap.ed.ac.uk: 0
p0_844: (302.474659) net_send: could not write to fd=4, errno = 32
The first line just comes from an echo $P4_RSHCOMMAND, so this is OK.
However, there are some errors afterwards.
I had a look at the file PI760. It looks a bit like a 'p4pg' (P4 proc
group) file, but I might be wrong. If it is supposed to work like a
p4pg file, there is something which surprised me a bit:
ys.cap.ed.ac.uk 0 /home/condor/execute/dir_717/simplempi
pirineu.cap.ed.ac.uk 1 /home/condor/execute/dir_717/simplempi
The temporary directory dir_717 is indeed the local path of simplempi
on ys.cap.ed.ac.uk, but it is NOT the temporary path on
pirineu.cap.ed.ac.uk, which has a different temporary directory
(dir_4975 or something). So this looks a bit strange, although I assume
it would work with a shared file system.
Therefore I altered mp1script so that it writes a P4 proc group file
(generated from the 'contact' file, in the same way as the 'machine'
file), which is then passed to mpirun with the -p4pg option.
Unfortunately this did not give better results.
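
For reference, the kind of change described above can be sketched as
follows. This is not the exact code I used, only an illustration under
assumptions: the contact file is assumed to hold one line per node in
the form "node hostname port user directory ...", and make_p4pg is a
hypothetical helper name. The output follows the layout seen in PI760,
but with each node's own directory.

```shell
#!/bin/sh
# Sketch: build a p4pg (P4 proc group) file from a Condor-style contact file.
# ASSUMPTION: each contact line is "node hostname port user directory ...".
# make_p4pg is a hypothetical helper, not part of Condor or MPICH.
make_p4pg() {
    contact_file=$1   # path to the contact file
    program=$2        # executable name, relative to each node's directory
    # Sort by node number so that node 0 (the local node) comes first.
    sort -n -k 1 "$contact_file" | awk -v prog="$program" '
        NR == 1 { print $2, 0, $5 "/" prog; next }  # first line: count 0, as in PI760
        { print $2, 1, $5 "/" prog }                # other nodes: count 1
    '
}
```

In the script this would be used roughly as
make_p4pg contact simplempi > p4pg.txt, followed by an mpirun call
with -p4pg p4pg.txt; the file names here are only placeholders.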
I wonder if there is something I need to investigate more carefully.
If anybody has managed to run MPI through the parallel universe on a
network of desktop workstations without a shared file system, I would
be interested in the kind of scripts they use.
Thanks,