Mailing List Archives
[Condor-users] Re: Condor with MPI-over-ssh
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Wed, Feb 16, 2005, Mark Calleja wrote:
>
> Hence note we're using ssh to communicate over. Now, I've set up the
> dedicated VM1_USER on each machine to have passwordless ssh access
> between all MPI enabled nodes (by copying the ssh keys), and I can
> indeed run jobs as that user using mpirun successfully.
Sounds familiar to me. I had to learn that Condor doesn't use mpirun
at all, but has its own startup mechanism.
> The problem
> arises when I submit the same job as a Condor job. I've set up a
> dedicated submit node as mentioned in the manual, and tried an initial
> job with just two nodes. The first execute node goes into a
> Claimed/Busy state while the second goes into a Claimed/Idle state
> before finally the job fails with the following sent to the stdout of
> the first execute node:
I'll just copy and paste a reply I got from Erik Paulsen on this list:
- -------------
> p0_2957: p4_error: Child process exited while making connection to
> remote process on c029.cip.physik.local: 0
> p0_2957: (6.333597) net_send: could not write to fd=4, errno = 32
>
> As far as I understand, Condor uses /home/condor/condor/sbin/rsh to
> start the job, right? This doesn't work because, for security reasons,
> rsh is not allowed here.
You'll note that /home/condor/condor/sbin/rsh is not really rsh, it's
just named rsh. It does not have the security problems of the Berkeley
rsh.
> So I set up ssh:
>
> bash-2.05b$ whoami
> condor
> bash-2.05b$ ssh c029 date
> Tue Feb 1 12:41:20 CET 2005
>
> and linked it there, but this didn't help either.
>
That was your mistake. Put the condor program named 'rsh' back.
> So
> a) how do I tell Condor where to look for MPI?
> b) how do I tell Condor to use ssh?
>
a) you don't need to
b) you can't
Link your job with MPICH 1.2.4 for the ch_p4 device. Condor does not
need any MPI runtime support (we don't use mpirun).
- ------------------------
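To make the quoted advice a bit more concrete, here is a minimal sketch of
a submit description for Condor's MPI universe, generated from Python so it
is easy to adapt. The executable name, the output/error/log file names and
the machine count are placeholders I made up for illustration; only the
submit commands themselves (universe, executable, machine_count, output,
error, log, queue) come from the Condor manual. Note that there is no mpirun
and no ssh anywhere in it.

```python
# Sketch only: builds the text of a submit description file for a
# Condor MPI universe job. All file names and the executable name are
# hypothetical placeholders, not taken from the original thread.

def mpi_submit_description(executable, machine_count, log="mpi_job.log"):
    """Return the text of a submit description for an MPI universe job."""
    lines = [
        "universe = MPI",
        f"executable = {executable}",        # must be linked against MPICH ch_p4
        f"machine_count = {machine_count}",  # number of nodes to claim
        "output = outfile.$(NODE)",          # $(NODE) expands to the node number
        "error = errfile.$(NODE)",
        f"log = {log}",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a description for a two-node job, as in the original question.
    print(mpi_submit_description("my_mpi_prog", 2), end="")
```

The printed text would be saved to a file and handed to condor_submit from
the dedicated submit node; whether the job then starts still depends on the
executable being linked as described above.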
By the way, when you compile MPICH 1.2.4, do you get any errors during
'make testing'? If not, how did you build it, and on what kind of system?
On Mandrake 10.0 I always get:
**** Testing I/O functions ****
c074 : So Feb 6 17:55:01 CET 2005
/tmp/mpi/mpich-1.2.4/bin/mpicc -c simple.c
/tmp/mpi/mpich-1.2.4/bin/mpicc -o simple simple.o
/tmp/mpi/mpich-1.2.4/lib/libpmpich.a(getpname.o)(.text+0x13): In function `PMPI_Get_processor_name':
: undefined reference to `MPID_Node_name'
collect2: ld returned 1 exit status
make[3]: *** [simple] Error 1
Could not build executable simple; aborting tests
make[2]: [testing] Error 1 (ignored)
End of testing in directory io
Running tests in directory command
End of testing in directory command
c074:mpich-1.2.4>
- --
________ This message is made of 100 % recycled electrons
\..| PGP Key: www.stud.uni-goettingen.de/~s242275/pgpkey.pub (o_
.\.|-- Jabber: te_linuxguru at jabber.fsinf.de (o (o //\
..\|____ ICQ: 124557012 (/)_(/)_V_/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
iD8DBQFCEwq7KpMiVYRJv9YRAidnAJ0Vepb62rUMiFnM1QujBxA7u62OHgCeIXZa
Ol3Vw5JeU4tnb3xEvOSB3FM=
=gpIw
-----END PGP SIGNATURE-----