Re: [HTCondor-users] Running distributed MPI jobs and ssh settings
- Date: Wed, 06 Mar 2013 09:12:33 -0500
- From: Jon Thomas <jthomas@xxxxxxxxxx>
On Tue, 2013-03-05 at 12:43 -0300, Manuel Ferreria wrote:
> Hello,
>
>
> I've been trying to set up a virtual condor cluster, to better
> understand how to administer it. I've been running into this problem
> and can't seem to wrap my head around it.
>
>
> I have 2 nodes set up, "head.cluster" and "c1.cluster" (both are mapped
> using the hosts file to the corresponding IPs). head.cluster is both the
> submitter, the schedd and a compute node. c1.cluster is just a compute
> node. I've set up passwordless ssh login between both machines (for my
> local user at least). Both VMs have 4 cores each, so I have in total
> 8 compute CPUs.
>
>
> I made a distributed hello world application (just printing the rank).
> I compile it with mpicc (which uses a user built openMPI) and then run
> condor_submit.
>
>
> Now, if the job is scheduled to run on a single machine (head.cluster
> OR c1.cluster) the job runs ok and prints the expected output. BUT if the
> job runs distributed, as in 1 process on head.cluster and 1 on
> c1.cluster, the job completes but mpirun crashes with:
>
>
> --------------------------------------------------------------------------
> A daemon (pid 2371) died unexpectedly with status 255 while attempting
> to launch so we are aborting.
Any idea what pid 2371 was? sshd?
>
>
> There may be more information reported by the environment (see above).
>
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
>
>
>
>
> I've checked LD_LIBRARY_PATH and it's exactly the same on both
> machines. The curious thing is if I manually run mpirun with the same
> hostfile generated by the openMPI script (but without the orte and
> condor_sshd parameter) it works fine.
>
>
> If I remove those parameters in the openmpiscript called by the submit
> file (to make it exactly the same as what I am running outside of condor),
> I get the same error preceded by
> ----
> Could not create directory '/.ssh'.
> No protocol specified
>
Sounds like you are trying to set up ssh keys despite already having
passwordless ssh configured. I'd have to see the openmpiscript and the
submit description file (sdf) to say much more.
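
The "Could not create directory '/.ssh'" line suggests that ssh, when
started from inside the job, resolves the home directory to "/", which
would also explain why your own keys and known_hosts are not found. One
way to check is to submit a small diagnostic as the job executable and
compare its output with an interactive run. A rough sketch, assuming
Python is available on the execute nodes; the function name
describe_job_environment is made up for this example, and the assumption
that ssh takes the home directory from the passwd entry rather than from
$HOME should be verified on your systems:

```python
import os
import pwd
import socket


def describe_job_environment(environ=None):
    """Collect the settings that ssh and mpirun are likely to depend on."""
    environ = os.environ if environ is None else environ
    try:
        entry = pwd.getpwuid(os.getuid())
        user, passwd_home = entry.pw_name, entry.pw_dir
    except KeyError:
        # The uid has no passwd entry on this machine.
        user, passwd_home = str(os.getuid()), None
    return {
        "host": socket.gethostname(),
        "user": user,
        # Assumption: this is the directory ssh uses to locate ~/.ssh.
        "passwd_home": passwd_home,
        "HOME": environ.get("HOME"),
        "DISPLAY": environ.get("DISPLAY"),
        "LD_LIBRARY_PATH": environ.get("LD_LIBRARY_PATH"),
    }


if __name__ == "__main__":
    for key, value in describe_job_environment().items():
        print(f"{key}: {value}")
```

If the job reports a different user or home directory than your login
shell does, then the passwordless ssh you configured for your local user
is not the one being used inside the job.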
>
> (gnome-ssh-askpass:2825): Gtk-WARNING **: cannot open display: :0.0
> Host key verification failed.
>
>
> ----------------
>
>
>
>
> I am guessing it's some ssh setting I haven't configured properly,
> but I can't put my finger on it. Does anyone have an idea?
>
>
> Regards,
>
>
> Manuel Ferreria