
[HTCondor-users] openmpi jobs with condor [on debian] (general part)



Hello all,

here is how I managed to get openmpi running in the parallel universe.
The InfiniBand part will follow.

Howto openmpi with htcondor (general part)

We use htcondor 8.4.x with debian 7 and debian 8. openmpi 1.6.5 is used
mostly with debian 7; with debian 8 we have only tested a small openmpi
example.
We use a common file system for all nodes. htcondor claims it also works
without one (but is this really useful?).
Requirements:
- Setup your htcondor environment for parallel jobs (see manual section 2.9)
- A working openmpi installation (test it on a single node or in the vanilla
  universe [section 2.9.4])
- ssh client and server on each node.

In my understanding, htcondor just claims the needed slots, prepares and
starts the sshd on the nodes running the job and then just starts mpirun.
This is done by the openmpiscript (see section 2.9.3) and other scripts.
From htcondor 8.6.1 on, these scripts are improved and condor variables can
be set which are used by openmpiscript. In earlier versions one has to change
the openmpiscript directly.

What do I have to do to get openmpi running?
Change the openmpiscript:
1. the openmpiscript is a bash script, therefore make sure that bash, not sh,
is used to run it;
for example use
#!/bin/bash
not
#!/bin/sh
debian often uses dash as system shell, which is not fully bash compatible.
My suggestion is that condor uses bash explicitly for all scripts, at least
for those that are not fully bourne shell compatible and therefore need bash,
not sh.
Is there any system where bash could not be installed under /bin/bash?

2. change MPDIR to the prefix dir of your openmpi
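
Before editing the two points above, it can help to check what the node
actually provides. The following is only a small sketch: the helper names
sh_target and prefix_of are made up for illustration, and prefix_of assumes
the usual <prefix>/bin/mpirun layout, which may not match every installation.

```shell
#!/bin/bash
# Sketch only: two illustrative helpers, not part of openmpiscript.

# Print the real interpreter behind /bin/sh
# (on a default debian install this is typically dash).
sh_target() {
    readlink -f /bin/sh
}

# Derive an installation prefix from the full path of an executable,
# assuming the layout <prefix>/bin/<program>.
prefix_of() {
    dirname "$(dirname "$1")"
}

# Possible use on a node (mpirun must be in PATH):
#   sh_target
#   prefix_of "$(command -v mpirun)"
```

The printed prefix is only a candidate value for MPDIR; compare it with the
directory that really contains the bin and lib directories of your openmpi.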

Take into account:
The scripts will run into problems if you add a path for your program (in the
argument for openmpiscript).
Therefore put the program into your working directory and submit from there.
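
As an illustration of this restriction, here is a sketch that writes a
minimal parallel universe submit file and refuses a program name that
contains a path. The function name write_submit_file, the file names and the
machine count are assumptions chosen for the example, and it assumes that a
copy of the (modified) openmpiscript sits in the submit directory.

```shell
#!/bin/bash
# Sketch only: generate a minimal submit file for an openmpi job.
# Usage: write_submit_file <program> <machine_count> <submit_file>

write_submit_file() {
    program=$1
    machines=$2
    outfile=$3

    # The program must be given without any path component.
    case "$program" in
        */*)
            echo "error: give the program name without a path: $program" >&2
            return 1
            ;;
    esac

    # \$(NODE) is left for condor to expand to the node number.
    cat > "$outfile" <<EOF
universe      = parallel
executable    = openmpiscript
arguments     = $program
machine_count = $machines
output        = out.\$(NODE)
error         = err.\$(NODE)
log           = job.log
queue
EOF
}

# Possible use, run from the directory that holds the program:
#   write_submit_file my_mpi_program 4 mpi_job.sub
#   condor_submit mpi_job.sub
```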


Improvements:
We have often seen that after condor_rm the mpi processes were still running
although the parallel job had been removed from condor.
Following the philosophy of condor that mpirun has to do the job, we start
mpirun in the background and wait for this process. This allows us to install
a signal handler with trap, which sends a TERM signal to mpirun after the
openmpiscript gets the TERM signal.
With this signal handler we have never seen the problem above. I do not know
if and why the condor team thinks this is not necessary, but at least it
works for us.
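
The idea can be sketched roughly as follows. This is a minimal illustration
of the background/wait/trap pattern, not the exact change made to
openmpiscript; the names run_and_forward_term, forward_term and child_pid are
made up for the example.

```shell
#!/bin/bash
# Sketch only: start a command in the background, forward TERM to it,
# and return the exit status of that command.

child_pid=""

forward_term() {
    # Pass the TERM signal on to the background process, if there is one.
    if [ -n "$child_pid" ]; then
        kill -TERM "$child_pid" 2>/dev/null || true
    fi
}

run_and_forward_term() {
    local status=0

    trap forward_term TERM
    "$@" &
    child_pid=$!

    # A trapped signal interrupts wait; the handler then runs.
    wait "$child_pid" || status=$?

    # If the child still exists (wait was interrupted), wait again to
    # collect its real exit status.
    if kill -0 "$child_pid" 2>/dev/null; then
        status=0
        wait "$child_pid" || status=$?
    fi

    trap - TERM
    child_pid=""
    return "$status"
}

# In openmpiscript this would wrap the existing mpirun call, for example:
#   run_and_forward_term mpirun <the arguments the script already passes>
```

The second wait matters because the shell returns from the first one as soon
as the signal arrives, while mpirun may still need time to shut down its
processes.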

Best regards
Harald