[HTCondor-users] htcondor + gpudirect + openmpi
- Date: Tue, 05 Sep 2017 21:04:20 +0200
- From: Harald van Pee <pee@xxxxxxxxxxxxxxxxx>
- Subject: [HTCondor-users] htcondor + gpudirect + openmpi
Dear all,
we want to use HTCondor 8.6.5 on a GPU cluster with Open MPI in the parallel
universe.
Our main task will be to run Open MPI with up to 16 GPUs on nodes that have 4
or 8 GPUs installed.
To benefit from the P2P connection on the board, we want to have 4 or 8 MPI
processes running on one machine, not distributed over the whole cluster.
If we use, for example,
universe = parallel
executable = /mpi/openmpiscript
arguments = a.out
machine_count = 2
request_cpus = 4
request_gpus = 4
the slots are reserved correctly, but openmpiscript ignores the CPU request and
starts 2 MPI processes in total, not 4 on each node used.
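If I understand it correctly, the mpirun call inside openmpiscript starts one
rank per machine-file entry, so presumably it would have to be told the
per-node count explicitly, for example with Open MPI's -npernode option (just
a guess on my side, not tested):

# sketch: 4 ranks on every listed host instead of 1 per machine-file entry
mpirun -npernode 4 -hostfile machines ./a.out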
If I just copy the hosts 4 times,
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' > machines
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
and use
mpirun ... -n 8 -hostfile machines ...
the a.out processes are started, 4 on each machine, but all 4 processes are
bound to the same core.
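I suppose one can check the placement with Open MPI's --report-bindings
option:

# prints to stderr which cores each rank is bound to
mpirun -n 8 -hostfile machines --report-bindings ./a.out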
How can I arrange for 4 a.out processes to run on each machine and use 4 cores
in total, or even more if each of them uses threads?
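My guess is that mpirun's mapping and binding options are the way to do this,
perhaps combined with a hostfile that advertises the real slot count per host
instead of duplicated host lines (option and hostfile syntax taken from the
Open MPI 2.x documentation, not tested here):

# one line per host with 4 slots each, instead of duplicating the hosts
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' | uniq | \
    sed 's/$/ slots=4/' > machines
# 4 ranks per node, each bound to its own core
mpirun -n 8 -hostfile machines --map-by ppr:4:node --bind-to core ./a.out
# if each rank runs e.g. 2 threads, give each rank 2 cores (pe=2);
# that would need request_cpus = 8 instead of 4 in the submit file
mpirun -n 8 -hostfile machines --map-by ppr:4:node:pe=2 --bind-to core ./a.out

Or is there a recommended way to do this with htcondor?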
Best
Harald