Dear all,
we want to use HTCondor 8.6.5 on a GPU cluster with Open MPI in the parallel
universe.
Our main task will be to run Open MPI jobs with up to 16 GPUs on nodes that
have 4 or 8 GPUs installed.
To benefit from the peer-to-peer (P2P) connections between GPUs on the same
board, we want 4 or 8 MPI processes running on one machine rather than
distributed over the whole cluster.
If we use, for example,
universe = parallel
executable = /mpi/openmpiscript
arguments = a.out
machine_count = 2
request_cpus = 4
request_gpus = 4
the slots are reserved correctly, but openmpiscript ignores the CPU request
and starts only 2 MPI processes in total instead of 4 on each of the nodes
used.
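One idea would be to write the slot count into the hostfile directly, since
Open MPI's hostfile format accepts a "slots=N" suffix per host. An untested
sketch, with 4 hard-coded to match request_cpus:

sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1 " slots=4"}' > machines
# gives e.g. "node01 slots=4", so mpirun knows it may place 4 ranks per host
mpirun -n 8 -hostfile machines a.out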
If I instead just copy the hosts 4 times
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' > machines
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
and use
mpirun ... -n 8 -hostfile machines ...
the a.out processes are started, 4 on each machine, but all 4 processes are
bound to the same core.
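To me this looks like a process-binding problem on the mpirun side: with no
slot information, Open MPI apparently cannot spread the local ranks. A sketch
of what I would try next (flag spellings from Open MPI 1.8+; older releases
use different names):

# give each rank its own core
mpirun -n 8 -hostfile machines --map-by slot --bind-to core a.out
# or leave binding entirely to the application, e.g. for threaded ranks
mpirun -n 8 -hostfile machines --bind-to none a.out
# "ppr" (processes per resource) places a fixed number of ranks per node
# without relying on the slot counts in the hostfile
mpirun --map-by ppr:4:node --bind-to core -hostfile machines a.out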
How can I arrange for 4 a.out processes to run on each machine and use 4
cores in total, or even more if each of them uses threads?
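For checking what a given flag combination actually does, mpirun's
--report-bindings option should show which cores each rank was bound to:

mpirun -n 8 -hostfile machines --report-bindings a.out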
Best
Harald