Re: [Condor-users] How to get the machine file in parallel jobs
- Date: Mon, 27 Aug 2012 13:36:23 +0200
- From: Imre Szeberenyi <szebi@xxxxxxxxxx>
- Subject: Re: [Condor-users] How to get the machine file in parallel jobs
Hi Chunbao,
I had the same problem before.
I could not find suitable scripts for submitting jobs in an Open MPI environment.
I found one in the share/doc/condor-7.8.1/etc/examples/ directory which
installs ssh daemons on the remote machines, but I cannot use it in an SMP
environment.
Eventually I found a solution; maybe it will help you.
- I created a shell script which collects the host info from the job status and
creates a host file containing job IDs and slot numbers for starting mpirun.
- I force mpirun to use condor_ssh_to_job. The only problem is that mpirun
checks the format of the host file, and if an entry starts with numbers it
assumes it is an IP address. So I prepend a constant string to the job
IDs, and a wrapper removes the constant string before it starts
condor_ssh_to_job.
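As a minimal sketch of the prefix trick described above (not the attached wrapper itself): the host entry below is a made-up example, and strip_condor_prefix is a hypothetical helper name. It only shows how an entry of the form HostName-CONDOR-JobID can be turned back into the job ID that condor_ssh_to_job expects.

```shell
#!/bin/bash
# Hypothetical hostfile entry: host name, the constant marker, then the job ID.
# Because it starts with a host name, mpirun does not mistake it for an IP address.
entry="node01.example.org-CONDOR-123.1.0"

# Remove everything up to and including the last "-CONDOR-" marker,
# leaving only the job ID (hypothetical helper, not part of the attached scripts).
strip_condor_prefix() {
    printf '%s\n' "${1##*-CONDOR-}"
}

job_id=$(strip_condor_prefix "$entry")
echo "$job_id"    # prints 123.1.0
```

The attached wrapper does the same thing with sed; parameter expansion is used here only to keep the sketch free of external commands.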
I have attached my scripts; I hope you find them useful as well.
If you are using openmpi-1.4, change the last command of the
condor_openmpi.sh script to
exec $MPIRUN --prefix $MPI_HOME --mca plm_rsh_agent $_CONDOR_SSH_TO_JOB_WRAPPER \
--hostfile $_CONDOR_PARALLEL_HOSTS_FILE "$@"
Best,
Imre
On 2012.08.26 at 15:00, miaocb@xxxxxxx wrote:
Hi All,
I successfully configured condor to run parallel jobs, but I can't figure out how to get a machine file that can be used by mpiexec or mpirun to start MPI jobs. Is there an environment variable that refers to the machine file?
thanks
Chunbao Miao
#**************************************************************
# mpimimd.job:
#
# submitting MPI programs in SIMD/MIMD model
#**************************************************************
universe = parallel
# name of the job
JOBNAME = mpimimd
# helper script for starting openmpi programs
executable = condor_openmpi.sh
# the script passes all its arguments to the mpirun command
# e.g. interface parameters
IF= -mca btl_openib_if_include mlx4_0:1
# and number of requested master processes
MNUM = 1
# and the name of the executable
MPRG = /bin/date
# and number of requested worker processes
WNUM = 3
# and the name of the executable
WPRG = /bin/hostname
arguments = $(IF) -np $(MNUM) $(MPRG) : -np $(WNUM) $(WPRG)
# the name of the hostfile generated for the mpirun command
# (the default is 'parallel_hosts')
environment = _CONDOR_PARALLEL_HOSTS_FILE=$(JOBNAME).hosts
# standard output, error and log
output = $(JOBNAME).out
error = $(JOBNAME).err
log = $(JOBNAME).log
# requirement 1: the master should run on the HEAD_NODE
machine_count = 1
requirements = ( machine == HEAD_NODE )
queue
# the workers' stdout and stderr are redirected by mpirun
# (no need for additional redirect by condor)
output = /dev/null
error = /dev/null
# requirement 2: the workers should not run on the HEAD_NODE
machine_count = 3
requirements = ( machine =!= HEAD_NODE )
queue
#!/bin/bash
##**************************************************************
## condor_ssh_to_job_wrapper.sh:
## Created by I.Sz. <szebi@xxxxxxxxxx> BME-IIT 2012.07.17
## This is an ssh wrapper for the mpirun command.
## It deletes the .*-CONDOR- prefix from the hostname (first)
## argument and invokes the condor_ssh_to_job command.
##**************************************************************
#
arg1=$1; shift
# strip the prefix; quote the expansions so arguments are passed through unchanged
arg1=$(sed 's/^.*-CONDOR-//' <<< "$arg1")
exec condor_ssh_to_job "$arg1" "$@"
#!/bin/bash
##**************************************************************
## condor_parallel_hosts.sh
## Created by I.Sz. <szebi@xxxxxxxxxx> BME-IIT 2012.07.17
## Functions for collecting host and job information about the running parallel job.
## Function CONDOR_GET_PARALLEL_HOSTS_INFO creates a hostfile including contact info for remote hosts
## Usage: Source the script and use the CONDOR_GET_PARALLEL_HOSTS_INFO function
##**************************************************************
# Defaults for error testing
: ${_CONDOR_PROCNO:=0}
: ${_CONDOR_NPROCS:=1}
: ${_CONDOR_MACHINE_AD:="None"}
: ${_CONDOR_JOB_AD:="None"}
##**************************************************************
## Usage: CONDOR_GET_PARALLEL_HOSTS_INFO [hostfile]
## If hostfile is omitted, 'parallel_hosts' is used.
## Return:
## The function returns its status on the main process (_CONDOR_PROCNO==0).
## The function never returns on the other nodes (it sleeps).
## The created file structure:
## HostName1'-CONDOR-'ClusterId.ProcId.SubProcId 'slots='Allocated_CPUs 'max_slots='Allocated_CPUs
## HostName2'-CONDOR-'ClusterId.ProcId.SubProcId 'slots='Allocated_CPUs 'max_slots='Allocated_CPUs
## HostName3'-CONDOR-'ClusterId.ProcId.SubProcId 'slots='Allocated_CPUs 'max_slots='Allocated_CPUs
## ...
##**************************************************************
function CONDOR_GET_PARALLEL_HOSTS_INFO() {
# getting parameters if _CONDOR_PARALLEL_HOSTS_FILE not set
: ${_CONDOR_PARALLEL_HOSTS_FILE:=$1}
# setting defaults
: ${_CONDOR_PARALLEL_HOSTS_FILE:=parallel_hosts}
local hostname=`hostname -f`
if [ $_CONDOR_PROCNO -eq 0 ]; then
# collecting info on the main proc
clusterid=`CONDOR_GET_JOB_ATTR ClusterId`
local ret=$?
if [ $ret -ne 0 ]; then
echo Error: get_job_attr ClusterId
return 1
fi
local line=""
condor_q -l $clusterid | \
awk '/^ProcId.=/ { ProcId=$3 } \
/^ClusterId.=/ { ClusterId=$3 } \
/^RequestCpus.=/ { RequestCpus=$3 } \
/^RemoteHosts.=/ { RemoteHosts=$3 } \
/^$/ { if (ClusterId != 0) print ClusterId" "ProcId" "RequestCpus" "RemoteHosts }' | \
while read line; do
CONDOR_PRINT_HOSTS $line
done | sort -d > ${_CONDOR_PARALLEL_HOSTS_FILE}
else
# endless loop on the workers
while true ; do
sleep 30
done
fi
return 0
}
## Helper fn for getting specific machine attributes from $_CONDOR_MACHINE_AD
function CONDOR_GET_MACHINE_ATTR() {
local attr="$1"
awk '/^'"$attr"'[[:space:]]+=[[:space:]]+/ \
{ ret=sub(/^'"$attr"'[[:space:]]+=[[:space:]]+/,""); print; } \
END { exit 1-ret; }' $_CONDOR_MACHINE_AD
return $?
}
## Helper fn for getting specific job attributes from $_CONDOR_JOB_AD
function CONDOR_GET_JOB_ATTR() {
local attr="$1"
awk '/^'"$attr"'[[:space:]]+=[[:space:]]+/ \
{ ret=sub(/^'"$attr"'[[:space:]]+=[[:space:]]+/,""); print; } \
END { exit 1-ret; }' $_CONDOR_JOB_AD
return $?
}
## Helper fn for printing the host info
function CONDOR_PRINT_HOSTS() {
local clusterid=$1
local procid=$2
local reqcpu=$3
local rhosts=$4
tr ',"' '\n' <<< $rhosts | grep -v $hostname | \
awk '{ sub(/slot.*@/,""); if ($1 != "") { slots[$1]+='$reqcpu'; subproc[$1]=id++; } } \
END { for (i in slots) print i"-CONDOR-"'$clusterid'".1."subproc[i]" slots="slots[i]" max_slots="slots[i]; }'
}
#!/bin/bash
##**************************************************************
## condor_openmpi.sh:
## Created by I.Sz. <szebi@xxxxxxxxxx> BME-IIT 2012.07.17
## This is a script to run openmpi jobs under the Condor parallel universe.
## Collects the host and job information into $_CONDOR_PARALLEL_HOSTS_FILE
## and executes
## $MPIRUN --prefix $MPI_HOME --hostfile $_CONDOR_PARALLEL_HOSTS_FILE $@
## command
## The default value of _CONDOR_PARALLEL_HOSTS_FILE is 'parallel_hosts'
##
## The script assumes:
## On the head node (_CONDOR_PROCNO == 0) :
## * $MPIRUN points to the mpirun command
## * condor_ssh_to_job command is working (run as owner is true)
## * condor_parallel_hosts.sh and condor_ssh_to_job_wrapper.sh scripts
## are available and installed in the condor libexec dir.
## On all nodes:
## * openmpi is installed into the $MPI_HOME directory
##**************************************************************
#----------------------------
MPIRUN=mpirun
MPI_HOME=/usr/lib64/openmpi
#----------------------------
_CONDOR_LIBEXEC=`condor_config_val libexec`
_CONDOR_PARALLEL_HOSTS=$_CONDOR_LIBEXEC/condor_parallel_hosts.sh
_CONDOR_SSH_TO_JOB_WRAPPER=$_CONDOR_LIBEXEC/condor_ssh_to_job_wrapper.sh
# Source the condor_parallel_hosts.sh script
. $_CONDOR_PARALLEL_HOSTS
# Creates parallel_hosts file containing contact info for hosts
# Returns on head node only
CONDOR_GET_PARALLEL_HOSTS_INFO
ret=$?
if [ $ret -ne 0 ]; then
echo "Error: $ret creating $_CONDOR_PARALLEL_HOSTS_FILE"
exit $ret
fi
# Starting mpirun cmd
exec $MPIRUN --prefix $MPI_HOME --mca orte_rsh_agent $_CONDOR_SSH_TO_JOB_WRAPPER \
--hostfile $_CONDOR_PARALLEL_HOSTS_FILE "$@"
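The slot-counting step in CONDOR_PRINT_HOSTS can be tried in isolation, without a running Condor pool. The sketch below is a simplified stand-in, not the attached function: count_slots is a hypothetical name, the RemoteHosts value is invented, and the job-ID part of each hostfile line is left out so that only the per-host aggregation is shown.

```shell
#!/bin/bash
# Invented RemoteHosts value, including the surrounding quotes found in a job ad.
rhosts='"slot1@node01.example.org,slot2@node01.example.org,slot1@node02.example.org"'

# Hypothetical helper: split the list, drop the "slotN@" part, and add up
# the requested CPUs per host name.
count_slots() {
    local rhosts=$1 reqcpu=$2
    tr ',"' '\n\n' <<< "$rhosts" |
        awk -v cpus="$reqcpu" '
            { sub(/slot.*@/, "") }            # keep only the host name
            $1 != "" { slots[$1] += cpus }    # skip the empty lines left by the quotes
            END { for (h in slots) print h " slots=" slots[h] }' |
        sort
}

count_slots "$rhosts" 1
```

With one CPU requested per slot, node01.example.org is counted twice and node02.example.org once, which is the information mpirun needs in the slots= and max_slots= fields of the hostfile.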