openmpiscript (from the example):
#!/bin/sh
MPDIR=/usr/lib/openmpi
if `uname -m | grep "64" 1>/dev/null 2>&1`
then
    MPDIR=/usr/lib64/openmpi
fi
PATH=$MPDIR/lib:$MPDIR/1.4-gcc/bin:.:$PATH
export PATH
_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS
CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh
SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh
. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
# If not the head node, just sleep forever, to let the sshds run
if [ $_CONDOR_PROCNO -ne 0 ]
then
    wait
    sshd_cleanup
    exit 0
fi
EXECUTABLE=$1
shift
chmod +x $EXECUTABLE
CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE
# Added for Debug
echo "Contact File: ${CONDOR_CONTACT_FILE}"
cat ${CONDOR_CONTACT_FILE}
# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
# Added for Debug
echo "Machines"
cat machines
## run the actual mpijob
if `ompi_info --param all all | grep orte_rsh_agent 1>/dev/null 2>&1`
then
    echo "IF" # Added for Debug
    mpirun -v --prefix $MPDIR --mca orte_rsh_agent $CONDOR_SSH -n $_CONDOR_NPROCS -hostfile machines $EXECUTABLE $@
else
    ########## For mpi versions 1.1 & 1.2 use the line below
    echo "ELSE" # Added for Debug
    mpirun -v --mca plm_rsh_agent $CONDOR_SSH -n $_CONDOR_NPROCS -hostfile machines $EXECUTABLE $@
fi
sshd_cleanup
rm -f machines
exit $?
******************************************************
After reading the docs and the error message, we checked the sshd.sh file on the worker node and found this (line 125):
if [ $_CONDOR_PROCNO -eq 0 ]
Line 113 has this:
echo "$_CONDOR_PROCNO $hostname $PORT $user $currentDir $thisrun" |
    $CONDOR_CHIRP put -mode cwa - $_CONDOR_REMOTE_SPOOL_DIR/contact
To check the output, we changed it to this:
echo "$_CONDOR_PROCNO N $_CONDOR_NPROCS $hostname $PORT $user $currentDir $thisrun" |
    $CONDOR_CHIRP put -mode cwa - $_CONDOR_REMOTE_SPOOL_DIR/contact
The output shows that $_CONDOR_PROCNO is not a number but the executable's name, and that $_CONDOR_NPROCS is empty.
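One possible explanation, offered only as an unconfirmed sketch: if $_CONDOR_PROCNO and $_CONDOR_NPROCS are empty when openmpiscript runs, the unquoted arguments in ". $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS" vanish, and a sourced file then sees the caller's own positional parameters. The example below assumes that sshd.sh reads its inputs from $1 and $2 (an assumption, not checked against the installed file). The temporary file and the name my_mpi_program are hypothetical stand-ins.

```shell
#!/bin/sh
# Hypothetical stand-in for sshd.sh: ASSUMES it reads its inputs from $1 and $2.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
SEEN_PROCNO=$1
SEEN_NPROCS=$2
EOF

# Simulate openmpiscript being started as: openmpiscript my_mpi_program
set -- my_mpi_program

# Simulate the two Condor variables being absent from the job environment.
unset _CONDOR_PROCNO _CONDOR_NPROCS

# Unquoted empty expansions disappear, so this is effectively ". $tmp";
# the sourced file then sees the caller's positional parameters.
. "$tmp" $_CONDOR_PROCNO $_CONDOR_NPROCS

echo "procno='$SEEN_PROCNO' nprocs='$SEEN_NPROCS'"
rm -f "$tmp"
```

If this reproduces the symptom, the next thing to check would be whether the two variables are actually set in the job's environment on the worker node, for example by printing them at the top of openmpiscript.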
Can anyone help us solve this issue? Any ideas?
Thank you very much.
--
Edier Alberto Zapata Hernández
Ingeniero de Soporte en Infraestructura
CIER - Sur