Hello,
I need some help understanding how the parallel universe interacts with a shared file system. I currently have a pool of machines that NFS mount a 5.3 TByte file system for users to run their jobs out of. I am now able to run MPI/parallel jobs across the pool, but I noticed something odd about the file system behavior. I previously reported a chirp error in my parallel environment, and to fix it I was told to put the following entries in my submit description file:
when_to_transfer_output = on_exit
should_transfer_files = yes
From the NFS mounted scratch directory I issued the condor_submit command in the parallel universe. The job failed and reported an error to the effect that the executable was nowhere to be found, even though it existed in the directory I had submitted from. I read the docs, added the following line to the submit script, and the job began working:
transfer_input_files = xhpl,HPL.dat
I also had to put the full path to the executable on the arguments line, because the mp1script I am using reported that it couldn't find the xhpl binary either.
arguments = /condor_scratch/rnclear/hpl/bin/Linux_P4_goto/xhpl
initialdir = /condor_scratch/rnclear/hpl/bin/Linux_P4_goto
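Putting all of that together, the relevant part of my submit description file now looks roughly like this (the machine_count, output/error/log names, and queue statement below are reconstructed from memory rather than copied verbatim):

# machine_count and the output/error/log lines here are illustrative
universe                = parallel
executable              = mp1script
arguments               = /condor_scratch/rnclear/hpl/bin/Linux_P4_goto/xhpl
initialdir              = /condor_scratch/rnclear/hpl/bin/Linux_P4_goto
transfer_input_files    = xhpl,HPL.dat
should_transfer_files   = yes
when_to_transfer_output = on_exit
machine_count           = 4
output                  = ross_output_$(Node).out
error                   = ross_error_$(Node).out
log                     = hpl.log
queue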
This is the contents of my mp1script:
#!/bin/sh -x
_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS
CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh
SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh
. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
# If not the head node, just sleep forever, to let the
# sshds run
if [ $_CONDOR_PROCNO -ne 0 ]
then
    wait
    sshd_cleanup
    exit 0
fi
EXECUTABLE=$1
shift
# The binary is copied but the executable flag is cleared,
# so the script has to take care of this.
chmod +x $EXECUTABLE
# Set this to the bin directory of MPICH installation
MPDIR=/usr/local/mpi/mpich/32Bit/1.2.4/gcc-3.4.3/bin/
PATH=$MPDIR:.:$PATH
export PATH
export P4_RSHCOMMAND=$CONDOR_SSH
CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE
# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
## run the actual mpijob
##sleep 120
mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@
sshd_cleanup
#rm -f machines
exit $?
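To make the sort/awk line near the end of the script concrete: the contact file has one line per node, with the node number in the first field and the machine name in the second, and the machines file ends up being just the host column in node order. Something like this (the second hostname and the omitted fields are made up for illustration):

# contact (hostnames partly invented; remaining fields omitted)
0 sahp4335 ...
1 sahp4336 ...

# machines, as produced by the sort/awk line above
sahp4335
sahp4336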
I poked around a little more and noticed that on the nodes where the job was running, the execute directories below contained my binary and the input file. Also located there were the error/output files that were copied back when the job finished.
sahp4335[rnclear]: cd /opt/condor/local.sahp4335/execute/
sahp4335[rnclear]: ls
dir_19973 dir_19974
sahp4335[rnclear]: ls -lR
.:
total 8
drwxr-xr-x 3 rnclear rnclear 4096 Aug 24 07:58 dir_19973
drwxr-xr-x 3 rnclear rnclear 4096 Aug 24 07:58 dir_19974
./dir_19973:
total 516
-rwx------ 1 rnclear rnclear 51 Aug 24 07:58 chirp.config
-rwxr-xr-x 1 rnclear rnclear 1093 Aug 24 07:55 condor_exec.exe
-rw-r--r-- 1 rnclear rnclear 1159 Aug 24 07:55 HPL.dat
-rw-r--r-- 1 rnclear rnclear 2214 Aug 24 07:58 ross_error_2.out
-rw-r--r-- 1 rnclear rnclear 0 Aug 24 07:58 ross_output_2.out
drwxr-xr-x 2 rnclear rnclear 4096 Aug 24 07:58 tmp
-rwxr-xr-x 1 rnclear rnclear 500069 Aug 24 07:55 xhpl
./dir_19973/tmp:
total 16
-rw------- 1 rnclear rnclear 887 Aug 24 07:58 2.key
-rw-r--r-- 1 rnclear rnclear 226 Aug 24 07:58 2.key.pub
-rw------- 1 rnclear rnclear 883 Aug 24 07:58 hostkey
-rw-r--r-- 1 rnclear rnclear 226 Aug 24 07:58 hostkey.pub
./dir_19974:
total 516
-rwx------ 1 rnclear rnclear 51 Aug 24 07:58 chirp.config
-rwxr-xr-x 1 rnclear rnclear 1093 Aug 24 07:55 condor_exec.exe
-rw-r--r-- 1 rnclear rnclear 1159 Aug 24 07:55 HPL.dat
-rw-r--r-- 1 rnclear rnclear 2561 Aug 24 07:58 ross_error_3.out
-rw-r--r-- 1 rnclear rnclear 0 Aug 24 07:58 ross_output_3.out
drwxr-xr-x 2 rnclear rnclear 4096 Aug 24 07:58 tmp
-rwxr-xr-x 1 rnclear rnclear 500069 Aug 24 07:55 xhpl
./dir_19974/tmp:
total 16
-rw------- 1 rnclear rnclear 887 Aug 24 07:58 3.key
-rw-r--r-- 1 rnclear rnclear 226 Aug 24 07:58 3.key.pub
-rw------- 1 rnclear rnclear 883 Aug 24 07:58 hostkey
-rw-r--r-- 1 rnclear rnclear 226 Aug 24 07:58 hostkey.pub
What settings might I be missing to allow the NFS nodes to function in my parallel universe? Am I misunderstanding the way NFS should behave? My experience with clusters and NFS is from the PBS environment, where I submit from the directory that all of my input and output are read from and written to (cd $PBS_O_WORKDIR). The MPI and vanilla universes appear to work as expected, but the parallel universe does not. Any thoughts or ideas?
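For reference, this is roughly what I expected to be able to write if the jobs could just use the NFS mounted scratch directory directly. I am assuming should_transfer_files = IF_NEEDED (with matching FILESYSTEM_DOMAIN settings on the submit and execute machines) is the right mechanism, and whether the full path on the arguments line would still be needed is part of what I am unsure about:

universe              = parallel
executable            = mp1script
arguments             = /condor_scratch/rnclear/hpl/bin/Linux_P4_goto/xhpl
initialdir            = /condor_scratch/rnclear/hpl/bin/Linux_P4_goto
# My assumption: IF_NEEDED should skip the transfer when the execute
# machine advertises the same FILESYSTEM_DOMAIN as the submit machine
should_transfer_files = IF_NEEDED
machine_count         = 4
queue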
Richard
--
Richard N. Cleary
Sandia National Laboratories
Dept. 4324 Infrastructure Computing Systems
Email: rnclear@xxxxxxxxxx
Phone: 505.845.7836