Hello,
I need some help understanding how the parallel universe interacts with a shared file system. I currently have a pool of machines that NFS mount a 5.3 TByte file system for users to run their jobs out of. I am now able to run MPI/parallel jobs across the pool, but I noticed something odd about the file system behavior. I previously reported a chirp error in my parallel environment, and to fix it I was told to put the following entries in my submit description file:
when_to_transfer_output = on_exit
should_transfer_files = yes
From the NFS mounted scratch directory I issued the condor_submit command in the parallel universe. The job failed and reported an error to the effect that the executable was nowhere to be found, even though it existed in the directory I had submitted from. I read the docs, added the following line to the submit script, and the job began working:
transfer_input_files = xhpl,HPL.dat
I also had to put the full path to the executable on the arguments line, because the mp1script I am using reported that it couldn't find the xhpl binary either.
arguments = /condor_scratch/rnclear/hpl/bin/Linux_P4_goto/xhpl
initialdir = /condor_scratch/rnclear/hpl/bin/Linux_P4_goto
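Putting all of that together, the relevant part of my submit description file now looks roughly like this (the machine_count, output/error/log names, and queue statement below are reconstructed from memory rather than copied verbatim):

# machine_count and the output/error/log lines here are illustrative
universe                = parallel
executable              = mp1script
arguments               = /condor_scratch/rnclear/hpl/bin/Linux_P4_goto/xhpl
initialdir              = /condor_scratch/rnclear/hpl/bin/Linux_P4_goto
transfer_input_files    = xhpl,HPL.dat
should_transfer_files   = yes
when_to_transfer_output = on_exit
machine_count           = 4
output                  = ross_output_$(Node).out
error                   = ross_error_$(Node).out
log                     = hpl.log
queue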
This is the contents of my mp1script:
#!/bin/sh -x
_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS
CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh
SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh
. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
# If not the head node, just sleep forever, to let the
# sshds run
if [ $_CONDOR_PROCNO -ne 0 ]
then
    wait
    sshd_cleanup
    exit 0
fi
EXECUTABLE=$1
shift
# The binary is copied but the executable flag is cleared,
# so the script has to take care of this.
chmod +x $EXECUTABLE
# Set this to the bin directory of MPICH installation
MPDIR=/usr/local/mpi/mpich/32Bit/1.2.4/gcc-3.4.3/bin/
PATH=$MPDIR:.:$PATH
export PATH
export P4_RSHCOMMAND=$CONDOR_SSH
CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE
# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n +0 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
## run the actual mpijob
##sleep 120
mpirun -v -np $_CONDOR_NPROCS -machinefile machines $EXECUTABLE $@
sshd_cleanup
#rm -f machines
exit $?
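To make the sort/awk line near the end of the script concrete: the contact file has one line per node, with the node number in the first field and the machine name in the second, and the machines file ends up being just the host column in node order. Something like this (the second hostname and the omitted fields are made up for illustration):

# contact (hostnames partly invented; remaining fields omitted)
0 sahp4335 ...
1 sahp4336 ...

# machines, as produced by the sort/awk line above
sahp4335
sahp4336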
I poked around a little more and noticed that on the nodes where the job was running, the execute directories below contained my binary and the input file. Also located there were the error/output files that were copied back when the job finished.
sahp4335[rnclear]: cd /opt/condor/local.sahp4335/execute/
sahp4335[rnclear]: ls
dir_19973 dir_19974
sahp4335[rnclear]: ls -lR
.:
total 8
drwxr-xr-x 3 rnclear rnclear 4096 Aug 24 07:58 dir_19973
drwxr-xr-x 3 rnclear rnclear 4096 Aug 24 07:58 dir_19974
./dir_19973:
total 516
-rwx------ 1 rnclear rnclear 51 Aug 24 07:58 chirp.config
-rwxr-xr-x 1 rnclear rnclear 1093 Aug 24 07:55 condor_exec.exe
-rw-r--r-- 1 rnclear rnclear 1159 Aug 24 07:55 HPL.dat
-rw-r--r-- 1 rnclear rnclear 2214 Aug 24 07:58 ross_error_2.out
-rw-r--r-- 1 rnclear rnclear 0 Aug 24 07:58 ross_output_2.out
drwxr-xr-x 2 rnclear rnclear 4096 Aug 24 07:58 tmp
-rwxr-xr-x 1 rnclear rnclear 500069 Aug 24 07:55 xhpl
./dir_19973/tmp:
total 16
-rw------- 1 rnclear rnclear 887 Aug 24 07:58 2.key
-rw-r--r-- 1 rnclear rnclear 226 Aug 24 07:58 2.key.pub
-rw------- 1 rnclear rnclear 883 Aug 24 07:58 hostkey
-rw-r--r-- 1 rnclear rnclear 226 Aug 24 07:58 hostkey.pub
./dir_19974:
total 516
-rwx------ 1 rnclear rnclear 51 Aug 24 07:58 chirp.config
-rwxr-xr-x 1 rnclear rnclear 1093 Aug 24 07:55 condor_exec.exe
-rw-r--r-- 1 rnclear rnclear 1159 Aug 24 07:55 HPL.dat
-rw-r--r-- 1 rnclear rnclear 2561 Aug 24 07:58 ross_error_3.out
-rw-r--r-- 1 rnclear rnclear 0 Aug 24 07:58 ross_output_3.out
drwxr-xr-x 2 rnclear rnclear 4096 Aug 24 07:58 tmp
-rwxr-xr-x 1 rnclear rnclear 500069 Aug 24 07:55 xhpl
./dir_19974/tmp:
total 16
-rw------- 1 rnclear rnclear 887 Aug 24 07:58 3.key
-rw-r--r-- 1 rnclear rnclear 226 Aug 24 07:58 3.key.pub
-rw------- 1 rnclear rnclear 883 Aug 24 07:58 hostkey
-rw-r--r-- 1 rnclear rnclear 226 Aug 24 07:58 hostkey.pub
What settings might I be missing to allow the NFS nodes to function in my parallel universe? Am I misunderstanding the way NFS should behave? My experience with clusters and NFS is from the PBS environment, where I submit from the directory that all of my input and output are read from and written to (cd $PBS_O_WORKDIR). The MPI and vanilla universes appear to work as expected, but the parallel universe does not. Any thoughts or ideas?
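For reference, this is roughly what I expected to be able to write if the jobs could just use the NFS mounted scratch directory directly. I am assuming should_transfer_files = IF_NEEDED (with matching FILESYSTEM_DOMAIN settings on the submit and execute machines) is the right mechanism, and whether the full path on the arguments line would still be needed is part of what I am unsure about:

universe              = parallel
executable            = mp1script
arguments             = /condor_scratch/rnclear/hpl/bin/Linux_P4_goto/xhpl
initialdir            = /condor_scratch/rnclear/hpl/bin/Linux_P4_goto
# My assumption: IF_NEEDED should skip the transfer when the execute
# machine advertises the same FILESYSTEM_DOMAIN as the submit machine
should_transfer_files = IF_NEEDED
machine_count         = 4
queue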
Richard
--
Richard N. Cleary
Sandia National Laboratories
Dept. 4324 Infrastructure Computing Systems
Email: rnclear@xxxxxxxxxx
Phone: 505.845.7836