[Condor-users] MPI Jobs on Condor
- Date: Sat, 16 Apr 2011 13:14:54 +1200
- From: Asad Ali <asad06@xxxxxxxxx>
- Subject: [Condor-users] MPI Jobs on Condor
Hi all,
I am using Condor to run my MPI jobs on a large cluster of nodes. The
jobs run fine, but after some time they are automatically restarted.
What could be the reason?
My mpi-wrapper is scripted as follows.
___________________________________________________________________________________
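#!/bin/sh
# First argument is the MPI program to launch.
EXECUTABLE=$1
CONDOR_CHIRP=`condor_config_val libexec`/condor_chirp
# Shared directory where every node records its hostname.
contact_dir=/atlas/user/atlas1/`whoami`/Condor/MPI/contact
mkdir -p $contact_dir
# Extract the cluster number from the spool directory path.
thisrun=`echo $_CONDOR_REMOTE_SPOOL_DIR | sed 's!^.*/cluster\([0-9]*\).*!\1!'`
contact=$contact_dir/$thisrun
# Append this node's hostname to the contact file (create, write, append).
hostname | $CONDOR_CHIRP put -mode cwa - $contact
if [ $_CONDOR_PROCNO -eq 0 ]; then
    # Node 0 waits until all nodes have registered, then starts mpirun.
    while [ "`awk 'END { print NR }' $contact`" -lt $_CONDOR_NPROCS ]; do
        echo WAITING
        sleep 1
    done
    # Note: $@ still contains $1, so the executable path is also passed
    # to the program as its first argument.
    /usr/bin/mpirun.openmpi -v -np $_CONDOR_NPROCS -machinefile $contact $EXECUTABLE $@
    sleep 300
    rm -f $contact
else
    # Note: this shell has no background children, so wait returns at once.
    wait
    exit $?
fi
exit $?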
_________________________________________________________________________________________
My condor_submit file is
_________________________________________________________________________________________
######################
# Condor submit file #
######################
universe = parallel
executable = /usr/local/bin/atlas_openmpi_wrapper
arguments = /home/asad/MLDC4/lfakw4b1
machine_count = 10
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = /home/asad/MLDC4/lfakw4b1
+ParallelShutdownPolicy = "WAIT_FOR_ALL"
log = /home/asad/MLDC4/logfiles/lfakw4b1.log
output = /home/asad/MLDC4/logfiles/lfakw4b1.log.$(NODE).out
error = /home/asad/MLDC4/logfiles/lfakw4b1.log.$(NODE).error
environment = "MPI_NRPROCS=10 JOB=1"
queue
_________________________________________________________________________________________
The MPI version is Open MPI 1.2.7rc2. The problem is that the jobs start, run for a while, and then suddenly restart by themselves.
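One way to narrow this down is to check the job log named in the submit file for eviction, shadow-exception, hold, or reconnect-failure events around the time of each restart. Below is a minimal sketch, assuming the usual user-log layout in which each event line begins with a three-digit event code (004 evicted, 007 shadow exception, 012 held, 024 reconnect failed). The function name count_restart_events is hypothetical, and which events actually appear may vary with the universe and the Condor version.

```shell
#!/bin/sh
# Hypothetical helper: summarize events in a Condor user log that often
# accompany a job being restarted. Usage: count_restart_events LOGFILE
count_restart_events() {
    # Event lines are assumed to start with a three-digit code and a space.
    awk '
        /^004 / { evicted++ }   # job was evicted
        /^007 / { shadow++ }    # shadow exception
        /^012 / { held++ }      # job was held
        /^024 / { reconn++ }    # reconnect to the execute node failed
        END {
            # Adding 0 forces unset counters to print as 0.
            printf "evicted=%d shadow_exception=%d held=%d reconnect_failed=%d\n",
                   evicted + 0, shadow + 0, held + 0, reconn + 0
        }
    ' "$1"
}
```

Calling count_restart_events /home/asad/MLDC4/logfiles/lfakw4b1.log prints a single summary line. Non-zero counts would point toward the scheduler or the execute nodes rather than the MPI program itself, although this is only a first check and not a diagnosis.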
Cheers,
Asad
--
"A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule."