Re: [Condor-users] MPI Jobs on Condor
- Date: Sat, 16 Apr 2011 15:17:37 -0400
- From: Erik Aronesty <erik@xxxxxxx>
- Subject: Re: [Condor-users] MPI Jobs on Condor
There are a number of reasons why this can occur. The most insidious
is a DNS issue: make sure DNS is consistent across all of your
machines. I posted about this a while ago with more details.
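A quick way to sanity-check that (an untested sketch; node01 through
node03 are stand-ins for your execute hosts) is to compare each host's
idea of its own name with what DNS says about it:

#!/bin/sh
# For each execute host, print its FQDN and then the DNS lookup of
# that FQDN; any mismatch between the two is a red flag.
for h in node01 node02 node03; do
    echo "== $h =="
    ssh "$h" 'hostname -f; host `hostname -f`'
done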
Check your logs on the executing hosts; they should explain what
happened. Usually the job turns out to be a "vacated vanilla" job.
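To see what the startd decided, something like this on the execute
host should do it (a sketch; log locations vary by install, but
/var/log/condor is a common default):

# look for vacate decisions in the startd and starter logs
grep -i vacate /var/log/condor/StartLog
tail -n 100 /var/log/condor/StarterLog*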
(The default behavior of vacating vanilla jobs by killing them is
counter-intuitive for an intranet... but logical for idle-cycle
processing.)
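If your pool is dedicated and you never want running jobs preempted,
something like this in condor_config.local on each execute node should
help (a sketch using the standard startd policy knobs; test it on one
node before rolling it out pool-wide):

# never suspend, preempt, or kill jobs running on this node
SUSPEND  = FALSE
PREEMPT  = FALSE
KILL     = FALSE
CONTINUE = TRUE

Then run condor_reconfig on each node so the startd picks up the
change.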
On Fri, Apr 15, 2011 at 9:14 PM, Asad Ali <asad06@xxxxxxxxx> wrote:
> Hi all,
>
> I am using Condor to run my MPI jobs on a large cluster of nodes. The jobs
> run fine, but after some time they automatically get restarted. What could
> be the reason?
>
> My MPI wrapper script is as follows.
> ___________________________________________________________________________________
> #!/bin/sh
>
> EXECUTABLE=$1
>
> CONDOR_CHIRP=`condor_config_val libexec`/condor_chirp
>
> contact_dir=/atlas/user/atlas1/`whoami`/Condor/MPI/contact
>
> mkdir -p $contact_dir
>
> thisrun=`echo $_CONDOR_REMOTE_SPOOL_DIR | sed 's!^.*/cluster\([0-9]*\).*!\1!'`
>
> contact=$contact_dir/$thisrun
>
> hostname | $CONDOR_CHIRP put -mode cwa - $contact
>
>
> if [ $_CONDOR_PROCNO -eq 0 ]; then
>     while [ "`awk 'END { print NR }' $contact`" -lt $_CONDOR_NPROCS ]; do
>         echo WAITING
>         sleep 1
>     done
>     /usr/bin/mpirun.openmpi -v -np $_CONDOR_NPROCS -machinefile $contact $EXECUTABLE $@
>     sleep 300
>     rm -f $contact
> else
>     wait
>     exit $?
> fi
>
> exit $?
> _________________________________________________________________________________________
>
> My condor_submit file is
> _________________________________________________________________________________________
> ######################
> # Condor submit file #
> ######################
> universe = parallel
> executable = /usr/local/bin/atlas_openmpi_wrapper
> arguments = /home/asad/MLDC4/lfakw4b1
> machine_count = 10
> should_transfer_files = yes
> when_to_transfer_output = on_exit
> transfer_input_files = /home/asad/MLDC4/lfakw4b1
> +ParallelShutdownPolicy = "WAIT_FOR_ALL"
> log = /home/asad/MLDC4/logfiles/lfakw4b1.log
> output = /home/asad/MLDC4/logfiles/lfakw4b1.log.$(NODE).out
> error = /home/asad/MLDC4/logfiles/lfakw4b1.log.$(NODE).error
> environment = "MPI_NRPROCS=10 JOB=1"
> queue
> _________________________________________________________________________________________
>
> The MPI version is Open MPI 1.2.7rc2. The problem is that the jobs start,
> run for a while, and then suddenly restart by themselves.
>
> Cheers,
>
> Asad
>
> --
> "A Bayesian is one who, vaguely expecting a horse, and catching a glimpse
> of a donkey, strongly believes he has seen a mule."
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/