Re: [Condor-users] Parallel MPI Job
- Date: Thu, 13 Sep 2012 14:29:24 -0400
- From: Jon Thomas <jthomas@xxxxxxxxxx>
- Subject: Re: [Condor-users] Parallel MPI Job
On Sun, 2012-09-02 at 10:29 -0700, patrick cadelina wrote:
> Hi,
>
> I'm trying to run a simple parallel MPI hello world on condor but I
> keep getting errors. My code works using mpirun. Here's my submit
> file:
>
> universe = parallel
> requirements = (TARGET.OpSys=="LINUX" && TARGET.Arch=="INTEL")
> executable = mp2script
> arguments = hello
> log = hello.log
> output = hello.out
> error = hello.err
> machine_count = 2
> should_transfer_files = yes
> when_to_transfer_output = on_exit
> transfer_input_files = hello
> +ParallelShutdownPolicy = "WAIT_FOR_ALL"
> queue
>
> And here's the error that I get from the generated files:
> mpd.out.0:
> /var/lib/condor/execute/dir_3282/condor_exec.exe: 60: /var/lib/condor/execute/dir_3282/condor_exec.exe: mpd: not found
>
> mpd.out.1:
> /var/lib/condor/execute/dir_5103/condor_exec.exe: 101: /var/lib/condor/execute/dir_5103/condor_exec.exe: mpd: not found
>
When you say your code works outside of Condor using mpirun, yet the errors from mp2script show that mpd is not installed, that tells me mpirun is using a different process manager than mpd (which is a good thing, IMHO).
Before pursuing an mpd installation, I would check whether another process manager is available. As I recall, some MPI implementations have a mechanism to run in mpich1 mode, which doesn't use mpd. You might want to look at your mpirun or mpiexec man page to see whether you have that option, or the option to use the hydra launcher.
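One quick way to check: a Hydra-based MPICH2 build of mpiexec will report its build details, including the process manager and the launchers it supports (the exact output varies by version):

# Hydra builds print "HYDRA build details", with the available
# launchers (ssh, rsh, fork, ...) listed among them.
mpiexec -info

# The man page documents the launcher-related options as well.
man mpiexec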
Here's a script (to replace mp2script) that I've used with Intel MPI to avoid mpd. For MPICH2 (MPDIR=/usr/lib64/mpich2/bin), where the launcher was hydra, I've also replaced the mpirun line in the same script with:
mpiexec -launcher ssh -n $_CONDOR_NPROCS -f ${MACHINE_FILE} $EXECUTABLE "$@"
#!/bin/sh
# Replacement for mp2script: run an MPI job under the Condor parallel
# universe without relying on mpd.

# Point MPDIR at your MPI installation's bin directory.
MPDIR=/product/Fortran_MPI/intel64/bin
PATH=$MPDIR:.:$PATH
export PATH

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS

# As in the stock mp2script, the first argument is the MPI binary to
# run; the rest are its arguments. The transferred binary loses its
# execute bit, so restore it.
EXECUTABLE=$1
shift
chmod +x $EXECUTABLE

# Remove the contact file, so if we are held and released
# it can be recreated anew.
rm -f $CONDOR_CONTACT_FILE

PATH=`condor_config_val libexec`/:$PATH

if [ $_CONDOR_PROCNO -eq 0 ]
then
    echo "setting up"
    echo $_CONDOR_NPROCS

    # Ask condor_chirp for the slots assigned to this job, a quoted
    # comma-separated list like "slot1@host1,slot2@host2".
    SLOTS=$($(condor_config_val libexec)/condor_chirp get_job_attr AllRemoteHosts)

    # Strip the quotes and the slot names, leaving one hostname per line.
    MACHINE_FILE="${_CONDOR_SCRATCH_DIR}/hosts"
    echo $SLOTS | sed -e 's/\"\(.*\)\".*/\1/' -e 's/,/\n/g' | tr "@" "\n" | grep -v slot >> ${MACHINE_FILE}

    echo "---"
    cat ${MACHINE_FILE}
    echo "---"
    echo "running job"

    ## run the actual mpi job in mpich1 mode
    mpirun -f ${MACHINE_FILE} -machinefile ${MACHINE_FILE} -n $_CONDOR_NPROCS $EXECUTABLE "$@"
    e=$?

    # Give the other nodes a chance to shut down cleanly before exiting.
    sleep 20
    echo "first node out"
    echo $e
else
    echo "second node out"
fi
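For completeness, here's how the pieces line up with your submit file: the script above goes in place of mp2script, the MPI binary travels via transfer_input_files, and its name arrives as the first argument (a sketch based on your original submit file; keep your requirements line as needed):

universe = parallel
executable = mp2script
arguments = hello
machine_count = 2
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = hello
+ParallelShutdownPolicy = "WAIT_FOR_ALL"
queue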
> Any help would be appreciated. Thanks!
>
> Regards,
> Pat