Re: [HTCondor-users] htcondor + gpudirect + openmpi
- Date: Fri, 15 Sep 2017 10:01:54 +0530 (IST)
- From: Malathi Deenadayalan <malathi@xxxxxxxx>
- Subject: Re: [HTCondor-users] htcondor + gpudirect + openmpi
Hello all,
Can you tell me how this works, and which file we have to edit?
Regards,
Malathi
----- Original Message -----
From: "Harald van Pee" <pee@xxxxxxxxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, September 12, 2017 8:31:03 PM
Subject: Re: [HTCondor-users] htcondor + gpudirect + openmpi
Hello all,
I think I now have a working version for all cases:
CONDOR_CHIRP=`condor_config_val libexec`
CONDOR_CHIRP=$CONDOR_CHIRP/condor_chirp
ncpus=`$CONDOR_CHIRP get_job_attr RequestCpus`
ngpus=`$CONDOR_CHIRP get_job_attr RequestGpus`
...
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
#sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' > machines
for(( i=1 ; i <$ngpus ; i++)) ; do
echo i= $i
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' >> machines
# sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $1}' >> machines
done;
...
nmpinodes=$(( $ngpus * $_CONDOR_NPROCS))
...
mpirun -v --prefix $MPDIR --mca $mca_ssh_agent $CONDOR_SSH -n $nmpinodes -hostfile machines $EXECUTABLE $@ &
However, I have to use the old condor_ssh version from HTCondor 8.4, which uses
hostnames rather than proc numbers (in fact, I just changed those parts back).
If I do not use hostnames, the following can happen: when a
request_cpus=1/request_gpus=1 job lands several times on one machine, an sshd
is running there, and mpirun starts all jobs on that machine and completely
ignores all the others.
Therefore I think we need hostnames in the machine file, because mpirun cannot
handle proc numbers.
Why was it changed? Any other pitfalls?
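For readers following along, the machine-file step from the snippet can be sketched as a small helper. This is only a sketch, not the script shipped with HTCondor: the function name build_hostfile is hypothetical, and it assumes each contact-file line starts with "<proc number> <hostname>", which is what the awk column choice ($2) in the snippet suggests.

```shell
#!/bin/sh
# Sketch: print the host list from a contact file once per requested GPU,
# so that every host appears ngpus times in the resulting machine file.
# Assumption: each contact-file line is "<proc number> <hostname> ...".
# The ( ... ) function body runs in a subshell, keeping the variables local.
build_hostfile() (
    contact_file=$1
    ngpus=$2
    i=0
    while [ "$i" -lt "$ngpus" ]; do
        # Order hosts by proc number and keep only the hostname column.
        sort -n -k 1 "$contact_file" | awk '{ print $2 }'
        i=$((i + 1))
    done
)
```

It would be called as build_hostfile "$CONDOR_CONTACT_FILE" "$ngpus" > machines, with the -n value still computed as $ngpus * $_CONDOR_NPROCS. Unlike the original loop, this sketch prints nothing when ngpus is 0.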
Best
Harald
On Monday 11 September 2017 23:10:09 Harald van Pee wrote:
> On Monday 11 September 2017 22:10:18 Michael Pelletier wrote:
> > I've been using the job ad file for non-dynamic values. For dynamic stuff
> > you could use condor_chirp get_job_attr.
> >
> > condor_q -jobads $_CONDOR_JOB_AD -autoformat RequestCpus
>
> Thanks!
>
> > -Michael Pelletier.
> >
> > > -----Original Message-----
> > > From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On
> > > Behalf Of Harald van Pee
> > > Sent: Monday, September 11, 2017 3:45 PM
> > > To: htcondor-users@xxxxxxxxxxx
> > > Subject: Re: [HTCondor-users] htcondor + gpudirect + openmpi
> > >
> > > Hi Jason,
> > >
> > > I think I had done something wrong before, or it only works now that
> > > I have installed Mellanox OFED 4.1.
> > >
> > > So far I have tested it only with openmpi-2.0.2a1, but at least for
> > > this version it works if I loop over the requested GPUs to add more
> > > lines to the machine file, and for the -n argument I have to multiply
> > > $_CONDOR_NPROCS by the number of requested GPUs.
> > >
> > > How do I get the number of requested GPUs in the script?
> > > At the moment I would parse
> > > _CONDOR_AssignedGPUs and count the entries.
> > >
> > > OMP_NUM_THREADS can be used to get the request_cpus value, but in
> > > general this is not the number of mpinodes per node.
> > >
> > > So far I have only tested with CPUs, but I hope I can start with real
> > > GPU jobs soon.
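The counting idea mentioned in the quoted message could look roughly like the sketch below. The function name count_assigned_gpus is hypothetical, and the sketch assumes the value of _CONDOR_AssignedGPUs is a comma-separated list such as "CUDA0,CUDA1"; check the actual value in your job environment before relying on it.

```shell
#!/bin/sh
# Sketch: count the entries in a comma-separated GPU list.
# Assumption: the list looks like "CUDA0,CUDA1" (comma-separated names).
count_assigned_gpus() {
    # An empty value means no GPU was assigned.
    if [ -z "$1" ]; then
        echo 0
        return 0
    fi
    # Put each entry on its own line, then count the non-empty lines.
    printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}
```

In a job script this might be used as ngpus=$(count_assigned_gpus "${_CONDOR_AssignedGPUs:-}"). The condor_chirp get_job_attr RequestGpus call shown later in the thread returns the requested value instead, which may be the simpler choice.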
> >
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with
> > a subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/