[HTCondor-users] Adding GPUs to machine resources
- Date: Wed, 12 Mar 2014 16:06:46 +0100
- From: Steffen Grunewald <Steffen.Grunewald@xxxxxxxxxx>
- Subject: [HTCondor-users] Adding GPUs to machine resources
I've been running Condor for more than a decade now, but being 
rather new to the Condor/GPU business, I'm having a hard time now.
Following http://spinningmatt.wordpress.com/2012/11/19, I have tried
to add two GPUs to the resources available to a standalone machine
with a number of CPU cores, by defining in condor_config.d/gpu:
MACHINE_RESOURCE_NAMES    = GPUS
MACHINE_RESOURCE_GPUS     = 2
SLOT_TYPE_1               = cpus=100%,auto
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1          = 1
I added a "request_gpus" line to my - otherwise rather simplistic -
submit file, specifying either 1 or 0.
This works: depending on the amount of free resources (obviously,
the GPUs are the scarcest one), jobs get matched and started.
Checking the output of condor_status -l for the individual dynamic
slots, the numbers look OK.
(I'm wondering whether I'd have to set request_gpus=0 somewhere.
Seems to default to 0 though.)
Now the idea is to tell the job - via arguments, environment,
or a job wrapper - which GPU to use. This is where I ran out of
ideas.
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus
suggests using
  arguments = @...$((AssignedGPUs))
but this macro cannot be expanded at job submission time...
There's no _CONDOR_AssignedGPUs in the "printenv" output.
Even
# grep -i gpu /var/lib/condor/execute/dir_*/.{machine,job}.ad
doesn't show anything that looks helpful.
Adding the line
ENVIRONMENT_FOR_AssignedGpus = CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL
as suggested on the wiki page has no effect at all.
Also, $(LIBEXEC)/condor_gpu_discovery doesn't work as expected:
# /usr/lib/condor/libexec/condor_gpu_discovery [-properties]
modprobe: FATAL: Module nvidia-uvm not found.
2
(and -properties makes no difference)
In the end, I'd like to have up to TotalGpus slots, each with one
or both GPUs assigned to it, and $CUDA_VISIBLE_DEVICES or another
environment variable telling me (and a possible wrapper script)
the device numbers. (I also suppose that a non-GPU slot would
have to set $CUDA_VISIBLE_DEVICES to the empty string or -1?)
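To make the goal concrete, here is a rough sketch of the kind of
wrapper I have in mind. It rests on assumptions I have not been able
to confirm on my setup: that the slot's machine ad contains a line
such as AssignedGPUs = "CUDA0,CUDA1", and that the path to that ad
file is available in $_CONDOR_MACHINE_AD. The helper name
assigned_gpus is made up for illustration.

```shell
#!/bin/sh
# Hypothetical job wrapper: derive CUDA_VISIBLE_DEVICES from the slot's ad.
# Assumptions (unconfirmed): the ad file has a line like
#   AssignedGPUs = "CUDA0,CUDA1"
# and its path is given in $_CONDOR_MACHINE_AD.

# Print the comma-separated device numbers found in the given ad file.
# Prints an empty string if no AssignedGPUs attribute is present.
assigned_gpus() {
    grep -i '^AssignedGPUs *=' "$1" \
        | head -n 1 \
        | sed 's/^[^=]*= *//; s/"//g; s/CUDA//g; s/ //g'
}

ad_file="${_CONDOR_MACHINE_AD:-.machine.ad}"
if [ -r "$ad_file" ]; then
    CUDA_VISIBLE_DEVICES=$(assigned_gpus "$ad_file")
else
    # No ad available: expose no GPUs at all.
    CUDA_VISIBLE_DEVICES=""
fi
export CUDA_VISIBLE_DEVICES

# Run the real job, if one was passed as arguments to the wrapper.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

With this, a slot without GPUs would end up with an empty
$CUDA_VISIBLE_DEVICES, which is the behaviour I am guessing at above.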
In an era of partitionable resources, will I still have to revert
to static assignments of the individual GPUs to static slots? I
hope not (as this doesn't provide an easy means to allocate
both GPUs to a single job)...
Any suggestions?
Thanks,
 S
-- 
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}