I've installed the Condor development series (8.1.4) on execute
nodes that have GPUs installed. The rest of the Condor cluster
is all on 8.0.5. I am following the instructions at
to advertise the GPUs as part of the Machine ClassAd. The
machine is configured as a single partitionable slot with all
CPUs/RAM/GPUs:

Note, in particular, the value of AssignedGPUs. Also note this:

Following a hunch from ticket #3386, I added the -dynamic
argument:

So one issue is that I'm not sure whether AssignedGPUs is
correct. No matter what I do, the following command returns
nothing:
--
Tom Downes
Associate Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678
On Wed, Mar 12, 2014 at 4:06 PM, Steffen Grunewald <Steffen.Grunewald@xxxxxxxxxx> wrote:
>
> I've been running Condor for more than a decade now, but being
> rather new to the Condor/GPU business, I'm having a hard time now.
>
> Following http://spinningmatt.wordpress.com/2012/11/19, I have tried
> to add two GPUs to the resources available to a standalone machine
> with a number of CPU cores, by defining in condor_config.d/gpu:
>
> MACHINE_RESOURCE_NAMES = GPUS
> MACHINE_RESOURCE_GPUS = 2
>
> SLOT_TYPE_1 = cpus=100%,auto
> SLOT_TYPE_1_PARTITIONABLE = TRUE
> NUM_SLOTS_TYPE_1 = 1
>
> I added a "request_gpus" line to my - otherwise rather simplistic -
> submit file, specifying either 1 or 0.
> This works - depending on the amount of free resources (obviously,
> the GPUS are the least abundant one), jobs get matched and started.
> Checking the output of condor_status -l for the individual dynamic
> slots, the numbers look OK.
> (I'm wondering whether I'd have to set request_gpus=0 somewhere.
> Seems to default to 0 though.)
>
> Now the idea is to tell the job - via arguments, environment,
> or a job wrapper - which GPU to use. This is where I ran out of
> ideas.
>
> https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus
> suggests to use
> arguments = @...$((AssignedGPUs))
> but this macro cannot be expanded on job submission...
>
> There's no _CONDOR_AssignedGPUs in the "printenv" output.
>
> Even
> # grep -i gpu /var/lib/condor/execute/dir_*/.{machine,job}.ad
> doesn't show anything that looks helpful.
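
For what it's worth, the same check can be sketched in Python, which makes it easier to see attribute names and values side by side. This is only a rough sketch: it assumes the ad files are in the plain "Name = value" text form, the function name gpu_attributes is made up for illustration, and unlike grep it only looks at attribute names, not values.

```python
import glob
import re

# One "Name = value" line of a plain-text ClassAd file (assumed format).
_ATTR_LINE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*?)\s*$")


def gpu_attributes(ad_text):
    """Return {name: raw value} for attributes whose name contains 'gpu'.

    The match is case-insensitive because ClassAd attribute names are.
    Values are returned unparsed, exactly as written in the file.
    """
    found = {}
    for line in ad_text.splitlines():
        match = _ATTR_LINE.match(line)
        if match and "gpu" in match.group(1).lower():
            found[match.group(1)] = match.group(2)
    return found


if __name__ == "__main__":
    # Same files as the grep above; adjust the execute directory as needed.
    pattern = "/var/lib/condor/execute/dir_*/.{}.ad"
    for kind in ("machine", "job"):
        for path in glob.glob(pattern.format(kind)):
            with open(path) as handle:
                print(path, gpu_attributes(handle.read()))
```

An empty dictionary for the machine ad would suggest the GPU attributes never reach the slot ad handed to the job, rather than a problem with the environment setup.
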
>
> Addition of a line
> ENVIRONMENT_FOR_AssignedGpus = CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL
> as suggested in the wiki page shows no effect at all.
>
> Also, $(LIBEXEC)/condor_gpu_discovery doesn't work as expected:
> # /usr/lib/condor/libexec/condor_gpu_discovery [-properties]
> modprobe: FATAL: Module nvidia-uvm not found.
> 2
> (and -properties makes no difference)
>
> In the end, I'd like to have up to TotalGpus slots with a (or
> both) GPU/s assigned to it/them, and $CUDA_VISIBLE_DEVICES or
> another environment variable telling me (and a possible wrapper
> script) the device numbers. (I also suppose that a non-GPU slot
> would have to set $CUDA_VISIBLE_DEVICES to the empty string or
> -1?)
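
Until the environment is set automatically, a job wrapper along these lines might do as a stopgap. It is a sketch under several assumptions: that the job environment carries _CONDOR_MACHINE_AD pointing at the slot's machine ad, that AssignedGPUs is a quoted, comma-separated list of names ending in the device number (for example "CUDA0,CUDA1"), and that an empty CUDA_VISIBLE_DEVICES is acceptable for a slot with no GPUs. The script and function names are made up for illustration.

```python
import os
import re
import sys


def cuda_visible_devices(ad_text):
    """Derive a CUDA_VISIBLE_DEVICES value from the AssignedGPUs attribute.

    Returns "" when the attribute is missing or lists no devices.
    Assumes each assigned name ends in its device number ("CUDA1" -> "1").
    """
    match = re.search(
        r'^\s*AssignedGPUs\s*=\s*"([^"]*)"',
        ad_text,
        re.MULTILINE | re.IGNORECASE,
    )
    if match is None:
        return ""
    indices = []
    for name in match.group(1).split(","):
        tail = re.search(r"(\d+)$", name.strip())
        if tail:
            indices.append(tail.group(1))
    return ",".join(indices)


def main(argv):
    # Read the machine ad if the (assumed) environment variable points at one.
    ad_path = os.environ.get("_CONDOR_MACHINE_AD")
    ad_text = ""
    if ad_path and os.path.exists(ad_path):
        with open(ad_path) as handle:
            ad_text = handle.read()
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=cuda_visible_devices(ad_text))
    # Replace the wrapper process with the real job command.
    os.execvpe(argv[1], argv[1:], env)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

It would be invoked as "gpu_wrapper.py real_command args...", with the wrapper as the job's executable. As far as I know an empty CUDA_VISIBLE_DEVICES hides all devices from the CUDA runtime, so the empty string rather than -1 is what the sketch uses for non-GPU slots.
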
>
> In an era of partitionable resources, will I still have to revert
> to static assignments of the individual GPUs to static slots? I
> don't hope so (as this doesn't provide an easy means to allocate
> both GPUs to a single job)...
>
> Any suggestions?
>
> Thanks,
> S
>
> --
> Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
> MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
> http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/