Re: [HTCondor-users] Adding GPUs to machine resources
- Date: Wed, 16 Apr 2014 13:31:08 +0200
- From: Steffen Grunewald <Steffen.Grunewald@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] Adding GPUs to machine resources
On Wed, Mar 12, 2014 at 04:06:46PM +0100, Steffen Grunewald wrote:
>
> Following http://spinningmatt.wordpress.com/2012/11/19, I have tried
> to add two GPUs to the resources available to a standalone machine
> with a number of CPU cores, by defining in condor_config.d/gpu:
>
> MACHINE_RESOURCE_NAMES = GPUS
> MACHINE_RESOURCE_GPUS = 2
>
> SLOT_TYPE_1 = cpus=100%,auto
> SLOT_TYPE_1_PARTITIONABLE = TRUE
> NUM_SLOTS_TYPE_1 = 1
>
> I added a "request_gpus" line to my - otherwise rather simplistic -
> submit file, specifying either 1 or 0.
> This works - depending on the availability of free resources (obviously,
> the GPUs are the least abundant one), jobs get matched and started.
> Checking the output of condor_status -l for the individual dynamic
> slots, the numbers look OK.
> (I'm wondering whether I'd have to set request_gpus=0 somewhere.
> Seems to default to 0 though.)
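For reference, a minimal sketch of the submit file in question (the
executable name is just a placeholder):

    universe     = vanilla
    executable   = gpu_job.sh
    request_cpus = 1
    request_gpus = 1
    queue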
>
> Now the idea is to tell the job - via arguments, environment,
> or a job wrapper - which GPU to use. This is where I ran out of
> ideas.
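For illustration, a rough and untested wrapper sketch (to be hooked in
via USER_JOB_WRAPPER; the AssignedGpus attribute name and its
"CUDA0, CUDA1" format are assumed from the slot ads shown by
condor_status -l):

    #!/bin/sh
    # map the slot's AssignedGpus (e.g. "CUDA0, CUDA1") to the
    # device list CUDA expects (e.g. "0,1"), then run the real job
    gpus=$(sed -n 's/^AssignedGpus = "\(.*\)"/\1/p' "$_CONDOR_MACHINE_AD")
    CUDA_VISIBLE_DEVICES=$(echo "$gpus" | sed 's/CUDA//g; s/ //g')
    export CUDA_VISIBLE_DEVICES
    exec "$@"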
>
> https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus
>
> Adding the line
> ENVIRONMENT_FOR_AssignedGpus = CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL
> as suggested on the wiki page has no effect at all.
I eventually found that upgrading from 8.0.5 to 8.1.4 would add
the functionality I was looking for, and even the condor_gpu_discovery
command would yield better results:
root@krakatoa# /usr/lib/condor/libexec/condor_gpu_discovery -properties
modprobe: FATAL: Module nvidia-uvm not found.
DetectedGPUs="CUDA0, CUDA1"
CUDACapability=3.5
CUDADeviceName="Tesla K20c"
CUDADriverVersion=6.0
CUDAECCEnabled=false
CUDAGlobalMemoryMb=4800
CUDARuntimeVersion=5.50
root@krakatoa# /usr/lib/condor/libexec/condor_gpu_discovery -properties -dynamic
modprobe: FATAL: Module nvidia-uvm not found.
DetectedGPUs="CUDA0, CUDA1"
CUDACapability=3.5
CUDADeviceName="Tesla K20c"
CUDADriverVersion=6.0
CUDAECCEnabled=false
CUDAGlobalMemoryMb=4800
CUDARuntimeVersion=5.50
CUDA0FanSpeedPct=36
CUDA0PowerUsage_mw=49804
CUDA0DieTempF=45
CUDA0EccErrorsSingleBit=0
CUDA0EccErrorsDoubleBit=0
CUDA1FanSpeedPct=33
CUDA1PowerUsage_mw=43265
CUDA1DieTempF=44
CUDA1EccErrorsSingleBit=0
CUDA1EccErrorsDoubleBit=0
As the "nvidia" module has been already loaded, the "FATAL" error
seems to have no ill side-effects (and I suppose the stderr output
would be dropped)
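In case it matters, the complaint can be kept out of the attribute
list the usual way, since it goes to stderr only:

    /usr/lib/condor/libexec/condor_gpu_discovery -properties 2>/dev/null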
I'll proceed with MACHINE_RESOURCE_INVENTORY_GPUS, and work my
way through the rest of the configuration...
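As a sketch of where I'm headed (knob spellings pieced together from
the wiki page above, still to be verified against 8.1.4):

    # replaces the static MACHINE_RESOURCE_GPUS = 2 from condor_config.d/gpu
    MACHINE_RESOURCE_INVENTORY_GPUS = /usr/lib/condor/libexec/condor_gpu_discovery -properties
    # publish the assigned devices into the job's environment
    ENVIRONMENT_FOR_AssignedGpus = CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL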
Thanks to all who responded.
- S