
Re: [HTCondor-users] Adding GPUs to machine resources



On Wed, Apr 16, 2014 at 02:41:59PM +0200, Steffen Grunewald wrote:
> But: If the user "forgets" to specify request_gpus (or sets it to 0),
> then CUDA_VISIBLE_DEVICES isn't set *which apparently leaves full access
> to _all_ GPU resources of the machine*. Is this intended? I'd expect 
> something like CUDA_VISIBLE_DEVICES=-1 ...

(see below)
 
> Still running 8.1.4

Here's another quirk (at least I think it is one), taken from the output
(printenv, machine ad, ...) of a job scheduled onto the second of two
GPUs:

$ grep CUDA 3.out 
_CONDOR_AssignedGPUS=CUDA1
CUDA_VISIBLE_DEVICES=1
CUDARuntimeVersion = 5.5
CUDAGlobalMemoryMb = 4800
CUDACapability = 3.5
CUDAECCEnabled = false
CUDADriverVersion = 6.0
CUDADeviceName = "Tesla K20c"
AssignedGPUS = "CUDA1"

Since $CUDA_VISIBLE_DEVICES equals 1, _only_ the second GPU should be visible,
which can be verified as follows:

$ CUDA_VISIBLE_DEVICES=1 /usr/lib/condor/libexec/condor_gpu_discovery 
DetectedGPUs="CUDA0"

Note that in the CUDA_VISIBLE_DEVICES-restricted context, the device "name"
differs from what's announced in the machine ad: CUDA renumbers the devices
that remain visible starting from 0, so the second physical GPU is reported
as "CUDA0".

$ CUDA_VISIBLE_DEVICES=0 /usr/lib/condor/libexec/condor_gpu_discovery 
DetectedGPUs="CUDA0"

- the same string, but referring to a different physical GPU.
Or am I misinterpreting something?

BTW,

$ CUDA_VISIBLE_DEVICES=-1 /usr/lib/condor/libexec/condor_gpu_discovery 
DetectedGPUs=0
$ CUDA_VISIBLE_DEVICES="" /usr/lib/condor/libexec/condor_gpu_discovery 
DetectedGPUs=0

but (this being the behaviour when request_gpus is 0, or not specified at all):

$ unset CUDA_VISIBLE_DEVICES; /usr/lib/condor/libexec/condor_gpu_discovery 
DetectedGPUs="CUDA0, CUDA1"

Moreover, $$(AssignedGPUS) in the submit file's "arguments" line apparently
isn't replaced by a CUDA* string, as the HowToManageGPUs wiki page suggests
it should be...
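
That is, with submit lines roughly like (script name invented):

  universe     = vanilla
  executable   = gputest.sh
  request_gpus = 1
  arguments    = $$(AssignedGPUS)
  queue

the argument apparently reaches the job unexpanded instead of as e.g. "CUDA1".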

- S