Hi Eric,
we also have a number of jobs on our cluster that do not use a full GPU. We ended up with a solution that is rather specialized to our use case, but maybe it happens to align with yours.

For each GPU on a machine, we have one partitionable job slot. This is one limitation of our approach, as it means we have to associate a certain fraction of RAM and CPU cores with each GPU, and that the same machine cannot run jobs that require multiple GPUs. We add an additional resource to the job slot, which we name GPUMemory. A user can request either a full GPU as usual or a certain amount of GPU memory (or, of course, neither for non-GPU jobs). In our job START expression we make sure these requests don't collide, i.e. a full GPU can't be requested if part of its memory is already in use, and vice versa. The job slot also has a configuration variable that identifies the associated GPU, which is used to set the CUDA_VISIBLE_DEVICES environment variable in a user job wrapper. Finally, we have a monitoring script for the used GPU memory, so that condor kills jobs using more memory than they requested.
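To give an idea of what this looks like, here is a rough sketch of that kind of configuration for a machine with a single 16 GB GPU (the numbers, the AssignedGpuDevice attribute name and the slot layout are illustrative placeholders rather than our literal config, and the collision check in the START expression is only hinted at):

# one partitionable slot bound to one GPU, with a custom GPUMemory
# resource (in MB) next to the usual cpus/memory/gpus
MACHINE_RESOURCE_GPUMemory = 16000
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=8, memory=32768, gpus=1, GPUMemory=16000
SLOT_TYPE_1_PARTITIONABLE = TRUE

# advertise which physical device the slot owns, so the user job wrapper
# can export CUDA_VISIBLE_DEVICES from it (attribute name is made up)
AssignedGpuDevice = "0"
STARTD_ATTRS = $(STARTD_ATTRS) AssignedGpuDevice

# the START expression additionally has to reject a whole-GPU request
# while part of the GPU's memory is already in use, and vice versa
# (that clause is omitted here)

Jobs would then ask for the custom resource with request_GPUMemory = <MB> in their submit files instead of request_GPUs = 1.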
In case this sounds like something that would make sense for you, I can collect the configuration parts and share them here.

Best regards,
Yannik

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Tuesday, 24 November 2020 18:08
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Running multiple jobs simultaneously on a single GPU

Hi Eric.
NVIDIA is adding the ability to share a GPU between processes on newer hardware, with hardware-enforced memory isolation between them. HTCondor does plan to support that, but it does not yet, and I don’t think the NVIDIA devices that support this are very common yet. This is work in progress…
However, you can share a GPU between processes *without* any kind of protection between them, simply by having more than one process set the environment variable CUDA_VISIBLE_DEVICES to the same value.
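For example, started by hand (the script names are just placeholders), both of these processes would land on the same physical device and share its memory with no isolation between them:

CUDA_VISIBLE_DEVICES=0 python job_a.py &
CUDA_VISIBLE_DEVICES=0 python job_b.py &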
You can get HTCondor to do this just by having the same device show up more than once in the device enumeration.
For instance, if you have two GPUs and your configuration is
MACHINE_RESOURCE_GPUS = CUDA0, CUDA1
You can run two jobs on each GPU by configuring
MACHINE_RESOURCE_GPUS = CUDA0, CUDA1, CUDA0, CUDA1
If you don’t use the MACHINE_RESOURCE_GPUS knob, and instead use HTCondor’s GPU detection, you can use the same trick; it’s just a little more work.
# enable GPU discovery
use FEATURE : GPUs

# then override the GPU device enumeration with a wrapper script
# that duplicates the detected GPUs
MACHINE_RESOURCE_INVENTORY_GPUs = $(ETC)/bin/condor_gpu_discovery.sh $(1) -properties $(GPU_DISCOVERY_EXTRA)
The wrapper script $(ETC)/bin/condor_gpu_discovery.sh is something that you need to write.
condor_gpu_discovery produces output like this
DetectedGPUs="CUDA0, CUDA1"
CUDACapability=6.0
CUDADeviceName="Tesla P100-PCIE-16GB"
CUDADriverVersion=11.0
CUDAECCEnabled=true
CUDAGlobalMemoryMb=16281
CUDAMaxSupportedVersion=11000
CUDA0DevicePciBusId="0000:3B:00.0"
CUDA0DeviceUuid="dddddddd-dddd-dddd-dddd-dddddddddddd"
CUDA1DevicePciBusId="0000:D8:00.0"
CUDA1DeviceUuid="cccccccc-cccc-cccc-cccc-cccccccccccc"
Your wrapper script should produce the same output, but with a modified value for DetectedGPUs like this
DetectedGPUs="CUDA0, CUDA1, CUDA0, CUDA1"
CUDACapability=6.0
CUDADeviceName="Tesla P100-PCIE-16GB"
CUDADriverVersion=11.0
CUDAECCEnabled=true
CUDAGlobalMemoryMb=16281
CUDAMaxSupportedVersion=11000
CUDA0DevicePciBusId="0000:3B:00.0"
CUDA0DeviceUuid="dddddddd-dddd-dddd-dddd-dddddddddddd"
CUDA1DevicePciBusId="0000:D8:00.0"
CUDA1DeviceUuid="cccccccc-cccc-cccc-cccc-cccccccccccc"
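A minimal sketch of such a wrapper is below. The path to the real condor_gpu_discovery binary is an assumption for a typical Linux install (adjust it for your site), and the rewrite could just as well be done with sed or a small Python script:

#!/bin/sh
# $(ETC)/bin/condor_gpu_discovery.sh (sketch):
# run the real GPU discovery tool, forwarding the arguments HTCondor
# passes in, then rewrite the DetectedGPUs line so every device id
# appears twice; all other output lines pass through unchanged.
/usr/libexec/condor/condor_gpu_discovery "$@" | awk '
    /^DetectedGPUs=/ {
        list = $0
        sub(/^DetectedGPUs=/, "", list)   # strip the attribute name
        gsub(/"/, "", list)               # and any surrounding quotes
        printf "DetectedGPUs=\"%s, %s\"\n", list, list
        next
    }
    { print }
'

HTCondor then believes the machine has four GPU resources, so two jobs can match each physical device.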
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Eric Sedore via HTCondor-users
Good evening everyone,
I’ve listened to a few presentations that mentioned there is a way (either ready now or planned) to allow multiple jobs to utilize a single GPU. This would be helpful as we have a number of workloads/jobs that do not consume the entire GPU (memory or processing). Is there documentation (apologies if I missed it) that would assist with how to set up this configuration?
Happy to provide more of a description if my question is not clear.
Thanks,
-Eric