
Re: [HTCondor-users] GPUs with htcondor



Hi, Matthias

My understanding of MPS is that the GPU ends up being reserved for a single user, and will not allow
multiple users to share it.   Do you have a mechanism to ensure that multiple jobs that share
the GPU all come from a single user?

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Matthias Schnepf <matthias.schnepf@xxxxxxx>
Sent: Thursday, March 13, 2025 8:21 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] GPUs with htcondor
 

Hi all,

We can run several jobs on the same GPU and limit the GPU memory usage, at least for CUDA applications. We use the NVIDIA MPS service [1], one instance per GPU worker node (WN).
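
A minimal sketch of how such a node-wide MPS daemon can be started (the paths are examples only, not the exact production setup):

# Hypothetical one-time setup on a GPU worker node: start a single MPS control
# daemon that every job on this node will talk to.
export CUDA_MPS_PIPE_DIRECTORY=/var/run/nvidia-mps   # shared by all jobs on the node
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
nvidia-cuda-mps-control -d                           # -d starts the daemon in the background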


A job wrapper sets the necessary environment variables for each job (a sketch of such a wrapper follows below):

  • CUDA_MPS_PIPE_DIRECTORY: location of the MPS socket; the same for every job, because all jobs have to use the same MPS service
  • CUDA_MPS_PINNED_DEVICE_MEM_LIMIT: the GPU memory limit; it differs per job and is extracted from the ClassAd file of the job slot (.machine.ad)

We have never tried what happens when a user changes these environment variables within the job :-\
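
A minimal sketch of what such a wrapper (for example a USER_JOB_WRAPPER) could look like, assuming the slot's share is advertised as GPUMemoryMB in .machine.ad and a node-wide MPS pipe directory; paths, the attribute name, and the limit format are assumptions rather than the exact production configuration:

#!/bin/bash
# Hypothetical job wrapper sketch: derive the per-job GPU memory limit from the
# slot ClassAd and point the job at the node's MPS daemon.
set -eu

# All jobs on the node talk to the same MPS daemon.
export CUDA_MPS_PIPE_DIRECTORY=/var/run/nvidia-mps

# HTCondor writes the slot ClassAd into the scratch directory as .machine.ad.
MACHINE_AD="${_CONDOR_MACHINE_AD:-$_CONDOR_SCRATCH_DIR/.machine.ad}"

# Extract the provisioned GPU memory (in MB) for this dynamic slot.
GPU_MEM_MB=$(awk '$1 == "GPUMemoryMB" {print $3}' "$MACHINE_AD")
: "${GPU_MEM_MB:?GPUMemoryMB not found in machine ad}"

# Limit pinned device memory for this MPS client; device index 0 assumes the
# job only sees the GPU that HTCondor assigned to it (format per the MPS docs).
export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=${GPU_MEM_MB}MB"

# Hand control to the actual job with its original arguments.
exec "$@"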


In HTCondor, we added the machine resource GPUMemoryMB, corresponding to the GPU memory. To ensure that the GPU memory is managed separately for each GPU, we put each GPU into its own partitionable slot. With that, running multi-GPU jobs across several physical GPUs is no longer possible. Since we have only one type of GPU per WN, we can distribute the GPU memory evenly. It is also necessary to define the maximum number of GPU jobs allowed per GPU. We use the "-repeat" option for GPU discovery to make it possible to still schedule whole-GPU jobs on these machines.


The config in the STARTD:

NUM_MAX_GPU_DIVIDE = 8
NUM_GPUS = 8
DEVICE_MEMORY_PER_GPU = 32000

# Total GPU memory (MB) across the machine, advertised as a custom machine resource
MACHINE_RESOURCE_GPUMemoryMB = $(DEVICE_MEMORY_PER_GPU) * $(NUM_GPUS)

# Advertise each physical GPU NUM_MAX_GPU_DIVIDE times so that many jobs can share it
GPU_DISCOVERY_EXTRA = -repeat $(NUM_MAX_GPU_DIVIDE) -packed

# One partitionable slot per physical GPU, each owning that GPU's memory and its repeated GPU instances
SLOT_TYPE_1_PARTITIONABLE = TRUE
SLOT_TYPE_1 = GPUMemoryMB=$(DEVICE_MEMORY_PER_GPU), GPUs=$(NUM_MAX_GPU_DIVIDE), auto
NUM_SLOTS_TYPE_1 = $(NUM_GPUS)


To support the default behavior "request one GPU, get one physical GPU," we do some transformation on the schedd for single-GPU jobs. When a job requests one GPU and does not set RequestGPUMemoryMB, the job transform sets RequestGPUMemoryMB to all the GPU memory of that slot; therefore, no further GPU job can run on that slot. In the current HTCondor version, it could be possible to set this default behavior on the GPU WNs instead of on the schedds via transforms.


JOB_TRANSFORM_NAMES = GPUJobs

JOB_TRANSFORM_GPUJobs @=end
   # For jobs requesting exactly 1 GPU, default RequestGPUMemoryMB to the slot's whole GPU memory
   REQUIREMENTS RequestGPUs =?= 1
   DEFAULT RequestGPUMemoryMB TARGET.TotalSlotGPUMemoryMB
@end

# Check job attributes during submission
SUBMIT_REQUIREMENT_NAMES = GPUMemoryCheck
SUBMIT_REQUIREMENT_GPUMemoryCheck = ifThenElse( isUndefined(RequestGPUMemoryMB), True, RequestGPUMemoryMB > 0 && RequestGPUs =?= 1 )
SUBMIT_REQUIREMENT_GPUMemoryCheck_REASON = "You requested GPUMemoryMB while requesting either no GPU or more than one GPU!"


For more details on the performance and a comparison to MIG, check out the presentation from my colleague Tim [2]. He created most of the configs above.


Best regards,

Matthias


[1] https://docs.nvidia.com/deploy/mps/index.html

[2] https://indico.cern.ch/event/1330797/contributions/5796656/



On 3/10/25 4:16 PM, John M Knoeller via HTCondor-users wrote:
Unfortunately, no. HTCondor cannot enforce GPU memory limits on a process. In fact, it cannot even tell if a job is using more GPU memory than it requested. The GPU memory monitoring data we get from NVIDIA is per-GPU, not per-process.

In practice, if you duplicate GPUs with the -repeat or -divide arguments to condor_gpu_discovery, the jobs have to be well behaved: they should not request more than a single GPU per job, or they should behave as if the total GPU memory they request all comes from a single GPU.

If you duplicate GPUs, HTCondor cannot prevent jobs that share a GPU from interfering with each other, or even detect that it is happening.

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Ram Ban <ramban046@xxxxxxxxx>
Sent: Wednesday, March 5, 2025 12:30 PM
To: Jordan Corbett-Frank <jordan.Corbett-Frank@xxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] GPUs with htcondor
 

Hello All,
Are there any solutions to the above issues in any HTCondor version?
Thanks
Raman


On Sat, Mar 1, 2025, 02:03 Ram Ban <ramban046@xxxxxxxxx> wrote:

I would love a feature that treats GPU RAM like system RAM and fits the maximum number of jobs onto a GPU.
My GPU RAM usage varies a lot across jobs: some require less than 2 GB, but some require around 10 GB. I thought of dividing each GPU into X jobs and giving an appropriate number of GPUs to jobs with higher requirements, but it creates issues when there are multiple GPUs on a machine.

The two major issues I have faced are:

1. How do I put my job on hold if it is using more GPU memory than requested? (I have written custom logic for system RAM using the MemoryUsage attribute on the executor.)

2. If I put 2 jobs on each GPU, each slot having 6 GB of GPU RAM (the total GPU RAM is 12 GB), then my machine with 2 GPUs has 4 slots. Now the first job arrives with a requirement of 4 GB and is assigned to the first slot; then another job arrives with a requirement of 10 GB and gets 2 slots of 1 GPU each. Somehow it uses more of the first GPU's RAM, and both of my jobs get terminated because the GPU RAM is exceeded on the first GPU. I am not able to handle this either.

Thanks 

Raman 


On Sat, Mar 1, 2025, 00:44 Jordan Corbett-Frank <jordan.Corbett-Frank@xxxxxxxxxx> wrote:
Sorry to interject, but I'd like to second the request for partitionable slots to treat GPU memory the same way they do system memory. We'd very much like to be able to pack as many GPU-dependent jobs as possible on the same machine. While we do manually route GPU-dependent jobs to different nodes that are configured to allow a certain number of concurrent jobs to run on them, it's not as flexible as true partitionable slots.

Thank you, 
Jordan 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Todd Tannenbaum via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Friday, February 28, 2025 1:58 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Ram Ban <ramban046@xxxxxxxxx>
Cc: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] GPUs with htcondor
 
On 2/28/2025 11:48 AM, Ram Ban wrote:
Hello all,

I am currently using GPUs with HTCondor version 10.2. Are there any new features related to GPUs in the latest HTCondor version that would help with better scheduling?



Hi Raman,

A few new items that come to mind since v10:

1. Job submit files now have various first-class entries for specifying what type of GPU your job needs (as opposed to embedding this info into expressions). E.g., your job submit file can now specify things like gpus_minimum_memory, gpus_minimum_runtime, gpus_minimum_capability, and gpus_maximum_capability (a submit-file sketch follows this list).

2. GPUs and their properties will now be automatically discovered by default (no need to modify the configuration file on execution points).

3. GPUs that are not scheduled for use by a job are "hidden" from that job, even if the job clobbers/ignores environment variables like CUDA_VISIBLE_DEVICES.

4. Some helpful new commands, like "condor_status -gpus", which shows servers with GPUs and their GPU properties.

5. Improved management of GPUs accessed via Docker jobs.

6. Easier to configure an Execution Point (i.e. worker node) to prefer to run GPU jobs if any are waiting, but to "backfill" with non-GPU jobs if there are no idle GPU jobs waiting to run.

You can see a list of improvements at https://htcondor.org/htcondor/release-highlights/
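
To illustrate item 1, a hypothetical submit-file snippet using these first-class GPU entries might look like the following (the values and assumed units are examples only):

# Sketch of a submit file that constrains which GPUs the job may match.
executable              = train.sh
request_gpus            = 1
gpus_minimum_memory     = 10000     # at least ~10 GB of device memory (value assumed to be in MB)
gpus_minimum_capability = 7.5       # e.g. Turing or newer
gpus_maximum_capability = 9.0
queue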

Also, I am using partitionable slots for GPU, CPU, and RAM; can I use GPU RAM for that as well?


Not currently, but we have been thinking about this.... Are you saying you would like HTCondor to pack GPU jobs onto a single GPU device until the provisioned GPU memory is exhausted? Right now, if you know the GPU workload of your pool, what you can do is configure Execution Points to run X jobs per GPU device (see the sketch below).
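
As a hedged sketch of that approach, an Execution Point could advertise each physical GPU several times via condor_gpu_discovery (the repeat count here is only an example; HTCondor still does not enforce any memory isolation between the jobs that share a device):

# Hypothetical EP configuration: advertise each physical GPU 4 times so that up
# to 4 jobs can be matched to the same device.
GPU_DISCOVERY_EXTRA = $(GPU_DISCOVERY_EXTRA) -repeat 4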

regards,
Todd
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx>
University of Wisconsin-Madison
Center for High Throughput Computing
Department of Computer Sciences
