
Re: [HTCondor-users] GPUs with htcondor



Sorry to interject, but I'd like to second the request for partitionable slots to treat GPU memory the same way they treat system memory. We'd very much like to be able to pack as many GPU-dependent jobs as possible onto the same machine. While we do manually route GPU-dependent jobs to nodes configured to allow a certain number of concurrent jobs, that approach is not as flexible as true partitionable slots.

Thank you, 
Jordan 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Todd Tannenbaum via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Friday, February 28, 2025 1:58 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Ram Ban <ramban046@xxxxxxxxx>
Cc: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] GPUs with htcondor
 
On 2/28/2025 11:48 AM, Ram Ban wrote:
Hello all,

I am currently using GPUs with HTCondor version 10.2. Are there any new GPU-related features in the latest HTCondor version that would help with better scheduling?



Hi Ram,

A few new items that come to mind since v10:

1. Job submit files now have various first-class entries for specifying what type of GPU your job needs (as opposed to embedding this info into expressions). E.g., your job submit file can now specify things like gpus_minimum_memory, gpus_minimum_runtime, gpus_minimum_capability, and gpus_maximum_capability.
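As a rough sketch of how these entries fit together, assuming a recent HTCondor version: the executable name and all threshold values below are illustrative, not from the post.

```
# Illustrative submit file using the newer first-class GPU knobs.
# train_model.sh and the numeric thresholds are made-up examples.
executable              = train_model.sh
request_gpus            = 1
gpus_minimum_memory     = 16384   # minimum GPU memory (MB, illustrative)
gpus_minimum_capability = 8.0     # e.g. require CUDA capability 8.0 or newer
gpus_maximum_capability = 9.0
queue
```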

2. GPUs and their properties will now be automatically discovered by default (no need to modify the configuration file on execution points).

3. GPUs that are not scheduled for use by a job are "hidden" from that job, even if the job clobbers/ignores environment variables like CUDA_VISIBLE_DEVICES.

4. Some helpful new commands, like "condor_submit -gpus", which shows servers with GPUs and their GPU properties.

5. Improved management of GPUs accessed via Docker jobs.

6. Easier to configure an Execution Point (i.e. worker node) to prefer to run GPU jobs if any are waiting, but to "backfill" with non-GPU jobs if there are no idle GPU jobs waiting to run.
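For context, before this built-in support the usual workaround was a static split of an Execution Point into two partitionable slots, one owning the GPUs and one without them. A rough config sketch (the percentages are illustrative, and this static split is not the new prefer-then-backfill mechanism itself):

```
# Rough EP config sketch: one partitionable slot with all GPUs,
# one with none. Resource fractions are illustrative.
use feature : GPUs

SLOT_TYPE_1               = GPUs=100%, CPUs=50%, Memory=50%
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1          = 1

SLOT_TYPE_2               = GPUs=0%, CPUs=50%, Memory=50%
SLOT_TYPE_2_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_2          = 1
```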

You can see a list of improvements at https://htcondor.org/htcondor/release-highlights/

Also, I am using partitionable slots for GPU, CPU, and RAM; can I use GPU RAM for that as well?


Not currently, but we have been thinking about this... Are you saying you would like HTCondor to pack GPU jobs onto a single GPU device until the provisioned GPU memory is exhausted? Right now, if you know the GPU workload of your pool, what you can do is configure Execution Points to run X jobs per GPU device.
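As a sketch of the "X jobs per GPU device" approach: condor_gpu_discovery has a -repeat option that advertises each physical device multiple times, so multiple jobs can be matched to one GPU (note that per-job GPU memory is not enforced in this scheme; the repeat factor below is illustrative):

```
# EP config sketch: advertise every physical GPU twice, so up to
# two jobs can run concurrently on each device. The factor of 2
# is an example value, not a recommendation.
GPU_DISCOVERY_EXTRA = $(GPU_DISCOVERY_EXTRA) -repeat 2
```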

regards,
Todd
-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences