Sorry to interject, but I'd like to second the request for partitionable slots to treat GPU memory the same way they treat system memory. We'd very much like to be able to pack as many GPU-dependent jobs as possible onto the same machine. While we do manually route GPU-dependent jobs to different nodes that are configured to allow a certain number of concurrent jobs, that approach is not as flexible as true partitionable slots.
Thank you,
Jordan
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Todd Tannenbaum via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Friday, February 28, 2025 1:58 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Ram Ban <ramban046@xxxxxxxxx>
Cc: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] GPUs with htcondor

On 2/28/2025 11:48 AM, Ram Ban wrote:
Hi Raman,

A few new items that come to mind since v10:

1. Job submit files now have various first-class entries for specifying what type of GPU your job needs (as opposed to embedding this info into expressions). E.g. your job submit file can now specify things like gpus_minimum_memory, gpus_minimum_runtime, gpus_minimum_capability, and gpus_maximum_capability.
2. GPUs and their properties will now be automatically discovered by default (no need to modify the configuration file on execution points).
3. GPUs that are not scheduled for use by a job are "hidden" from that job, even if the job clobbers/ignores environment variables like CUDA_VISIBLE_DEVICES.
4. Some helpful new commands, like "condor_submit -gpus" which shows servers with GPUs and the GPU properties.
5. Improved management of GPUs accessed via Docker jobs.
6. Easier to configure an Execution Point (i.e. worker node) to prefer to run GPU jobs if any are waiting, but to "backfill" with non-GPU jobs if there are no idle GPU jobs waiting to run.

You can see a list of improvements at https://htcondor.org/htcondor/release-highlights/
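[For readers following along: a minimal submit-file sketch using the first-class GPU entries named above. The executable name and all numeric values here are hypothetical placeholders; gpus_minimum_memory is believed to be in megabytes, but check the condor_submit manual for your version.]

```
# Hypothetical job; only the gpus_* entries come from the release notes above
executable            = train_model.sh
request_gpus          = 1
gpus_minimum_memory   = 16000      # assumed units: MB of GPU memory
gpus_minimum_capability = 7.5      # e.g. require Volta-class or newer
gpus_maximum_capability = 9.0
queue
```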
Not currently, but we have been thinking about this.... Are you saying you would like HTCondor to pack as many GPU jobs onto a single GPU device until the provisioned GPU memory is exhausted?

Right now, if you know the GPU workload of your pool, what you can do is configure Execution Points to run X jobs per GPU device.

regards,
Todd

--
Todd Tannenbaum <tannenba@xxxxxxxxxxx>
University of Wisconsin-Madison
Center for High Throughput Computing
Department of Computer Sciences
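[Editor's note: one way to get the "X jobs per GPU device" setup Todd describes is via condor_gpu_discovery's -repeat/-divide options, which advertise each physical GPU multiple times. A sketch of the Execution Point configuration, assuming a recent HTCondor with default GPU discovery enabled; the value 4 is an arbitrary example:]

```
# Execution Point config sketch (assumption: HTCondor with "use feature : GPUs")
use feature : GPUs

# Advertise each physical GPU 4 times so up to 4 jobs can share one device.
# -divide also splits the reported GPU memory among the copies;
# -repeat advertises full memory each time. Pick one.
GPU_DISCOVERY_EXTRA = $(GPU_DISCOVERY_EXTRA) -divide 4
```

Note that this is static sharing: HTCondor schedules 4 jobs per device regardless of how much GPU memory each job actually uses, which is exactly the limitation the partitionable-GPU-memory request in this thread is about.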