[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] How to reserve resources for GPU jobs {External}



Thanks. I think I have it working now.


Here is what I ended up using to reserve 257 GB of memory for GPU jobs:


 start_cpu_jobs = (Memory - quantize(TARGET.RequestMemory, {128})) > (257*1024)
 START = $(START) && ((TARGET.RequestGPUs ?: 0) || (DynamicSlot ?: $(start_cpu_jobs)))

There was one little problem: when I implemented your solution, I could run a GPU job, but when that job finished, an idle CPU job would take the dynamic slot the GPU job had created. This prevented any other GPU jobs from running until all the CPU jobs finished. I solved this by setting


 CLAIM_WORKLIFE = 0


Seems to work in testing. I will put it into production later this week.


Thanks again


On 8/26/25 17:25, John M Knoeller wrote:
Memory should be the amount of memory on the slot for static and dynamic slots. For partitionable slots it is the amount of memory that has not yet been moved to a dynamic slot under that partitionable slot (i.e. free memory).

But there is a flaw in my second suggestion.

We want the start_cpu_jobs test to apply only to the partitionable slot, and not to the dynamic slots created under it, otherwise the dynamic slots may not match the jobs we just created them for.
This is probably what you are seeing.

To add a test for the dynamic slot you can do it inside start_cpu_jobs

 start_cpu_jobs = ( DynamicSlot ?: (Memory - TARGET.RequestMemory) > 1024 )

Or it might be better to do it outside start_cpu_jobs.

 start_cpu_jobs = ((Memory - TARGET.RequestMemory) > 1024)
 START = $(START) && ( DynamicSlot ?: $(start_cpu_jobs) )

Written this way, start_cpu_jobs is not evaluated for dynamic slots, only for partitionable slots. It controls the creation of dynamic slots while looking at the free resources of the partitionable slot.
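Combined with the GPU test from my earlier message, the whole thing would look something like this (the 1024 MB reserve is just the running example; substitute your own number):

 start_cpu_jobs = ((Memory - TARGET.RequestMemory) > 1024)
 START = $(START) && ( (TARGET.RequestGPUs ?: 0) || (DynamicSlot ?: $(start_cpu_jobs)) )

GPU jobs always match, dynamic slots accept the jobs they were carved out for, and the partitionable slot only creates new CPU slots while the reserve is intact.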

And actually there is another refinement: since RequestMemory is rounded up to the next multiple of 128 when used, you should really do this.

 start_cpu_jobs = (Memory - quantize(TARGET.RequestMemory, {128})) > 1024
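For example, a job that asks for 1000 MB is actually charged the next multiple of 128:

 quantize(1000, {128})  evaluates to 1024
 quantize(1024, {128})  stays 1024

so testing the quantized value keeps the reserve in step with what is really subtracted from the partitionable slot.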

Note that the way DynamicSlot is used above, the START expression won't work if you are not using partitionable slots in your configuration.

If you are using STATIC slots, you would be better off just refusing to match CPU jobs on the slots that have GPUs.
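For example, if the GPU is pinned to its own slot type, something like this would keep CPU-only jobs off it (a sketch, untested; the slot-type split and the SlotTypeID test are assumptions about your setup):

 # slot type 1 gets the GPU plus a core and some memory
 NUM_SLOTS_TYPE_1 = 1
 SLOT_TYPE_1 = cpus=1, memory=131072, gpus=1
 # slot type 2 gets the rest
 NUM_SLOTS_TYPE_2 = 1
 SLOT_TYPE_2 = cpus=auto, memory=auto
 # refuse CPU-only jobs on the GPU slot, accept anything elsewhere
 START = (SlotTypeID != 1) || ((TARGET.RequestGPUs ?: 0) > 0)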

-tj

------------------------------------------------------------------------
*From:* K._Scott Rowe <krowe@xxxxxxxx>
*Sent:* Tuesday, August 26, 2025 2:56 PM
*To:* HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
*Cc:* John M Knoeller <johnkn@xxxxxxxxxxx>
*Subject:* Re: [HTCondor-users] How to reserve resources for GPU jobs {External}

Thanks. Your first suggestion, which blocks all non-GPU jobs, works. Your
second suggestion, which allows some non-GPU jobs, doesn't.


Is the Memory variable in your example the amount of available memory on
the node? Because it seems to act more like the amount of memory
requested by the job. For example, if I add these two lines, simplified
from your suggestion, to my config


 start_cpu_jobs = (Memory >= 1023)
 START = $(START) && $(start_cpu_jobs)


and submit a non-GPU job asking for 1 GB of memory (request_memory = 1 G),
the job runs. But if I set


 start_cpu_jobs = (Memory >= 1025)
 START = $(START) && $(start_cpu_jobs)


and submit the same non-GPU job, it stays idle, even though "condor_q
-better" tells me there is 1 machine able to run my job.


Thanks



I get just one machine returned when there are no jobs running.

On 8/25/25 16:37, John M Knoeller via HTCondor-users wrote:
> If you want your machine that has GPUs to match only jobs that request
> GPUs, set
>
>
>    START = (TARGET.RequestGPUs ?: 0) > 0
>
> This simplifies to
>
>    START = (TARGET.RequestGPUs ?: 0)
>
> With the above START expression, only jobs that request at least 1 GPU
> will match. That's not quite what you asked for, but it shows the way:
> you just need the START expression to evaluate to false for CPU jobs
> while there is still memory and CPUs available.
>
> I will show this using a temp variable to hold the CPU jobs expression.
>
>    start_cpu_jobs = (Cpus - TARGET.RequestCpus) >= 1 && (Memory - TARGET.RequestMemory) >= (128+1024)
>    START = IfThenElse(TARGET.RequestGPUs ?: 0, true, $(start_cpu_jobs))
>
> This simplifies to
>
>    START = (TARGET.RequestGPUs ?: 0) || $(start_cpu_jobs)
>
> note that if you already have a START expression that is not just
> TRUE, this should be
>
> START = $(START) && ( (TARGET.RequestGPUs ?: 0) || $(start_cpu_jobs) )
>
> -tj
>
> ------------------------------------------------------------------------
> *From:*ÂHTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf
> of K._Scott Rowe <krowe@xxxxxxxx>
> *Sent:*ÂMonday, August 25, 2025 4:30 PM
> *To:*Âhtcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
> *Subject:*Â[HTCondor-users] How to reserve resources for GPU jobs
>
> Hey there. Imagine I have an EP running HTCondor-23.0.17 with 24 cores,
> 512GB RAM, and one GPU. There are many CPU-only jobs running on this EP
> for weeks at a time, and there are usually one or two GPU jobs as well.
> The CPU-only jobs may take weeks to finish, so sadly a GPU job may have
> to wait weeks to start. I would like GPU jobs to not have to wait so
> long.
>
> Is there a way I could reserve say 1 core and 128GB of RAM for GPU jobs,
> and only GPU jobs, on this EP thus letting CPU-only jobs continue to run
> on the other 23 cores and 384GB of RAM?
>
> I have been trying to do this with static slots but have not figured out
> how to make a slot that has the GPU as a resource and will NOT run
> CPU-only jobs.
>
> I should also mention that we don't use preemption and really don't want
> to use it, as it doesn't work well with our pipeline. I would also
> rather not ask our users to add a ClassAd attribute to their submit
> files (e.g. +IsGPUJob), but if that is the only way, then so be it.
>
> Thanks
>
> --
>
> K. Scott Rowe -- Science Information Services
> Science Operations Center, National Radio Astronomy Observatory
> 1011 Lopezville Socorro, NM 87801
> krowe@xxxxxxxx -- 1.575.835.7193 --
> http://www.nrao.edu
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> with a
> subject: Unsubscribe
>
> The archives can be found at:
> https://www-auth.cs.wisc.edu/lists/htcondor-users/
>