That is good to hear.
To fix your problem with CPU jobs running on slots with GPUs, you can add a clause to your START expression that is only used when the slot is dynamic and that compares the number of GPUs in the slot to RequestGPUs. It is sort of the inverse of your start_cpu_jobs
clause, which is only used when the slot is not dynamic.
gpu_jobs_on_gpu_slots = (GPUs ?: 0) == (TARGET.RequestGpus ?: 0)
START = $(START) && (PartitionableSlot ?: $(gpu_jobs_on_gpu_slots) )
Note that this will also prevent matching jobs that want 1 GPU on slots that have 2 GPUs.
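To see which job and slot combinations the clause accepts, here is a rough Python model of it. This is only a sketch, not HTCondor itself: undefined attributes are modeled as None, the slot and job dictionaries are hypothetical ads cut down to the attributes the clause reads, and it assumes dynamic slots leave PartitionableSlot undefined, which is what the clause relies on.

```python
def elvis(value, default):
    """Model of ClassAd `value ?: default`: the default is used only when value is undefined (None)."""
    return default if value is None else value


def gpu_clause(slot, job):
    """Model of: PartitionableSlot ?: ((GPUs ?: 0) == (TARGET.RequestGpus ?: 0))"""
    gpu_jobs_on_gpu_slots = elvis(slot.get("GPUs"), 0) == elvis(job.get("RequestGpus"), 0)
    # On a partitionable slot the attribute is defined (True), so the GPU test is skipped.
    return elvis(slot.get("PartitionableSlot"), gpu_jobs_on_gpu_slots)


# Hypothetical ads for illustration only.
pslot = {"PartitionableSlot": True, "GPUs": 1}
dyn_1gpu = {"DynamicSlot": True, "GPUs": 1}
dyn_2gpu = {"DynamicSlot": True, "GPUs": 2}
dyn_cpu = {"DynamicSlot": True, "GPUs": 0}
cpu_job = {}                  # RequestGpus undefined, treated as 0
gpu_job = {"RequestGpus": 1}

cases = [
    ("partitionable slot, cpu job", pslot, cpu_job),
    ("dynamic 1-GPU slot, cpu job", dyn_1gpu, cpu_job),
    ("dynamic 1-GPU slot, gpu job", dyn_1gpu, gpu_job),
    ("dynamic 2-GPU slot, gpu job", dyn_2gpu, gpu_job),
    ("dynamic 0-GPU slot, cpu job", dyn_cpu, cpu_job),
]
for label, slot, job in cases:
    print(f"{label}: {gpu_clause(slot, job)}")
```

In this model the CPU job is refused on the 1-GPU dynamic slot, and the 1-GPU job is refused on the 2-GPU dynamic slot because the comparison is an exact equality.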
-tj
From: K._Scott Rowe <krowe@xxxxxxxx>
Sent: Wednesday, August 27, 2025 4:50 PM
To: John M Knoeller <johnkn@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to reserve resources for GPU jobs {External}

Thanks. I think I have it working now.
Here is what I ended up using to reserve 257GB of space for GPU jobs:

start_cpu_jobs = (Memory - quantize(TARGET.RequestMemory, {128}) ) > (257*1024)
START = $(START) && ((TARGET.RequestGPUs ?: 0) || (DynamicSlot ?: $(start_cpu_jobs)))

There was one little problem. When I implemented your solution, I could run a gpu job, but when that job finished an idle cpu job would take the dynamic slot created by the gpu job. This prevented any other gpu jobs from running until all the cpu jobs finished. I solved this by setting

CLAIM_WORKLIFE = 0

Seems to work in testing. I will put it into production later this week.

Thanks again

On 8/26/25 17:25, John M Knoeller wrote:
> Memory should be the amount of memory on the slot for Static and dynamic slots.
> For Partitionable slots it is the amount of Memory that has not yet been moved to a dynamic slot
> under that Partitionable slot. (i.e. free memory)
>
> But there is a flaw in my second suggestion.
>
> We want the start_cpu_jobs test to apply only to the partitionable slot, and not to the dynamic slots
> created under it, otherwise the dynamic slots may not match the jobs we just created them for.
> This is probably what you are seeing.
>
> To add a test for the dynamic slot you can do it inside start_cpu_jobs
>
> start_cpu_jobs = ( DynamicSlot ?: (Memory - TARGET.RequestMemory) > 1024 )
>
> Or it might be better to do it outside start_cpu_jobs.
>
> start_cpu_jobs = ((Memory - TARGET.RequestMemory) > 1024)
> START = $(START) && ( DynamicSlot ?: $(start_cpu_jobs) )
>
> Written this way, start_cpu_jobs is not evaluated for dynamic slots, only for partitionable slots. It
> controls the creation of dynamic slots while looking at the free resources of the partitionable slot.
>
> And actually there is another refinement, since RequestMemory is rounded up to the next 128 when used
> you should really do this.
>
> start_cpu_jobs = (Memory - quantize(TARGET.RequestMemory, {128}) ) > 1024
>
> Note that the way DynamicSlot is used above, the START expression won't work if you are not using
> partitionable slots in your configuration.
>
> If you are using STATIC slots, you would be better off just refusing to match CPU jobs on the slots
> that have GPUs.
>
> -tj
>
> ------------------------------------------------------------------------
> *From:* K._Scott Rowe <krowe@xxxxxxxx>
> *Sent:* Tuesday, August 26, 2025 2:56 PM
> *To:* HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> *Cc:* John M Knoeller <johnkn@xxxxxxxxxxx>
> *Subject:* Re: [HTCondor-users] How to reserve resources for GPU jobs {External}
>
> Thanks. Your first suggestion that blocks all non-gpu jobs works. Your second suggestion to allow
> some non-gpu jobs doesn't work.
>
> Is the Memory variable in your example the amount of available memory on the node? Because it seems
> to act more like the amount of memory requested by the job. For example, if I add these two lines,
> simplified from your suggestion, to my config
>
> start_cpu_jobs = (Memory >= 1023)
> START = $(START) && $(start_cpu_jobs)
>
> and submit a non-gpu job asking for 1GB (request_memory = 1 G) of memory, the job runs. But if I set
>
> start_cpu_jobs = (Memory >= 1025)
> START = $(START) && $(start_cpu_jobs)
>
> and submit the same non-gpu job, it stays idle, even though "condor_q -better" tells me there is
> 1 machine able to run my job.
>
> Thanks
>
> I get just one return, when there are no jobs running
>
> On 8/25/25 16:37, John M Knoeller via HTCondor-users wrote:
> > If you want your machine that has GPUs to match only jobs that request GPUs, set
> >
> > START = (TARGET.RequestGPUs ?: 0) > 0
> >
> > This simplifies to
> >
> > START = (TARGET.RequestGPUs ?: 0)
> >
> > With the above START expression, only jobs that request at least 1 GPU will match.
> > That's not quite what you asked for, but it shows the way. You just need the START expression to
> > evaluate to false for cpu jobs while there is still memory and cpus available.
> >
> > I will show this using a temp variable to hold the CPU jobs expression.
> >
> > start_cpu_jobs = (Cpus - TARGET.RequestCpus) >= 1 && (Memory - TARGET.RequestMemory) >= (128+1024)
> > START = IfThenElse(TARGET.RequestGPUs ?: 0, true, $(start_cpu_jobs) )
> >
> > This simplifies to
> >
> > START = (TARGET.RequestGPUs ?: 0) || $(start_cpu_jobs)
> >
> > Note that if you already have a START expression that is not just TRUE, this should be
> >
> > START = $(START) && ( (TARGET.RequestGPUs ?: 0) || $(start_cpu_jobs) )
> >
> > -tj
> >
> > ------------------------------------------------------------------------
> > *From:* HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of K._Scott Rowe <krowe@xxxxxxxx>
> > *Sent:* Monday, August 25, 2025 4:30 PM
> > *To:* htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
> > *Subject:* [HTCondor-users] How to reserve resources for GPU jobs
> >
> > Hey there. Imagine I have an EP running HTCondor-23.0.17 with 24 cores, 512GB RAM, and one GPU.
> > There are many CPU-only jobs running on this EP for weeks at a time, and there are usually one or
> > two GPU jobs as well. The CPU-only jobs may take weeks to finish, so sadly a GPU job may have to
> > wait weeks to start. I would like GPU jobs to not have to wait so long.
> >
> > Is there a way I could reserve say 1 core and 128GB of RAM for GPU jobs, and only GPU jobs, on
> > this EP thus letting CPU-only jobs continue to run on the other 23 cores and 384GB of RAM?
> >
> > I have been trying to do this with static slots but have not figured out how to make a slot that
> > has the GPU as a resource and will NOT run CPU-only jobs.
> >
> > I should also mention that we don't use preemption and really don't want to use it as it doesn't
> > work well with our pipeline. I would also rather not ask our users to add a ClassAd to their
> > submit scripts (e.g. +IsGPUJob), but if that is the only way, then so be it.
> >
> > Thanks
> >
> > --
> > K. Scott Rowe -- Science Information Services
> > Science Operations Center, National Radio Astronomy Observatory
> > 1011 Lopezville Socorro, NM 87801
> > krowe@xxxxxxxx -- 1.575.835.7193 --
> > https://urldefense.com/v3/__http://www.nrao.edu__;!!Mak6IKo!IshyrPRFTwy-zul-FivGEH-AsRP62e2ZafRLF_z6yc9_EYrjmi_JJ2eWbBMvgyT5eEmI2GcxD7UOAn13$
> >
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> >
> > The archives can be found at:
> > https://www-auth.cs.wisc.edu/lists/htcondor-users/