
Re: [HTCondor-users] Fractional CPU resources possible?



Hi Carsten!

> The only caveat is that some GPU jobs have vastly different memory needs but I don't see how to shift those dynamically between "GPU" and "CPU" slots.


You might be able to take some inspiration from the ~25yo Bologna Batch System white paper and create two entirely distinct sets of slots (duplicating all slots in each), where RAM is likewise 2x overcommitted between them, but where the activation of one set of slots disables the other (or otherwise affects the RAM advertised in the other), so only one set is ever matched at a time...

-Peter



> On Jun 22, 2026, at 9:09 AM, Carsten Aulbert via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
> 
> Hi Cole, Thomas, all,
> 
> just to close this thread from my side with what we tried and what seems to work:
> 
> On 6/15/26 16:29, Cole Bollig wrote:
>> HTCondor does not currently support fractional CPUs. One potential solution to this is you could lie about the number of CPUs available to the EP so that the CPU cores are actually over committed. I have attached some sample configuration I put together to assist another administrator using this concept.
> 
> After not really successful trials with "virtual CPU cores", i.e. trying to lie to condor like
> 
> NUM_CPUS = $(DETECTED_CPUS_LIMIT) * 10
> 
> and using job transforms on the submit hosts like
> 
> JOB_TRANSFORM_CpuFiddle @=end
> cpu_weight_factor = 9
> IF defined MY.LittleCpu
>  cpu_weight_factor = 1
> ENDIF
> EVALSET RequestCpus RequestCpus * $(cpu_weight_factor)
> @end
> 
> (while obviously falling prey to one of the two hardest CS problems ;-))
> 
> We might have been able to tweak this approach enough to make it workable, but since the right multipliers and weight factors would have to match the layout of the EP, we opted for Cole's suggested way and simply created 4 slots for each node[1]:
> 
> 06/18/26 08:00:08 slot1: New pSlot of type 1 allocated
> 06/18/26 08:00:08 slot1:        Cpus: 8.000000, Memory: 51577, Swap: 0.00%, Disk: 25.00%, GPUs: 8
> 06/18/26 08:00:08 slot2: New pSlot of type 2 allocated
> 06/18/26 08:00:08 slot2:        Cpus: 16.000000, Memory: 180519, Swap: 0.00%, Disk: 25.00%, GPUs: 0
> 06/18/26 08:00:08 slot3: New pSlot of type 3 allocated
> 06/18/26 08:00:08 slot3:        Cpus: 8.000000, Memory: 51577, Swap: 0.00%, Disk: 25.00%, GPUs: 8
> 06/18/26 08:00:08 slot4: New pSlot of type 4 allocated
> 06/18/26 08:00:08 slot4:        Cpus: 16.000000, Memory: 180519, Swap: 0.00%, Disk: 25.00%, GPUs: 0
> 
> This along with something like
> 
> SLOT_TYPE_1_START = (TARGET.RequestGpus isnt Undefined) && (TARGET.RequestGpus > 0)
> 
> for slots 1 and 3 seems to work nicely. The only caveat is that some GPU jobs have vastly different memory needs but I don't see how to shift those dynamically between "GPU" and "CPU" slots.
> 
> Anyway, yet another time condor has proved to have more than enough knobs for the job ;-)
> 
> Thanks!
> 
> Carsten
> 
> [1] As we expect quite a bit of GPU to CPU bandwidth needs, we logically divide each server into two halves to minimize traffic between the CPUs, i.e. CPU0 will only talk to GPUs local to it; well that plus NUMA ;-)
> 
> -- 
> Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
> Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> 
> The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/
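
For anyone finding this thread in the archives: the four-slot layout in the startd log quoted above could probably be expressed with something like the following EP configuration. This is only a sketch reconstructed from the log lines, not Carsten's actual config. The resource numbers are copied from the log, GPU discovery via the GPUs feature metaknob is an assumption, and pinning each GPU slot to the GPUs local to one CPU (footnote [1]) is not shown.

```
# Sketch only, untested; values taken from the startd log quoted above.
# Assumption: GPU discovery is enabled through the feature metaknob.
use FEATURE : GPUs

# Types 1 and 3: partitionable GPU slots, one per half of the server.
SLOT_TYPE_1 = cpus=8, gpus=8, memory=51577, disk=25%
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1_START = (TARGET.RequestGpus isnt Undefined) && (TARGET.RequestGpus > 0)

SLOT_TYPE_3 = cpus=8, gpus=8, memory=51577, disk=25%
SLOT_TYPE_3_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_3 = 1
SLOT_TYPE_3_START = (TARGET.RequestGpus isnt Undefined) && (TARGET.RequestGpus > 0)

# Types 2 and 4: partitionable CPU-only slots, one per half of the server.
SLOT_TYPE_2 = cpus=16, gpus=0, memory=180519, disk=25%
SLOT_TYPE_2_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_2 = 1

SLOT_TYPE_4 = cpus=16, gpus=0, memory=180519, disk=25%
SLOT_TYPE_4_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_4 = 1
```

Because memory is fixed per slot type here, RAM left idle in a GPU slot cannot be borrowed by a CPU slot (and vice versa), which is exactly the caveat mentioned above.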


--
Peter Couvares
LIGO Data Analysis Computing Manager, Caltech
Executive Director, OSG Consortium