Hi Cole, Thomas, all,

just to close this thread from my side with what we tried and what seems to work:
On 6/15/26 16:29, Cole Bollig wrote:
> HTCondor does not currently support fractional CPUs. One potential solution to this is you could lie about the number of CPUs available to the EP so that the CPU cores are actually over committed. I have attached some sample configuration I put together to assist another administrator using this concept.
Our trials with "virtual CPU cores" were not really successful, i.e. trying to lie to condor like

   NUM_CPUS = $(DETECTED_CPUS_LIMIT) * 10

and using job transforms on the submit hosts like

   JOB_TRANSFORM_CpuFiddle @=end
      cpu_weight_factor = 9
      IF defined MY.LittleCpu
         cpu_weight_factor = 1
      ENDIF
      EVALSET RequestCpus RequestCpus * $(cpu_weight_factor)
   @end

(while obviously falling prey to one of the two hardest CS problems ;-))

We may have been able to tweak this approach enough to make it workable, but rather than hunting for the right multipliers and weight factors, which would have to match the layout of the EP, we opted for Cole's suggested way and simply created 4 slots for each node [1]:
06/18/26 08:00:08 slot1: New pSlot of type 1 allocated
06/18/26 08:00:08 slot1: Cpus: 8.000000, Memory: 51577, Swap: 0.00%, Disk: 25.00%, GPUs: 8
06/18/26 08:00:08 slot2: New pSlot of type 2 allocated
06/18/26 08:00:08 slot2: Cpus: 16.000000, Memory: 180519, Swap: 0.00%, Disk: 25.00%, GPUs: 0
06/18/26 08:00:08 slot3: New pSlot of type 3 allocated
06/18/26 08:00:08 slot3: Cpus: 8.000000, Memory: 51577, Swap: 0.00%, Disk: 25.00%, GPUs: 8
06/18/26 08:00:08 slot4: New pSlot of type 4 allocated
06/18/26 08:00:08 slot4: Cpus: 16.000000, Memory: 180519, Swap: 0.00%, Disk: 25.00%, GPUs: 0
This along with something like

   SLOT_TYPE_1_START = (TARGET.RequestGpus isnt Undefined) && (TARGET.RequestGpus > 0)

for slots 1 and 3 seems to work nicely. The only caveat is that some GPU jobs have vastly different memory needs, and I don't see how to shift memory dynamically between the "GPU" and "CPU" slots.
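
For anyone finding this thread later, a minimal sketch of an EP configuration that would yield four partitionable slots like the ones in the log above. The CPU, memory, disk and GPU values are read back from those log lines and are not copied from the real configuration; the use of "use feature:GPUs" for GPU discovery is an assumption, and nothing here pins a slot to the GPUs local to one CPU socket.

```
# Sketch only: values taken from the startd log lines, not from the real config.
# Assumption: GPUs are discovered via the standard metaknob.
use feature:GPUs

# "GPU" slot for the first half of the node
SLOT_TYPE_1 = cpus=8, memory=51577, disk=25%, gpus=8
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

# "CPU" slot for the first half of the node
SLOT_TYPE_2 = cpus=16, memory=180519, disk=25%, gpus=0
SLOT_TYPE_2_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_2 = 1

# Same pair again for the second half of the node
SLOT_TYPE_3 = cpus=8, memory=51577, disk=25%, gpus=8
SLOT_TYPE_3_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_3 = 1

SLOT_TYPE_4 = cpus=16, memory=180519, disk=25%, gpus=0
SLOT_TYPE_4_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_4 = 1

# Only jobs that actually request GPUs may start on the GPU slots
SLOT_TYPE_1_START = (TARGET.RequestGpus isnt Undefined) && (TARGET.RequestGpus > 0)
SLOT_TYPE_3_START = $(SLOT_TYPE_1_START)
```

With a layout like this, a job without request_gpus in its submit file can only match slots 2 and 4, while GPU jobs remain free to match any slot unless the CPU slots get a START expression of their own.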
Anyway, yet another time condor has proved to have more than enough knobs for the job ;-)
Thanks!

Carsten

[1] As we expect substantial GPU-to-CPU bandwidth needs, we logically divide each server into two halves to minimize traffic between the CPUs, i.e. CPU0 will only talk to GPUs local to it; well, that plus NUMA ;-)
-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics, Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185