
[HTCondor-users] Fractional CPU resources possible?



Hi all,

first up, thank you for HTC26 last week!

For a dedicated project, we are trying to optimize throughput for a workload with mixed GPU and CPU parts on a pool still running v24.0. As the GPU jobs use hardly any CPU cycles, we tried setting `request_cpus = 0` (or a small fraction) so that CPU-only jobs can still match and run alongside them.

But on the EP (partitionable slots), this is translated into a dynamic slot requesting 1.0 CPUs, probably via the default

MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus,{1})

Requesting 0.1 or a similar small value instead and setting

MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(real(RequestCpus), {0.1})

did not really help. Either the jobs fail to match, with the error

06/15/26 06:05:37 slot2: Job 41053226.0 requesting resources: Cpus=0.000000, Memory=512, Disk=0.000001/1 ,GPUs=1.000000
06/15/26 06:05:37 slot2: Failed to parse attributes for request, aborting
06/15/26 06:05:37 slot2: State change: claiming protocol failed

because the request is reduced to 0.0 again, or, if I extend the list above (e.g. {0.1,1}), the jobs are raised to requesting 1.0 CPUs again.
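To make the rounding easier to follow, here is a small Python model of the ClassAd quantize() function. It is not HTCondor code: it assumes the semantics described in the ClassAd function documentation (for a list, return the first element that is greater than or equal to the value, otherwise round up to an integral multiple of the last element), and the name `quantize_model` is made up for this sketch. Under that assumption, {1} turns a request of 0 into 1, {0.1} turns 0.15 into 0.2, and {0.1, 1} turns 0.15 into 1, which is consistent with the upgrades to 1.0 CPUs described above. The model does not explain the Cpus=0.000000 in the log; that value would have to come from a different step on the EP, which is not modelled here.

```python
import math


def quantize_model(a, b):
    """Hypothetical Python model of ClassAd quantize(a, b).

    Assumed semantics (from the ClassAd function documentation):
    - b is a number: return the smallest integral multiple of b that is >= a.
    - b is a list: return the first element >= a; if there is none,
      round a up to an integral multiple of the last element.
    """
    if isinstance(b, (list, tuple)):
        for candidate in b:
            if candidate >= a:
                return candidate
        b = b[-1]  # no element was large enough: use multiples of the last one
    return math.ceil(a / b) * b


# Default: MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus, {1})
print("{1} with 0:", quantize_model(0, [1]))
# quantize(real(RequestCpus), {0.1}) with a request of 0.15
print("{0.1} with 0.15:", quantize_model(0.15, [0.1]))
# quantize(real(RequestCpus), {0.1, 1}) with a request of 0.15
print("{0.1, 1} with 0.15:", quantize_model(0.15, [0.1, 1]))
```

With a list, the first element that is large enough always wins, so appending 1 to the list reintroduces the whole-CPU request for any value above 0.1.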

We seem to hit https://github.com/htcondor/htcondor/blob/c63ec61b3e864b2345630fe094512bb5d18f7dec/src/condor_startd.V6/Resource.cpp#L4330

but something on the EP is altering the requested fractional resources.

The job ad on the schedd shows:

condor_q -bet 41053226.0 -reverse-analyze -machine slot1@g6631


-- Schedd: condorhub : <10.20.50.68:9618?...

-- Slot: slot1@g6631 : Analyzing matches for 1 Jobs in 1 autoclusters

The Requirements expression for this slot is

    START &&
    (WithinResourceLimits)

  START is
    true

  WithinResourceLimits is
    (MY.Cpus > 0 &&
      TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 &&
      TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 &&
      TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs is undefined ||
        MY.GPUs >= TARGET.RequestGPUs))

This slot defines the following attributes:

    Cpus = 6
    Disk = 1727076296
    GPUs = 3
    Memory = 89760

Job 41053226.0 has the following attributes:

    TARGET.RequestCpus = 0.15
    TARGET.RequestDisk = 27
    TARGET.RequestGPUs = 1
    TARGET.RequestMemory = 512

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[2]           1  TARGET.RequestCpus <= MY.Cpus
[6]           1  TARGET.RequestMemory <= MY.Memory
[10]          1  TARGET.RequestDisk <= MY.Disk
[13]          1  MY.GPUs >= TARGET.RequestGPUs

slot1@g6631: Run analysis summary of 1 jobs.
    1 (100.00 %) match both slot and job requirements.
    1 match the requirements of this slot.
    1 have job requirements that match this slot.
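The matchmaking side does accept the fractional request. Transcribing the WithinResourceLimits expression above into Python, using the slot and job attributes from the analysis, every condition holds, so the failure only shows up later in the claiming protocol on the EP. This is a hand translation for illustration, not HTCondor's ClassAd evaluator; the function name and modelling `undefined` as a missing dictionary key are choices made for this sketch.

```python
def within_resource_limits(slot, job):
    """Hand translation of the WithinResourceLimits expression above.

    A missing "RequestGPUs" key stands in for ClassAd `undefined`.
    """
    gpus_ok = ("RequestGPUs" not in job
               or slot["GPUs"] >= job["RequestGPUs"])
    return (slot["Cpus"] > 0 and job["RequestCpus"] <= slot["Cpus"]
            and slot["Memory"] > 0 and job["RequestMemory"] <= slot["Memory"]
            and slot["Disk"] > 0 and job["RequestDisk"] <= slot["Disk"]
            and gpus_ok)


# Attribute values copied from the condor_q analysis output above
slot = {"Cpus": 6, "Disk": 1727076296, "GPUs": 3, "Memory": 89760}
job = {"RequestCpus": 0.15, "RequestDisk": 27, "RequestGPUs": 1,
       "RequestMemory": 512}

print(within_resource_limits(slot, job))  # True
```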


I hope what I've written makes at least some sense and that there is a way to run the GPU jobs in parallel with CPU jobs.

Is there?

Cheers and thanks a lot in advance

Carsten
--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185
