Hi all, first up, thank you for HTC26 last week!

For a dedicated project, we are trying to optimize throughput for a workload with mixed GPU and CPU parts on a pool still running v24.0. As the GPU jobs use hardly any CPU cycles, we tried setting `request_cpus = 0` (or a small fraction) so that CPU-only jobs can still match and run.
But on the EP (partitionable slots), this is translated to a dynamic slot requesting 1.0 CPUs, probably via the default
MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus,{1})
Trying to request 0.1 or similar small values instead and setting
MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(real(RequestCpus), {0.1})
did not really help. Either jobs are not matched with the error
06/15/26 06:05:37 slot2: Job 41053226.0 requesting resources:
Cpus=0.000000, Memory=512, Disk=0.000001/1 ,GPUs=1.000000
06/15/26 06:05:37 slot2: Failed to parse attributes for request, aborting
06/15/26 06:05:37 slot2: State change: claiming protocol failed
as the request is downgraded again to 0.0, or, if I extend the above list to e.g. {0.1,1}, jobs are upgraded again to requesting 1.0 CPUs.
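To make the rounding explicit, here is a rough Python sketch of how I read the documented behaviour of the ClassAd quantize() function: with a list, it returns the first list entry that is at least as large as the value, and otherwise rounds up to a multiple of the last entry. This is only my approximation of the manual's description, not the startd's actual code, and the Python helper itself is purely illustrative.

```python
import math

def quantize(a, b):
    """Approximation of ClassAd quantize(a, b); b is a number or a list of numbers.

    Assumption: this follows the manual's description only and ignores
    ClassAd type promotion and error handling.
    """
    if isinstance(b, (list, tuple)):
        # List form: the first entry that is >= a wins.
        for step in b:
            if step >= a:
                return step
        # No entry is large enough: fall back to multiples of the last entry.
        b = b[-1]
    # Scalar form: smallest integral multiple of b that is >= a.
    return math.ceil(a / b) * b

print(quantize(0.15, [0.1]))     # rounded up to a multiple of 0.1
print(quantize(0.15, [0.1, 1]))  # first list entry >= 0.15 is 1
```

If this reading is right, the {0.1,1} list would indeed push a 0.15 request up to 1, which matches what we see, while {0.1} alone should round 0.15 up rather than down to 0.0.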
We seem to hit https://github.com/htcondor/htcondor/blob/c63ec61b3e864b2345630fe094512bb5d18f7dec/src/condor_startd.V6/Resource.cpp#L4330
but something is setting the requested fractional resources on the EP.
The jobad on the schedd is telling me:
condor_q -bet 41053226.0 -reverse-analyze -machine slot1@g6631
-- Schedd: condorhub : <10.20.50.68:9618?...
-- Slot: slot1@g6631 : Analyzing matches for 1 Jobs in 1 autoclusters
The Requirements expression for this slot is
START &&
(WithinResourceLimits)
START is
true
WithinResourceLimits is
(MY.Cpus > 0 &&
TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 &&
TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 &&
TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs is undefined ||
MY.GPUs >= TARGET.RequestGPUs))
This slot defines the following attributes:
Cpus = 6
Disk = 1727076296
GPUs = 3
Memory = 89760
Job 41053226.0 has the following attributes:
TARGET.RequestCpus = 0.15
TARGET.RequestDisk = 27
TARGET.RequestGPUs = 1
TARGET.RequestMemory = 512
The Requirements expression for this slot reduces to these conditions:
Clusters
Step Matched Condition
----- -------- ---------
[2] 1 TARGET.RequestCpus <= MY.Cpus
[6] 1 TARGET.RequestMemory <= MY.Memory
[10] 1 TARGET.RequestDisk <= MY.Disk
[13] 1 MY.GPUs >= TARGET.RequestGPUs
slot1@g6631: Run analysis summary of 1 jobs.
1 (100.00 %) match both slot and job requirements.
1 match the requirements of this slot.
1 have job requirements that match this slot.
I do hope what I've written makes at least some sense and that there is a
way to get the GPU jobs to run in parallel with CPU jobs.
Is there?

Cheers and thanks a lot in advance

Carsten

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185