
[HTCondor-users] possible to round dslot size up in condor_start*?



Hi,

One of our users has jobs with hard-to-predict memory requirements and, upon job failure, iteratively increases the memory request. However, dslot creation seems to fail when the memory request has a non-integer part (HTCondor version 25.0.5-1+deb13, excerpt from StartLog):

01/14/26 07:17:24 bind slot DevIds tag=GPUs contraint=
01/14/26 07:17:24 slot1_1: New dSlot of type 1 allocated
01/14/26 07:17:24 slot1_1: Cpus: 1.00, Memory: 6941, Swap: 0.00%, Disk: 0.00%, GPUs: 0
01/14/26 07:17:24 slot1_1: Slot Requirements not satisfied. Analysis:
Step   Matched  Condition
----- --------- ---------
[1]           1  KillableJob is true
[10]          0  TARGET.RequestMemory <= MY.Memory
[17]          1  MY.GPUs >= TARGET.RequestGPUs
--------------------------
  Slot slot1_1 has the following attributes:
    Start = true &&
        (KillableJob is true || RequestGPUs > 0)
    WithinResourceLimits = (MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus &&
MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk &&
      (TARGET.RequestGPUs is undefined || MY.GPUs >= TARGET.RequestGPUs))
    Cpus = 1
    Disk = 2850
    GPUs = 0
    Memory = 6941
    Requirements = Start && (WithinResourceLimits)
  Job 30691.0 has the following attributes:
    TARGET.KillableJob = true
    TARGET.RequestCpus = 1
    TARGET.RequestDisk = 1024
    TARGET.RequestGPUs = 0
    TARGET.RequestMemory = 6941.03
01/14/26 07:17:24 slot1_1: Request to create dslot resource refused.
01/14/26 07:17:24 slot1_1: Slot slot1_1 no longer needed, deleting
01/14/26 07:17:24 slot1: State change: claiming protocol failed
01/14/26 07:17:24 slot1: Changing state: Unclaimed -> Owner
01/14/26 07:17:24 slot1: State change: IS_OWNER is false
01/14/26 07:17:24 slot1: Changing state: Owner -> Unclaimed

Obviously, I'll ask the user(s) to apply a floor function to be safe, but would it be possible to simply use ceil() within Condor to mitigate this as well? Or is there a knob to change this within our configuration which I failed to find?
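
To make the arithmetic explicit, here is a minimal Python sketch of the comparison from the log above. The function name slot_fits is purely illustrative and not part of HTCondor; the only thing taken from the log is that the dslot was created with Memory = 6941 for a request of 6941.03, and the rounded-up slot size is what I am asking about, not what the startd currently does.

```python
import math


def slot_fits(request_memory_mb: float, slot_memory_mb: int) -> bool:
    """Mirror the failing check from the log: TARGET.RequestMemory <= MY.Memory."""
    return request_memory_mb <= slot_memory_mb


requested = 6941.03   # TARGET.RequestMemory from job 30691.0
slot_memory = 6941    # MY.Memory of slot1_1 as reported in the StartLog

# Case seen in the log: the fractional request no longer fits the dslot.
fits_as_logged = slot_fits(requested, slot_memory)

# User-side workaround: floor the request before submitting.
fits_with_floor = slot_fits(math.floor(requested), slot_memory)

# What is asked for here (assumption, not current behaviour): size the
# dslot with ceil() so that the unmodified request still fits.
fits_with_ceil_slot = slot_fits(requested, math.ceil(requested))

print(fits_as_logged, fits_with_floor, fits_with_ceil_slot)
```

With floor() on the user side the job may get up to 1 MB less than computed, whereas rounding the slot up never hands out less than requested, which is why I would prefer the latter.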

Cheers

Carsten

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185
