
Re: [HTCondor-users] possible to round dslot size up in condor_start*?



Hello Carsten,

This problem was recently discovered and the fix for it is in the release scheduled for the end of the month.

https://opensciencegrid.atlassian.net/browse/HTCONDOR-3423

...Tim

On 1/14/26 01:29, Carsten Aulbert wrote:
Hi,

one of our users has jobs with hard-to-predict memory requirements and, upon job failure, iteratively increases the memory request. But dslot generation seems to fail when there is a non-integer part in the memory request (condor version 25.0.5-1+deb13, excerpt from StartLog):

01/14/26 07:17:24 bind slot DevIds tag=GPUs contraint=
01/14/26 07:17:24 slot1_1: New dSlot of type 1 allocated
01/14/26 07:17:24 slot1_1:      Cpus: 1.00, Memory: 6941, Swap: 0.00%, Disk: 0.00%, GPUs: 0
01/14/26 07:17:24 slot1_1: Slot Requirements not satisfied. Analysis:
Step   Matched   Condition
-----  --------  ---------
[1]           1  KillableJob is true
[10]          0  TARGET.RequestMemory <= MY.Memory
[17]          1  MY.GPUs >= TARGET.RequestGPUs
--------------------------
 Slot slot1_1 has the following attributes:
    Start = true &&
        (KillableJob is true || RequestGPUs > 0)
    WithinResourceLimits = (MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus &&
      MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk &&
      (TARGET.RequestGPUs is undefined || MY.GPUs >= TARGET.RequestGPUs))
    Cpus = 1
    Disk = 2850
    GPUs = 0
    Memory = 6941
    Requirements = Start && (WithinResourceLimits)
 Job 30691.0 has the following attributes:
    TARGET.KillableJob = true
    TARGET.RequestCpus = 1
    TARGET.RequestDisk = 1024
    TARGET.RequestGPUs = 0
    TARGET.RequestMemory = 6941.03
01/14/26 07:17:24 slot1_1: Request to create dslot resource refused.
01/14/26 07:17:24 slot1_1: Slot slot1_1 no longer needed, deleting
01/14/26 07:17:24 slot1: State change: claiming protocol failed
01/14/26 07:17:24 slot1: Changing state: Unclaimed -> Owner
01/14/26 07:17:24 slot1: State change: IS_OWNER is false
01/14/26 07:17:24 slot1: Changing state: Owner -> Unclaimed

Obviously, I'll ask the user(s) to apply a floor function to be safe, but would it be possible to simply use ceil() within Condor to mitigate this as well? Or is there a knob to change this within our configuration which I failed to find?
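
To make the arithmetic concrete, here is a minimal Python sketch of the comparison that fails in the log above. It only mirrors the expression TARGET.RequestMemory <= MY.Memory with the logged values; the function name is made up for illustration and this is not how the startd is implemented.

```python
import math


def within_memory_limit(request_memory_mb: float, slot_memory_mb: int) -> bool:
    """Hypothetical stand-in for the check TARGET.RequestMemory <= MY.Memory."""
    return request_memory_mb <= slot_memory_mb


# Values taken from the StartLog excerpt: the dslot got Memory = 6941,
# while the job asked for RequestMemory = 6941.03.
slot_memory_mb = 6941
request_memory_mb = 6941.03

# The fractional part alone makes the request exceed the integer slot size.
fractional_ok = within_memory_limit(request_memory_mb, slot_memory_mb)

# Rounding on the submit side yields an integer request. Assumption: a dslot
# created for an integer request gets exactly that many MB.
floored_mb = math.floor(request_memory_mb)
ceiled_mb = math.ceil(request_memory_mb)
floored_ok = within_memory_limit(floored_mb, floored_mb)
ceiled_ok = within_memory_limit(ceiled_mb, ceiled_mb)

print(fractional_ok, floored_mb, floored_ok, ceiled_mb, ceiled_ok)
```

Either rounding direction gives an integer request; ceil() has the advantage of never asking for less memory than the user computed.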

Cheers

Carsten


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/

--
Tim Theisen (he, him, his)
Release Manager
Center for High Throughput Computing
University of Wisconsin - Madison
3695 Morgridge Hall
1205 University Ave
Madison, WI 53706