Dear all,

We are currently adding some ARM machines (condor 10.0.9) to our cluster (condor 9.0.17). We noticed strange behavior when we submit ARM jobs: they fill the machine only up to ~50% of the available cores, even though the partitionable slot on that machine has plenty of resources to accept new jobs.

After some troubleshooting we found what seems to be the cause: idle jobs don't match the WithinResourceLimits expression. Running a reverse check for an idle job against the slot, we get the following message:

$ sudo condor_q -n ce02-htc 13177276 -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxx

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:9618?...

-- Slot: slot1@xxxxxxxxxxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters

The Requirements expression for this slot is

  START && (WithinResourceLimits)

  START is
    (StartJobs is true) && (TotalLoadAvg < 85) && TARGET.WantARM is true

  WithinResourceLimits is
    (ifThenElse(TARGET._cp_orig_RequestCpus isnt undefined,TARGET.RequestCpus <= MY.Cpus,MY.ConsumptionCpus <= MY.Cpus) &&
     ifThenElse(TARGET._cp_orig_RequestDisk isnt undefined,TARGET.RequestDisk <= MY.Disk,MY.ConsumptionDisk <= MY.Disk) &&
     ifThenElse(TARGET._cp_orig_RequestMemory isnt undefined,TARGET.RequestMemory <= MY.Memory,MY.ConsumptionMemory <= MY.Memory))

This slot defines the following attributes:

  ConsumptionCpus = TARGET.RequestCpus
  ConsumptionDisk = quantize(target.RequestDisk,{ 1024 })
  ConsumptionMemory = quantize(target.RequestMemory,{ 128 })
  Cpus = 120
  Disk = 2020164004
  Memory = 19760
  StartJobs = true && ( !t1_overheat) && (t1_mc_grace)
  t1_mc_grace = ((TARGET.RequestCpus > 1) || ((TARGET.RequestCpus == 1) && !(MC_GRACE ?: false)))
  t1_overheat = ((t1_Ambient_T ?: 20) > 30 || ((t1_Inlet_T ?: 40) > 50) || max({ t1_CPU1_T ?: 40,t1_CPU2_T ?: 40 }) > 85) ?: false
  TotalLoadAvg = 0.75

Job 13177276.0 has the following attributes:

  TARGET.RequestCpus = 8
  TARGET.RequestDisk = 100
  TARGET.RequestMemory = 32000
  TARGET.WantARM = true

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[0]           1  StartJobs is true
[3]           1  TARGET.WantARM is true
[5]           0  WithinResourceLimits

slot1@xxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    0 (0.00 %) match both slot and job requirements.
    0 match the requirements of this slot.
    1 have job requirements that match this slot.

Checking the expression against the job and slot attributes, we don't understand why it evaluates to false. Do you have any advice?

Cheers,
Alessandro
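
As a sanity check, the three comparisons inside WithinResourceLimits can be replayed by hand with the values printed in the analysis above. What follows is a minimal Python sketch, not HTCondor code. The `slot` and `job` dictionaries and the `quantize` helper are hypothetical stand-ins introduced only for illustration. The helper assumes that ClassAd `quantize()` with a single-element list rounds up to the next multiple of that element. Both branches of each ifThenElse are evaluated, since the output does not show whether the `_cp_orig_Request*` attributes are defined.

```python
import math

# Values copied from the condor_q -reverse output above.
slot = {"Cpus": 120, "Disk": 2020164004, "Memory": 19760}
job = {"Cpus": 8, "Disk": 100, "Memory": 32000}  # the job's Request* attributes


def quantize(value, step):
    """Assumed stand-in for ClassAd quantize(value, {step}):
    round value up to the next multiple of step."""
    return math.ceil(value / step) * step


# Consumption* attributes as defined by the slot.
consumption = {
    "Cpus": job["Cpus"],
    "Disk": quantize(job["Disk"], 1024),
    "Memory": quantize(job["Memory"], 128),
}

# For each resource, evaluate both possible ifThenElse branches.
checks = {
    name: {
        "request_branch": job[name] <= slot[name],
        "consumption_branch": consumption[name] <= slot[name],
    }
    for name in slot
}

for name, result in checks.items():
    print(name, job[name], consumption[name], slot[name], result)
```

Taken at face value, the Cpus and Disk comparisons hold on either branch, while the Memory comparison fails on both: the job asks for 32000 and the slot advertises Memory = 19760. Whether 19760 is the memory left over after jobs already carved out of the partitionable slot is not visible in this output, so that remains an assumption to verify on the machine.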