A STARTD older than 24.0 does not correctly handle the new first-class gpus submit commands such as gpus_minimum_memory. Jobs will match slots that already exist, but the STARTD will fail to create a new dynamic slot when the job uses one of those commands.
The fix is to upgrade your execute nodes.
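
To find execute nodes that still need the upgrade, one option is to compare each STARTD's CondorVersion string against 24.0. The sketch below is only an illustration: the helper name startd_is_older_than is hypothetical, the older version string is made up for the example, and it assumes the version strings have already been collected (for instance from the CondorVersion attribute of the machine ads).

```python
import re

def startd_is_older_than(version_string, minimum=(24, 0)):
    """Return True if a $CondorVersion string is older than `minimum`.

    Hypothetical helper for illustration; it is not part of HTCondor.
    """
    m = re.search(r"\$CondorVersion:\s+(\d+)\.(\d+)\.(\d+)", version_string)
    if m is None:
        raise ValueError("not a CondorVersion string: %r" % version_string)
    major, minor, _patch = (int(part) for part in m.groups())
    # Compare (major, minor) tuples so that 23.10 sorts below 24.0.
    return (major, minor) < minimum

# Version string as printed by condor_version in the report below.
current = ("$CondorVersion: 24.0.1 2024-10-31 BuildID: 765651 "
           "PackageID: 24.0.1-1 GitSHA: 20d0a0bb $")
# Made-up example of an older execute node (assumption, not from the thread).
older = "$CondorVersion: 23.0.12 2024-01-01 BuildID: 000000 $"

for label, text in (("current", current), ("older", older)):
    print(label, "needs upgrade:", startd_is_older_than(text))
```

Any node flagged by a check like this would be a candidate for the problem described above.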
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Friday, November 8, 2024 10:29 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Strange behavior with GPU match making in 24.0.1

Hello Experts,

$ condor_version
$CondorVersion: 24.0.1 2024-10-31 BuildID: 765651 PackageID: 24.0.1-1 GitSHA: 20d0a0bb $
$CondorPlatform: x86_64_AlmaLinux9 $

The job submit file contains:

request_GPUs = 1
gpus_minimum_memory = 6G
queue 1

The requirements are able to match the slot.

The Requirements expression for job 19.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (countMatches(MY.RequireGPUs,TARGET.AvailableGPUs) >= RequestGPUs) && (TARGET.HasFileTransfer)

Job 19.000 defines the following attributes:

    DiskUsage = 37
    GPUsMinMemory = 6144
    RequestDisk = undefined (kb)
    RequestGPUs = 1
    RequestMemory = 2000 (mb)
    RequireGPUs = GlobalMemoryMb >= GPUsMinMemory

The Requirements expression for job 19.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          32  TARGET.Arch == "X86_64"
[1]          32  TARGET.OpSys == "LINUX"
[3]          32  TARGET.Disk >= RequestDisk
[5]          32  TARGET.Memory >= RequestMemory
[7]           1  countMatches(MY.RequireGPUs,TARGET.AvailableGPUs) >= RequestGPUs

No successful match recorded.
Last failed match: Fri Nov 8 11:17:49 2024
Reason for last match failure: no match found

The job stays idle until I submit a new job with the following condition. As soon as I submit a job without gpus_minimum_memory, both jobs start running: the new one and the old one that was submitted with gpus_minimum_memory. Otherwise the job stays idle indefinitely. This is reproducible every time.

request_GPUs = 1
queue 1

Partial output from the machine ClassAd (this machine has 8 GPUs):

$ condor_status -compact `hostname` -l | grep -i global
GPUs_GlobalMemoryMb = 95346
GPUs_GPU_xxx = [ Id = "GPU-xxx"; ClockMhz = 1785.0; Capability = 9.0; CoresPerCU = 128; DeviceName = "NVIDIA H100 NVL"; DeviceUuid = "xxx-xx-xxx-xxx"; ECCEnabled = true; ComputeUnits = 132; DriverVersion = 12.7; DevicePciBusId = "0000:C1:00.0"; GlobalMemoryMb = 95346; MaxSupportedVersion = 12070 ]

The behavior is reproducible with other properties like gpus_minimum_capability.

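
For reference, the GPU clause of the Requirements expression, countMatches(MY.RequireGPUs,TARGET.AvailableGPUs) >= RequestGPUs, can be modelled roughly as below. This is a simplified sketch and not HTCondor's implementation: plain Python dicts stand in for the per-GPU ads, the function names are hypothetical, and it assumes all 8 GPUs report the same GlobalMemoryMb as the one shown above. It models only the match test, not dynamic slot creation.

```python
def count_matches(predicate, gpu_ads):
    """Count the GPU ads for which the job's RequireGPUs predicate holds."""
    return sum(1 for ad in gpu_ads if predicate(ad))

# Values taken from the job ad shown above.
gpus_min_memory = 6144   # GPUsMinMemory (gpus_minimum_memory = 6G, in MB)
request_gpus = 1         # RequestGPUs

def require_gpus(gpu_ad):
    # Models: RequireGPUs = GlobalMemoryMb >= GPUsMinMemory
    # A GPU ad without GlobalMemoryMb is treated as a non-match here.
    return "GlobalMemoryMb" in gpu_ad and gpu_ad["GlobalMemoryMb"] >= gpus_min_memory

# Assumption: 8 identical GPUs, each like the one in the machine ad.
available_gpus = [
    {"DeviceName": "NVIDIA H100 NVL", "GlobalMemoryMb": 95346}
    for _ in range(8)
]

matched = count_matches(require_gpus, available_gpus)
slot_matches = matched >= request_gpus
print("GPUs satisfying RequireGPUs:", matched)
print("GPU clause satisfied:", slot_matches)
```

Under these assumptions the clause is satisfied on its own, which is consistent with the analysis output reporting a matching slot for step [7].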
Thanks & Regards,
Vikrant Aggarwal