
[HTCondor-users] Strange behavior with GPU match making in 24.0.1



Hello Experts,

$ condor_version
$CondorVersion: 24.0.1 2024-10-31 BuildID: 765651 PackageID: 24.0.1-1 GitSHA: 20d0a0bb $
$CondorPlatform: x86_64_AlmaLinux9 $


The job submit file contains:

request_GPUs = 1
gpus_minimum_memory = 6G
queue 1


The job's requirements are able to match a slot:

The Requirements expression for job 19.000 is

  (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
  (countMatches(MY.RequireGPUs,TARGET.AvailableGPUs) >= RequestGPUs) && (TARGET.HasFileTransfer)

Job 19.000 defines the following attributes:

  DiskUsage = 37
  GPUsMinMemory = 6144
  RequestDisk = undefined (kb)
  RequestGPUs = 1
  RequestMemory = 2000 (mb)
  RequireGPUs = GlobalMemoryMb >= GPUsMinMemory

The Requirements expression for job 19.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          32  TARGET.Arch == "X86_64"
[1]          32  TARGET.OpSys == "LINUX"
[3]          32  TARGET.Disk >= RequestDisk
[5]          32  TARGET.Memory >= RequestMemory
[7]           1  countMatches(MY.RequireGPUs,TARGET.AvailableGPUs) >= RequestGPUs

No successful match recorded.
Last failed match: Fri Nov  8 11:17:49 2024

Reason for last match failure: no match found


The job stays idle until I submit a new job with the submit file below. As soon as I submit that job without gpus_minimum_memory, both jobs start running: the new one and the old one that was submitted with gpus_minimum_memory. Otherwise the first job remains idle. This is reproducible every time.

request_GPUs = 1
queue 1

Partial output from the machine ClassAd (this machine has 8 GPUs):

$ condor_status -compact `hostname` -l | grep -i global
GPUs_GlobalMemoryMb = 95346
GPUs_GPU_xxx = [ Id = "GPU-xxx"; ClockMhz = 1785.0; Capability = 9.0; CoresPerCU = 128; DeviceName = "NVIDIA H100 NVL"; DeviceUuid = "xxx-xx-xxx-xxx"; ECCEnabled = true; ComputeUnits = 132; DriverVersion = 12.7; DevicePciBusId = "0000:C1:00.0"; GlobalMemoryMb = 95346; MaxSupportedVersion = 12070 ]
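As a sanity check on the numbers, the countMatches clause from the Requirements expression can be mimicked in a few lines of Python. This is only a sketch under assumptions, not HTCondor code: count_matches and require_gpus are hypothetical stand-ins for the ClassAd countMatches() function and the RequireGPUs expression, and it assumes all 8 GPUs are available and report the same GlobalMemoryMb as the one shown above.

```python
# Hypothetical stand-in for the ClassAd countMatches() function; not HTCondor code.
def count_matches(predicate, gpu_ads):
    """Count the GPU ads for which the predicate evaluates to True."""
    return sum(1 for ad in gpu_ads if predicate(ad) is True)

GPUS_MIN_MEMORY = 6144  # GPUsMinMemory from the job ad (6G expressed in MB)
REQUEST_GPUS = 1        # RequestGPUs from the job ad

# Assumption: all 8 GPUs are available and identical to the one in the machine ad.
available_gpus = [{"GlobalMemoryMb": 95346} for _ in range(8)]

def require_gpus(ad):
    # Mirrors RequireGPUs = GlobalMemoryMb >= GPUsMinMemory
    return ad["GlobalMemoryMb"] >= GPUS_MIN_MEMORY

matched = count_matches(require_gpus, available_gpus)
slot_matches = matched >= REQUEST_GPUS
print(matched, slot_matches)
```

Under these assumptions the clause is satisfied, which agrees with the analysis above reporting one matching slot.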

The behavior is also reproducible with other properties such as gpus_minimum_capability.


Thanks & Regards,
Vikrant Aggarwal