
Re: [HTCondor-users] Strange behavior with GPU match making in 24.0.1



A STARTD that is older than 24.0 does not handle the new first-class
GPU submit commands like gpus_minimum_memory correctly. Jobs using one of those commands will match against existing slots, but the old startd will fail to create a new dynamic slot for them.

The fix is to upgrade your execute nodes. 
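If upgrading the execute nodes right away isn't an option, one possible interim workaround (a sketch, not verified against every pre-24.0 startd version) is to express the constraint with the older generic require_gpus submit command instead of the new first-class gpus_minimum_memory, so the job ad carries a literal threshold rather than a reference to the new GPUsMinMemory attribute:

# Hypothetical workaround: inline the memory threshold (6 GiB = 6144 MB)
# instead of using the 24.0-only gpus_minimum_memory command.
request_GPUs = 1
require_gpus = GlobalMemoryMb >= 6144
queue 1

This assumes the execute nodes are recent enough to advertise AvailableGPUs with per-GPU properties (i.e. they already support GPU property matching via require_gpus).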

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Friday, November 8, 2024 10:29 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Strange behavior with GPU match making in 24.0.1
 
Hello Experts,

$ condor_version
$CondorVersion: 24.0.1 2024-10-31 BuildID: 765651 PackageID: 24.0.1-1 GitSHA: 20d0a0bb $
$CondorPlatform: x86_64_AlmaLinux9 $

The job submit file contains:

request_GPUs = 1
gpus_minimum_memory = 6G
queue 1

The requirements expression is able to match the slot:

The Requirements expression for job 19.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    (countMatches(MY.RequireGPUs,TARGET.AvailableGPUs) >= RequestGPUs) && (TARGET.HasFileTransfer)

Job 19.000 defines the following attributes:

    DiskUsage = 37
    GPUsMinMemory = 6144
    RequestDisk = undefined (kb)
    RequestGPUs = 1
    RequestMemory = 2000 (mb)
    RequireGPUs = GlobalMemoryMb >= GPUsMinMemory

The Requirements expression for job 19.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          32  TARGET.Arch == "X86_64"
[1]          32  TARGET.OpSys == "LINUX"
[3]          32  TARGET.Disk >= RequestDisk
[5]          32  TARGET.Memory >= RequestMemory
[7]           1  countMatches(MY.RequireGPUs,TARGET.AvailableGPUs) >= RequestGPUs

No successful match recorded.
Last failed match: Fri Nov  8 11:17:49 2024

Reason for last match failure: no match found

The job stays in idle status until I submit a new job with the following submit file. As soon as I submit a job without gpus_minimum_memory, both jobs start running: the new one and the old one submitted with gpus_minimum_memory. Otherwise the job stays idle. It is reproducible every time.

request_GPUs = 1
queue 1

Partial output from the machine ClassAd; this machine has 8 GPUs.

$ condor_status -compact `hostname` -l | grep -i global
GPUs_GlobalMemoryMb = 95346
GPUs_GPU_xxx = [ Id = "GPU-xxx"; ClockMhz = 1785.0; Capability = 9.0; CoresPerCU = 128; DeviceName = "NVIDIA H100 NVL"; DeviceUuid = "xxx-xx-xxx-xxx"; ECCEnabled = true; ComputeUnits = 132; DriverVersion = 12.7; DevicePciBusId = "0000:C1:00.0"; GlobalMemoryMb = 95346; MaxSupportedVersion = 12070 ]

The behavior is reproducible with other properties like gpus_minimum_capability.


Thanks & Regards,
Vikrant Aggarwal