
Re: [HTCondor-users] Strange behavior with GPU match making in 24.0.1



A STARTD that is older than 24.0 does not handle the new first-class
GPU submit commands like gpus_minimum_memory correctly. Jobs using one of those commands will match against existing slots, but the old startd will fail to create a new dynamic slot for them.

The fix is to upgrade your execute nodes. 
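If upgrading the execute nodes right away isn't an option, one possible interim workaround (a sketch, not verified against every pre-24.0 startd version) is to express the constraint with the older generic require_gpus submit command instead of the new first-class gpus_minimum_memory, so the job ad carries a literal threshold rather than a reference to the new GPUsMinMemory attribute:

# Hypothetical workaround: inline the memory threshold (6 GiB = 6144 MB)
# instead of using the 24.0-only gpus_minimum_memory command.
request_GPUs = 1
require_gpus = GlobalMemoryMb >= 6144
queue 1

This assumes the execute nodes are recent enough to advertise AvailableGPUs with per-GPU properties (i.e. they already support GPU property matching via require_gpus).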

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Friday, November 8, 2024 10:29 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Strange behavior with GPU match making in 24.0.1
 
Hello Experts,

$ condor_version
$CondorVersion: 24.0.1 2024-10-31 BuildID: 765651 PackageID: 24.0.1-1 GitSHA: 20d0a0bb $
$CondorPlatform: x86_64_AlmaLinux9 $

The job submit file contains:

request_GPUs = 1
gpus_minimum_memory = 6G
queue 1

The requirements expression is able to match the slot:

The Requirements expression for job 19.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    (countMatches(MY.RequireGPUs,TARGET.AvailableGPUs) >= RequestGPUs) && (TARGET.HasFileTransfer)

Job 19.000 defines the following attributes:

    DiskUsage = 37
    GPUsMinMemory = 6144
    RequestDisk = undefined (kb)
    RequestGPUs = 1
    RequestMemory = 2000 (mb)
    RequireGPUs = GlobalMemoryMb >= GPUsMinMemory

The Requirements expression for job 19.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          32  TARGET.Arch == "X86_64"
[1]          32  TARGET.OpSys == "LINUX"
[3]          32  TARGET.Disk >= RequestDisk
[5]          32  TARGET.Memory >= RequestMemory
[7]           1  countMatches(MY.RequireGPUs,TARGET.AvailableGPUs) >= RequestGPUs

No successful match recorded.
Last failed match: Fri Nov  8 11:17:49 2024

Reason for last match failure: no match found

The job stays in idle status until I submit a new job with the following submit file. As soon as I submit a job without gpus_minimum_memory, both jobs start running: the new one and the old one submitted with gpus_minimum_memory. Otherwise the job stays idle. It is reproducible every time.

request_GPUs = 1
queue 1

Partial output from the machine ClassAd; this machine has 8 GPUs.

$ condor_status -compact `hostname` -l | grep -i global
GPUs_GlobalMemoryMb = 95346
GPUs_GPU_xxx = [ Id = "GPU-xxx"; ClockMhz = 1785.0; Capability = 9.0; CoresPerCU = 128; DeviceName = "NVIDIA H100 NVL"; DeviceUuid = "xxx-xx-xxx-xxx"; ECCEnabled = true; ComputeUnits = 132; DriverVersion = 12.7; DevicePciBusId = "0000:C1:00.0"; GlobalMemoryMb = 95346; MaxSupportedVersion = 12070 ]

The behavior is reproducible with other properties like gpus_minimum_capability.


Thanks & Regards,
Vikrant Aggarwal