
Re: [HTCondor-users] Strange behavior with GPU match making in 24.0.1



Hello John and Anderson,

Thanks for the response.

In this case, the submit (schedd) and execute (startd) roles are assigned to a single box running 24.0.1.

The negotiator is running 9.0.17; do you think a negotiator that old could face challenges matchmaking with these requirements?
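For reference, a quick way to cross-check which daemons in the pool are still pre-24.0 is to query the daemon ads (a generic sketch using standard condor_status autoformat options; nothing here is specific to this pool):

$ condor_status -negotiator -af Name CondorVersion
$ condor_status -schedd -af Name CondorVersion
$ condor_status -startd -af Machine CondorVersion | sort -u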

Strangely, as mentioned earlier, as soon as a job without any gpus_minimum_memory is submitted, a queued job with gpus_minimum_memory also starts running.


Thanks & Regards,
Vikrant Aggarwal

On Fri, Nov 8, 2024 at 2:46 PM John M Knoeller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
A STARTD that is older than 24.0 does not handle the new first-class gpus submit commands like gpus_minimum_memory correctly: jobs will match to slots that already exist, but the startd will fail to create a new dynamic slot when the job uses one of those commands.

The fix is to upgrade your execute nodes.
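For reference, the way those first-class commands end up in the job ad can be inspected on the submit side (a sketch, using the job id from the quoted message below):

$ condor_q -l 19.0 | grep -iE 'RequestGPUs|GPUsMinMemory|RequireGPUs'

This should show the RequestGPUs, GPUsMinMemory, and RequireGPUs attributes that an older startd then mishandles when carving out a dynamic slot.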

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Friday, November 8, 2024 10:29 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Strange behavior with GPU match making in 24.0.1

Hello Experts,

$ condor_version
$CondorVersion: 24.0.1 2024-10-31 BuildID: 765651 PackageID: 24.0.1-1 GitSHA: 20d0a0bb $
$CondorPlatform: x86_64_AlmaLinux9 $

Job submit file contains:

request_GPUs = 1
gpus_minimum_memory = 6G
queue 1
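(For context, a complete minimal submit description would look something like the following; the executable and arguments are placeholders, and request_memory just mirrors the 2000 MB visible in the job ad below:)

universe = vanilla
executable = /bin/sleep
arguments = 600
request_memory = 2000
request_GPUs = 1
gpus_minimum_memory = 6G
queue 1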

The Requirements expression is able to match the slot.

The Requirements expression for job 19.000 is

  (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
  (countMatches(MY.RequireGPUs,TARGET.AvailableGPUs) >= RequestGPUs) && (TARGET.HasFileTransfer)

Job 19.000 defines the following attributes:

  DiskUsage = 37
  GPUsMinMemory = 6144
  RequestDisk = undefined (kb)
  RequestGPUs = 1
  RequestMemory = 2000 (mb)
  RequireGPUs = GlobalMemoryMb >= GPUsMinMemory

The Requirements expression for job 19.000 reduces to these conditions:

          Slots
Step    Matched  Condition
-----  --------  ---------
[0]          32  TARGET.Arch == "X86_64"
[1]          32  TARGET.OpSys == "LINUX"
[3]          32  TARGET.Disk >= RequestDisk
[5]          32  TARGET.Memory >= RequestMemory
[7]           1  countMatches(MY.RequireGPUs,TARGET.AvailableGPUs) >= RequestGPUs

No successful match recorded.
Last failed match: Fri Nov  8 11:17:49 2024

Reason for last match failure: no match found
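For what it's worth, the GPU memory constraint can also be evaluated directly against the slot ad with condor_status, bypassing the negotiator entirely (a sketch; replace the hostname, and it assumes countMatches accepts a bare constraint expression here the same way it does via MY.RequireGPUs in the generated Requirements):

$ condor_status <hostname> -af 'size(AvailableGPUs)' 'countMatches(GlobalMemoryMb >= 6144, AvailableGPUs)'

If that prints 8 8 on the 24.0.1 startd, the slot side of the match looks fine and the question moves to the negotiator side.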

The job stays in idle status until I submit a new job with the following submit file. As soon as I submit the job without gpus_minimum_memory, both jobs start running: the new one and the old one submitted with gpus_minimum_memory. Otherwise the job stays idle indefinitely. This is reproducible every time.

request_GPUs = 1
queue 1

Partial output from the machine ClassAd; this machine has 8 GPUs:

$ condor_status -compact `hostname` -l | grep -i global
GPUs_GlobalMemoryMb = 95346
GPUs_GPU_xxx = [ Id = "GPU-xxx"; ClockMhz = 1785.0; Capability = 9.0; CoresPerCU = 128; DeviceName = "NVIDIA H100 NVL"; DeviceUuid = "xxx-xx-xxx-xxx"; ECCEnabled = true; ComputeUnits = 132; DriverVersion = 12.7; DevicePciBusId = "0000:C1:00.0"; GlobalMemoryMb = 95346; MaxSupportedVersion = 12070 ]

The behavior is reproducible with other properties such as gpus_minimum_capability; see the variant below.
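For reference, a gpus_minimum_capability variant of the same submit file would look like this (8.0 is an arbitrary illustrative threshold, below the H100's reported capability of 9.0, not necessarily the exact value from the test):

request_GPUs = 1
gpus_minimum_capability = 8.0
queue 1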


Thanks & Regards,
Vikrant Aggarwal
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/