Re: [HTCondor-users] GPU memory usage calculation

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Date: Wed, 9 Oct 2024 15:00:32 +0000

From: John M Knoeller <johnkn@xxxxxxxxxxx>

Subject: Re: [HTCondor-users] GPU memory usage calculation

All of the GPU usage numbers are approximate. This is because we have to get the usage by polling the NVIDIA driver for information about each GPU device. The driver does not provide usage by process, so we take the usage by device and assign it to the slot that the GPU is currently assigned to.

The polling interval is 10 seconds, so a few seconds of usage that a running job causes on a device can be attributed to the partitionable slot rather than to the dynamic slot.

Also, for some GPUs, initializing the monitoring results in GPU usage, which would in most cases be attributed to the partitionable slot.
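To make the attribution concrete, here is a minimal sketch of the idea described above: each polled per-device sample is charged to whichever slot holds the GPU at poll time. This is not HTCondor's actual code. The function name, the data layout, and the timeline (including the assumption that the monitor sees the hand-off to the dynamic slot a few seconds after the job starts) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

POLL_INTERVAL = 10  # seconds, as stated in the reply


def attribute_usage(samples, assignments):
    """Charge each device-level sample to the slot that holds the GPU at poll time.

    samples: list of (time, device, usage) tuples, one per poll.
    assignments: dict mapping device -> list of (start_time, slot),
                 sorted by start_time (hypothetical bookkeeping).
    """
    totals = defaultdict(float)
    for t, device, usage in samples:
        owner = None
        # The last assignment that started at or before the poll time wins.
        for start, slot in assignments[device]:
            if start <= t:
                owner = slot
        totals[owner] += usage
    return dict(totals)


# Hypothetical timeline: the job starts using GPU-0 around t=7, but the
# device is only recorded as belonging to dynamic slot4_1 from t=14.
assignments = {"GPU-0": [(0, "slot4"), (14, "slot4_1")]}
samples = [
    (0 * POLL_INTERVAL, "GPU-0", 0.0),
    (1 * POLL_INTERVAL, "GPU-0", 0.9),  # job activity, still charged to slot4
    (2 * POLL_INTERVAL, "GPU-0", 0.8),  # now charged to slot4_1
]
result = attribute_usage(samples, assignments)
print(result)
```

In this made-up timeline the poll at t=10 lands before the recorded hand-off, so the partitionable slot ends up with non-zero usage even though it never runs a job itself.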

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Monday, October 7, 2024 5:12 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] GPU memory usage calculation

Hello Experts,

Need some help to understand how GPUsMemoryUsage and GPUsAverageUsage are calculated.

I have a machine with 10 GPUs partitioned into multiple dynamic slots. Two running jobs on this machine are using 3 GPUs from slot4. I understand that metrics starting with GPUs* are advertised in the job definition because they are per-job metrics, while DeviceGPUs* metrics are not aware of the job.

# condor_status `hostname` -af:h Name gpus DeviceGPUsAverageUsage GPUsAverageUsage DeviceGPUsMemoryPeakUsage GPUsMemoryUsage

Name                                     gpus DeviceGPUsAverageUsage GPUsAverageUsage      DeviceGPUsMemoryPeakUsage GPUsMemoryUsage
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   4    0.0                    undefined             484.0                     484.0
slot2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   1    0.02436906948889096    undefined             37997.0                   37997.0
slot3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   1    0.2557467106519156     undefined             95289.0                   95289.0
slot4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   0    0.6420998963701204     1.165656256588941     95695.0                   484.0
slot4_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 2    0.4771762084666735     0.5333874526797767    90981.0                   89675.0
slot4_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 1    0.164923687903447      0.0                   95695.0                   484.0
slot5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   1    0.02649657882417479    undefined             90909.0                   90909.0

Version: 9.0.17

I have read about the improvements in newer versions, but I am not sure whether they are related to my query.

Questions:

- Why is the value of GPUsAverageUsage 1.16 for slot4, when slot4 itself doesn't run any job? It's also not a combination of GPUsAverageUsage on slot4_1 and slot4_2.

- Why is GPUsMemoryUsage equal to DeviceGPUsMemoryPeakUsage for the slots that are not in use? Could it be undefined, like GPUsAverageUsage?

- What is UptimeGPUsSecondsAverageUsage? I couldn't find any information about this parameter.
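For the first question, the slot4 figures can be checked directly against the condor_status output. The sketch below only does arithmetic on the values shown there; the variable names are hypothetical, and the result is an observation about this one snapshot, not a statement about how HTCondor computes either attribute.

```python
import math

# Values copied from the slot4 rows of the condor_status output.
device_avg = {
    "slot4": 0.6420998963701204,
    "slot4_1": 0.4771762084666735,
    "slot4_2": 0.164923687903447,
}
job_avg = {
    "slot4": 1.165656256588941,
    "slot4_1": 0.5333874526797767,
    "slot4_2": 0.0,
}

# Sum the two dynamic slots and compare with the partitionable slot.
device_children = device_avg["slot4_1"] + device_avg["slot4_2"]
job_children = job_avg["slot4_1"] + job_avg["slot4_2"]

device_matches = math.isclose(device_children, device_avg["slot4"], rel_tol=1e-9)
job_matches = math.isclose(job_children, job_avg["slot4"], rel_tol=1e-9)

print("DeviceGPUsAverageUsage sums to slot4:", device_matches)
print("GPUsAverageUsage sums to slot4:", job_matches)
```

In this snapshot, DeviceGPUsAverageUsage on slot4 equals the sum of slot4_1 and slot4_2 to within floating-point rounding, while GPUsAverageUsage on slot4 does not, which matches what the first question describes.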

Thanks & Regards,

Vikrant Aggarwal
