All of the GPU usage numbers are approximate. This is because we have to get the usage by polling the NVIDIA driver
for information on each GPU device. The driver does not provide usage by process, so we take the usage by device and assign it to the slot that the GPU is currently assigned to.
The polling interval is 10 seconds, so a few seconds of usage on a device caused by a running job can be attributed to the partitionable slot rather than to the dynamic slot.
Also, for some GPUs, initializing the monitoring itself results in GPU usage, which in most cases would be attributed to the partitionable slot.
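
As a rough illustration of the attribution described above (a toy model, not HTCondor's actual monitoring code), the sketch below credits each poll's per-device utilization to whichever slot the device is mapped to at poll time. The function name, the tuple layout, and the sample values are all hypothetical; only the 10-second interval comes from the explanation above.

```python
from collections import defaultdict

POLL_INTERVAL_SECONDS = 10  # polling interval stated above


def attribute_usage(polls):
    """Credit per-device utilization to the slot that owns the device at poll time.

    polls: iterable of (device, owner_slot, utilization) tuples, one per poll,
    where utilization is a fraction between 0.0 and 1.0. This layout is a
    hypothetical simplification, not an HTCondor data structure.
    Returns GPU-seconds of usage per slot.
    """
    totals = defaultdict(float)
    for _device, slot, utilization in polls:
        # The driver reports usage per device, not per process, so the whole
        # interval is charged to the current owner of the device.
        totals[slot] += utilization * POLL_INTERVAL_SECONDS
    return dict(totals)


# Hypothetical sequence: the job is already running on GPU-0 at the first
# poll, but the device is still mapped to the partitionable slot.
polls = [
    ("GPU-0", "slot4", 0.5),    # charged to the partitionable slot
    ("GPU-0", "slot4_1", 1.0),  # charged to the dynamic slot
    ("GPU-0", "slot4_1", 1.0),
]
usage = attribute_usage(polls)
print(usage)
```

In this toy run the partitionable slot ends up with nonzero usage even though it never ran a job, which is the kind of effect described above.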
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Monday, October 7, 2024 5:12 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] GPU memory usage calculation

Hello Experts,
Need some help to understand how GPUsMemoryUsage and GPUsAverageUsage are calculated.
I have a machine with 10 GPUs partitioned into multiple dynamic slots. Two running jobs on this machine are using 3 GPUs from slot4. I understand that metrics starting with GPUs* are advertised in the job definition because they are per-job metrics,
while DeviceGPUs* metrics are not aware of the job.
# condor_status `hostname` -af:h Name gpus DeviceGPUsAverageUsage GPUsAverageUsage DeviceGPUsMemoryPeakUsage GPUsMemoryUsage
Name                                    gpus  DeviceGPUsAverageUsage  GPUsAverageUsage    DeviceGPUsMemoryPeakUsage  GPUsMemoryUsage
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    4     0.0                     undefined           484.0                      484.0
slot2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    1     0.02436906948889096     undefined           37997.0                    37997.0
slot3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    1     0.2557467106519156      undefined           95289.0                    95289.0
slot4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    0     0.6420998963701204      1.165656256588941   95695.0                    484.0
slot4_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  2     0.4771762084666735      0.5333874526797767  90981.0                    89675.0
slot4_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  1     0.164923687903447       0.0                 95695.0                    484.0
slot5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    1     0.02649657882417479     undefined           90909.0                    90909.0

Version: 9.0.17
I read about the improvements in newer versions, but I am not sure whether they are related to my query.
Questions:
- Why is the value of GPUsAverageUsage 1.16 for slot4, when slot4 itself doesn't run any job? It's also not a combination of GPUsAverageUsage on slot4_1 and slot4_2.
- Why is GPUsMemoryUsage equal to DeviceGPUsMemoryPeakUsage for the slots that are not in use? Couldn't it be undefined, like GPUsAverageUsage?
- What is UptimeGPUsSecondsAverageUsage? I couldn't find any information about this parameter.
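
For reference, the arithmetic behind the first question can be checked directly against the condor_status output above. This is only a sanity check on the posted numbers, not a statement about how HTCondor computes them; the dictionary and helper names are made up for the example.

```python
import math

# Values copied from the condor_status output above.
device_avg = {
    "slot4": 0.6420998963701204,
    "slot4_1": 0.4771762084666735,
    "slot4_2": 0.164923687903447,
}
gpus_avg = {
    "slot4": 1.165656256588941,
    "slot4_1": 0.5333874526797767,
    "slot4_2": 0.0,
}


def children_sum(values):
    """Sum the values reported by the two dynamic slots under slot4."""
    return values["slot4_1"] + values["slot4_2"]


# Compare the parent slot's value with the sum over its dynamic slots.
device_matches = math.isclose(children_sum(device_avg), device_avg["slot4"], rel_tol=1e-9)
gpus_matches = math.isclose(children_sum(gpus_avg), gpus_avg["slot4"], rel_tol=1e-9)

print("DeviceGPUsAverageUsage: slot4 equals sum of dynamic slots?", device_matches)
print("GPUsAverageUsage: slot4 equals sum of dynamic slots?", gpus_matches)
```

In this particular output, DeviceGPUsAverageUsage on slot4 matches the sum over slot4_1 and slot4_2, while GPUsAverageUsage does not.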
Thanks & Regards,
Vikrant Aggarwal