
[HTCondor-users] GPU memory usage calculation



Hello Experts,

I need some help understanding how GPUsMemoryUsage and GPUsAverageUsage are calculated.

I have a machine with 10 GPUs partitioned into multiple dynamic slots. Two running jobs on this machine are using 3 GPUs from slot4. As I understand it, attributes starting with GPUs* are advertised in the job definition because they are per-job metrics, whereas DeviceGPUs* attributes are not aware of jobs.


# condor_status `hostname` -af:h Name gpus DeviceGPUsAverageUsage GPUsAverageUsage DeviceGPUsMemoryPeakUsage GPUsMemoryUsage
Name                                      gpus  DeviceGPUsAverageUsage  GPUsAverageUsage    DeviceGPUsMemoryPeakUsage  GPUsMemoryUsage
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    4     0.0                     undefined           484.0                      484.0
slot2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    1     0.02436906948889096     undefined           37997.0                    37997.0
slot3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    1     0.2557467106519156      undefined           95289.0                    95289.0
slot4@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    0     0.6420998963701204      1.165656256588941   95695.0                    484.0
slot4_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  2     0.4771762084666735      0.5333874526797767  90981.0                    89675.0
slot4_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  1     0.164923687903447       0.0                 95695.0                    484.0
slot5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    1     0.02649657882417479     undefined           90909.0                    90909.0


Version: 9.0.17

I have read about improvements in newer versions, but I am not sure whether they are related to my question.

Questions:

- Why is GPUsAverageUsage about 1.17 for slot4, when slot4 itself doesn't run any job? It is also not the sum of GPUsAverageUsage on slot4_1 and slot4_2.

- Why is GPUsMemoryUsage equal to DeviceGPUsMemoryPeakUsage for the slots that are not in use? Couldn't it be undefined, like GPUsAverageUsage?

- What is UptimeGPUsSecondsAverageUsage? I couldn't find any information about this attribute.
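
To make the first question concrete, here is a small Python sketch of the arithmetic, using only values copied from the condor_status output above. The dictionary layout and the child_sum helper are made up for illustration and are not part of HTCondor.

```python
# Values copied from the condor_status output for slot4 and its dynamic children.
slots = {
    "slot4": {
        "DeviceGPUsAverageUsage": 0.6420998963701204,
        "GPUsAverageUsage": 1.165656256588941,
    },
    "slot4_1": {
        "DeviceGPUsAverageUsage": 0.4771762084666735,
        "GPUsAverageUsage": 0.5333874526797767,
    },
    "slot4_2": {
        "DeviceGPUsAverageUsage": 0.164923687903447,
        "GPUsAverageUsage": 0.0,
    },
}


def child_sum(attr):
    """Sum one attribute over the dynamic slots carved out of slot4 (hypothetical helper)."""
    return sum(slots[name][attr] for name in ("slot4_1", "slot4_2"))


# Compare the partitionable slot's value with the sum over its children.
for attr in ("DeviceGPUsAverageUsage", "GPUsAverageUsage"):
    parent = slots["slot4"][attr]
    total = child_sum(attr)
    print(f"{attr}: parent={parent:.6f} children={total:.6f} diff={parent - total:.6f}")
```

With these numbers, DeviceGPUsAverageUsage on slot4 agrees with the sum over slot4_1 and slot4_2 to within floating-point rounding, while GPUsAverageUsage on slot4 is larger than the children's sum by roughly 0.63, which is the part I cannot explain.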


Thanks & Regards,
Vikrant Aggarwal