
Re: [HTCondor-users] GPUsMemoryUsage and GPUsAverageUsage missing from job classads



On 12/16/2024 4:56 PM, Vikrant Aggarwal wrote:
Hello All,

According to this presentation [1] (slide 8 in particular), GPU stats have been available in the job ClassAd since version 8.8.5. We are using version 9.0.17, but these stats show up only for a few jobs, not for all of them.

Example: a job that has been running for more than 2 days on a GPU node, and is using the GPU, doesn't show any stats.

From previous conversations, I understand that we are polling the NVIDIA drivers. Does this mean NVIDIA is not publishing the stats?

Thanks & Regards,
Vikrant Aggarwal


Hi Vikrant,

HTCSS does read usage information out of the NVIDIA and CUDA libraries. One suggestion: log in to the GPU server where the job has been running for two days and run the "condor_gpu_utilization" tool. This tool is typically found at /usr/libexec/condor/condor_gpu_utilization, but you can run "condor_config_val gpu_monitor" to get the exact path. When you run condor_gpu_utilization, it should output information every few minutes as it periodically polls the GPUs on the machine (it may take a couple of minutes until the first poll). Check whether it reports any errors, such as missing libraries.
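
If you want to script these checks, here is a minimal Python sketch, not an official HTCSS tool. It assumes that the job ad text comes from something like "condor_q -long <jobid>" (one "Attr = value" pair per line) and that the value of gpu_monitor may carry arguments after the path. The helper names parse_long_ad, missing_gpu_attrs, and gpu_monitor_command are made up for illustration.

```python
import shlex
import subprocess

# The two job ad attributes named in the subject of this thread.
GPU_USAGE_ATTRS = ("GPUsAverageUsage", "GPUsMemoryUsage")


def parse_long_ad(text):
    """Parse 'Attr = value' lines (long-form ClassAd text) into a dict.

    Values are kept as raw strings; this is not a real ClassAd parser.
    """
    ad = {}
    for line in text.splitlines():
        # Split on the first '=' only, so expressions containing '==' survive.
        name, sep, value = line.partition("=")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


def missing_gpu_attrs(ad):
    """Return the GPU usage attributes that are absent from the job ad."""
    return [attr for attr in GPU_USAGE_ATTRS if attr not in ad]


def gpu_monitor_command(run=subprocess.run):
    """Ask condor_config_val for the configured gpu_monitor command.

    'run' is injectable (hypothetical convenience) so the function can be
    exercised on a machine without HTCondor installed.
    """
    result = run(
        ["condor_config_val", "gpu_monitor"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: the configured value may include arguments after the path.
    return shlex.split(result.stdout.strip())
```

Feeding the long-form ad of the two-day job to parse_long_ad and then missing_gpu_attrs would show which of the two attributes is absent, and gpu_monitor_command returns the command to run by hand on the GPU server while watching for errors.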

However, and this is probably not the answer you wish to hear, HTCSS version 9 is out of date and no longer supported. Bugs related to GPUs have been fixed in the years since version 9. For instance, a bug where GPU stats disappeared after issuing a condor_reconfig was fixed long ago, and support was added for servers with different GPU models installed in the same machine. Many things also change with respect to GPUs, including the NVIDIA APIs, which seem to change in incompatible ways. Running current GPU hardware (e.g. A100s with MIG features) and/or recent NVIDIA libraries often necessitates a new version of HTCSS. Please consider updating to a current version of HTCSS :).

regards and happy holidays,
Todd