Re: [HTCondor-users] GPUsAverageUsage not set/wrong value
- Date: Fri, 12 Jun 2026 09:21:29 -0500 (CDT)
- From: Todd L Miller <tlmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] GPUsAverageUsage not set/wrong value
We are sometimes seeing strange behavior with the GPUsAverageUsage of a job.
It ends up not being set and stays "undefined" when queried, while querying
the slot itself returns the correct usage value.
Interesting. If I recall correctly, HTCondor detects GPU usage by
device, and then accumulates that usage in the slot using that device.
The last step is to assign the slot's usage -- from when the job began --
to the job. I can't presently imagine a reason for the last step to not
happen.
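As a rough illustration of the accumulate-then-assign flow described above, here is a toy Python model. It is not HTCondor's actual code: the class, the function, the device name, and the choice to return None when no samples exist are all hypothetical and are there only to make the three steps concrete.

```python
class SlotUsage:
    """Toy model: running sum of GPU utilization samples for one slot."""

    def __init__(self, devices):
        self.devices = set(devices)  # devices assigned to this slot
        self.total = 0.0             # sum of utilization samples (0.0 to 1.0)
        self.samples = 0             # number of samples accumulated

    def record(self, device, utilization):
        # Steps 1 and 2: usage is measured per device, then accumulated
        # in the slot that uses that device.
        if device in self.devices:
            self.total += utilization
            self.samples += 1

    def snapshot(self):
        # Taken when a job begins, so earlier usage can be subtracted.
        return (self.total, self.samples)


def average_since(slot, start):
    """Step 3: the slot's average usage since the job began."""
    start_total, start_samples = start
    samples = slot.samples - start_samples
    if samples == 0:
        return None  # hypothetical stand-in for "no value to report"
    return (slot.total - start_total) / samples
```

In this sketch a job's figure is simply the difference between the slot's running totals now and at job start, divided by the number of samples taken in between.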
Could you describe the jobs for which you're seeing this problem
in some detail? (Do they use more than one GPU? Are they
container-universe jobs? How long do they run for? Are they running on
glide-ins or on EPs started with root privileges?)
In addition, the GPUsAverageUsage value sometimes seems to be
completely off from the actual expected usage value.
In these cases, do the per-slot numbers look sane?
I have the feeling somehow the calculations for this value are not fully
correct. Any ideas or pointers where to look to debug this behavior more
clearly?
Unfortunately, the implementation is rather more complicated than
has proved to be worthwhile; I don't know that a copy of a representative
job ad would help, but it certainly wouldn't hurt.
It may also be instructive to check, in the job event log (if
any), what the reported GPU usage at the end of the run looks like.
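One low-tech way to do that check is sketched below in Python, assuming only that the job event log is a plain-text file. The function name is hypothetical, and the sketch makes no assumption about the exact layout of the usage report; it just pulls out every line that mentions GPUs so the figures can be read by eye.

```python
def gpu_lines(log_path):
    """Return (line_number, text) pairs for log lines that mention GPUs."""
    matches = []
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if "gpu" in line.lower():
                matches.append((number, line.rstrip("\n")))
    return matches
```

Calling `gpu_lines("job.log")` (with "job.log" standing in for whatever the submit file's log setting points at) and printing the pairs should show the reported GPU usage alongside its line number in the log.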
-- ToddM