Re: [HTCondor-users] GPUsAverageUsage not set/wrong value
- Date: Mon, 15 Jun 2026 13:43:25 +0200
- From: Emily Kooistra <a66@xxxxxxxxx>
- Subject: Re: [HTCondor-users] GPUsAverageUsage not set/wrong value
Hi Todd,
On 6/12/26 16:21, Todd L Miller via HTCondor-users wrote:
We are sometimes seeing strange behavior with the GPUsAverageUsage of
a job. It ends up not being set and stays "undefined" when queried,
while querying the slot itself returns the correct usage value.
Interesting. If I recall correctly, HTCondor detects GPU usage by
device, and then accumulates that usage in the slot using that device.
The last step is to assign the slot's usage -- from when the job began
-- to the job. I can't presently imagine a reason for the last step to
not happen.
Same, can't really see why this would happen.
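
To make the three steps Todd describes concrete, here is a minimal Python sketch of that flow as I understand it from his description. It is a toy model, not HTCondor's actual code: the class, its methods, and the device names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SlotGpuAccounting:
    """Hypothetical per-slot accumulator; not an HTCondor data structure."""

    devices: Tuple[str, ...]                     # GPUs assigned to this slot
    uptime_gpu_seconds: float = 0.0              # running total for the slot
    job_start_snapshot: Optional[float] = None   # slot total when the job began

    def record_sample(self, per_device_seconds: Dict[str, float]) -> None:
        # Steps 1 and 2: usage is measured per device, then accumulated
        # only for the devices this slot is using.
        self.uptime_gpu_seconds += sum(
            per_device_seconds.get(dev, 0.0) for dev in self.devices
        )

    def job_started(self) -> None:
        # Remember the slot total at job start so the job can be
        # charged only for what accrues afterwards.
        self.job_start_snapshot = self.uptime_gpu_seconds

    def job_usage(self) -> Optional[float]:
        # Step 3: assign the slot's usage since the job began to the job.
        # Without a start snapshot there is nothing to report.
        if self.job_start_snapshot is None:
            return None
        return self.uptime_gpu_seconds - self.job_start_snapshot
```

In this model the slot total stays correct no matter what, while the job's number depends entirely on the start snapshot being taken at the right moment, which would match seeing sane slot values next to a missing job value.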
Could you describe the jobs for which you're seeing this problem in
some detail? (Do they use more than one GPU? Are they container-
universe jobs? How long do they run for? Are they running on glide-ins
or on EPs started with root privileges?)
It's vanilla universe with the default container, all 1 GPU, no glide-ins.
Looking at the output of the GPU monitoring daemon on the EP itself,
that looks fine.
In addition, the GPUsAverageUsage value is sometimes
completely off from the expected usage value.
In these cases, do the per-slot numbers look sane?
Yes, the slot values always look correct.
I have the feeling somehow the calculations for this value are not
fully correct. Any ideas or pointers where to look to debug this
behavior more clearly?
Unfortunately, the implementation is rather more complicated than
has proved to be worthwhile; I don't know that a copy of a
representative job ad would help, but it certainly wouldn't hurt.
My current guess is that if a job is restarted for some reason, some of
the start/end times are not calculated correctly, so it ends up with a 0
value. Although that does not explain why some of the job slots don't have
any value attached to them. Given that it does show a
`UptimeGPUsSecondsAverageUsage = (UptimeGPUsSeconds -
StartOfJobUptimeGPUsSeconds) / (LastUpdateUptimeGPUsSeconds -
FirstUpdateUptimeGPUsSeconds)`, if that somehow ends up being computed
with a wrong UpdateUptimeGPUsSeconds, the result would be off.
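
A small Python sketch of that expression shows how bad inputs could produce both symptoms. The function and its "undefined" handling are my own illustration; only the attribute names come from the expression above, and I am assuming a missing input or a non-positive time window leaves the result undefined.

```python
from typing import Optional


def average_gpu_usage(
    uptime_gpus_seconds: Optional[float],
    start_of_job_uptime_gpus_seconds: Optional[float],
    first_update_uptime_gpus_seconds: Optional[float],
    last_update_uptime_gpus_seconds: Optional[float],
) -> Optional[float]:
    """Evaluate the quoted expression; None stands in for 'undefined'."""
    inputs = (
        uptime_gpus_seconds,
        start_of_job_uptime_gpus_seconds,
        first_update_uptime_gpus_seconds,
        last_update_uptime_gpus_seconds,
    )
    # Assumption: any missing attribute makes the whole result undefined.
    if any(value is None for value in inputs):
        return None
    elapsed = last_update_uptime_gpus_seconds - first_update_uptime_gpus_seconds
    # Assumption: a zero or negative window (for example, only one update
    # was ever recorded) cannot yield a meaningful average.
    if elapsed <= 0:
        return None
    used = uptime_gpus_seconds - start_of_job_uptime_gpus_seconds
    return used / elapsed


# Healthy case: 100 GPU-seconds used over a 200-second window.
print(average_gpu_usage(120.0, 20.0, 1000.0, 1200.0))   # 0.5
# Start snapshot taken too late (e.g. after a restart): usage collapses to 0.
print(average_gpu_usage(120.0, 120.0, 1000.0, 1200.0))  # 0.0
# A missing update timestamp leaves the value undefined.
print(average_gpu_usage(120.0, 20.0, None, 1200.0))     # None
```

If this reading is right, a restart that resets StartOfJobUptimeGPUsSeconds would explain the 0 values, and a missing or equal first/last update would explain the undefined ones.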
It may also be instructive to check, in the job event log (if any),
what the reported GPU usage at the end of the run looks like.
I can check with the users to see what that reports.
Emily