
Re: [HTCondor-users] GPUsAverageUsage not set/wrong value

Date: Mon, 15 Jun 2026 13:43:25 +0200
From: Emily Kooistra <a66@xxxxxxxxx>
Subject: Re: [HTCondor-users] GPUsAverageUsage not set/wrong value

Hi Todd,

On 6/12/26 16:21, Todd L Miller via HTCondor-users wrote:

We are sometimes seeing strange behavior with the GPUsAverageUsage of a job. It ends up not being set and stays "undefined" when queried, while querying the slot itself returns the correct usage value.
    Interesting. If I recall correctly, HTCondor detects GPU usage by device, and then accumulates that usage in the slot using that device. The last step is to assign the slot's usage -- from when the job began -- to the job. I can't presently imagine a reason for the last step to not happen.

Same, can't really see why this would happen.

    Could you describe the jobs for which you're seeing this problem in some detail? (Do they use more than one GPU? Are they container-universe jobs? How long do they run for? Are they running on glide-ins or on EPs started with root privileges?)

It's vanilla universe with the default container, all 1 GPU, no glide-ins. Looking at the output of the GPU monitoring daemon on the EP itself, that looks fine.

In addition, the GPUsAverageUsage value is sometimes completely off from the actual expected usage value.
    In these cases, do the per-slot numbers look sane?

Yes, the slot values always look correct.

I have the feeling somehow the calculations for this value are not fully correct. Any ideas or pointers where to look to debug this behavior more clearly?
    Unfortunately, the implementation is rather more complicated than has proved to be worthwhile; I don't know that a copy of a representative job ad would help, but it certainly wouldn't hurt.

My current guess is that if a job is restarted for some reason, some of the start/end times are not calculated correctly, so it ends up with a 0 value. Although that does not explain why some of the job slots don't have any value attached to them. Given it does show a `UptimeGPUsSecondsAverageUsage = (UptimeGPUsSeconds - StartOfJobUptimeGPUsSeconds) / (LastUpdateUptimeGPUsSeconds - FirstUpdateUptimeGPUsSeconds)`. So if that somehow ends up with a wrong UpdateUptimeGPUsSeconds,
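
To make that guess a bit more concrete, here is a minimal Python sketch of the expression quoted above. The attribute names come from that expression; everything else is an assumption for illustration only: the sample numbers are made up, the function name is hypothetical, and returning None for a missing attribute or a non-positive time span is only a stand-in for an unset value, not a claim about how HTCondor actually evaluates it.

```python
from typing import Mapping, Optional

# Attribute names as they appear in the expression quoted above.
REQUIRED = (
    "UptimeGPUsSeconds",
    "StartOfJobUptimeGPUsSeconds",
    "LastUpdateUptimeGPUsSeconds",
    "FirstUpdateUptimeGPUsSeconds",
)


def average_gpu_usage(ad: Mapping[str, float]) -> Optional[float]:
    """Evaluate the quoted expression on a dict standing in for a slot ad.

    Returns None (a stand-in for "undefined") when an attribute is missing
    or when the update window has no positive length.
    """
    if any(name not in ad for name in REQUIRED):
        return None
    elapsed = ad["LastUpdateUptimeGPUsSeconds"] - ad["FirstUpdateUptimeGPUsSeconds"]
    if elapsed <= 0:
        return None
    used = ad["UptimeGPUsSeconds"] - ad["StartOfJobUptimeGPUsSeconds"]
    return used / elapsed


# Made-up numbers: 800 GPU-seconds used over a 1000-second window.
healthy = {
    "UptimeGPUsSeconds": 900,
    "StartOfJobUptimeGPUsSeconds": 100,
    "LastUpdateUptimeGPUsSeconds": 2000,
    "FirstUpdateUptimeGPUsSeconds": 1000,
}

# Hypothetical restart: the start-of-job counter was re-recorded at the
# current counter value, so the numerator collapses to zero.
restarted = dict(healthy, StartOfJobUptimeGPUsSeconds=900)

# Hypothetical ad where the start-of-job counter was never recorded.
incomplete = {k: v for k, v in healthy.items() if k != "StartOfJobUptimeGPUsSeconds"}

print(average_gpu_usage(healthy))
print(average_gpu_usage(restarted))
print(average_gpu_usage(incomplete))
```

With these invented inputs the first case gives a plausible average, the second gives 0, and the third gives no value at all, which would match the two symptoms if (and only if) the start-of-job bookkeeping really goes wrong in this way.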

    It may also be instructive to check, in the job event log (if any), what the reported GPUs usage at the end of the run looks like.

I can check with the users to see what that reports.

Emily

