Re: [HTCondor-users] GPUsAverageUsage not set/wrong value



Hi Todd,

On 6/12/26 16:21, Todd L Miller via HTCondor-users wrote:
We are sometimes seeing strange behavior with the GPUsAverageUsage of a job. It ends up not being set and stays "undefined" when queried, while querying the slot itself returns the correct usage value.

    Interesting. If I recall correctly, HTCondor detects GPU usage by device, and then accumulates that usage in the slot using that device. The last step is to assign the slot's usage -- from when the job began -- to the job. I can't presently imagine a reason for the last step to not happen.
Same, can't really see why this would happen.

    Could you describe the jobs for which you're seeing this problem in some detail? (Do they use more than one GPU? Are they container-universe jobs? How long do they run for? Are they running on glide-ins or on EPs started with root privileges?)
It's vanilla universe with the default container, all with 1 GPU, no glide-ins. The output of the GPU monitoring daemon on the EP itself looks fine.

On top of that, the GPUsAverageUsage value is sometimes completely off from the actual expected usage value.

    In these cases, do the per-slot numbers look sane?
Yes, the slot values always look correct.

I have the feeling somehow the calculations for this value are not fully correct. Any ideas or pointers where to look to debug this behavior more clearly?

    Unfortunately, the implementation is rather more complicated than has proved to be worthwhile; I don't know that a copy of a representative job ad would help, but it certainly wouldn't hurt.
My current guess is that if a job is restarted for some reason, some of the start/end times are not calculated correctly, so it ends up with a 0 value. Although that does not explain why some of the job slots don't have any value attached to them. It does show `UptimeGPUsSecondsAverageUsage = (UptimeGPUsSeconds - StartOfJobUptimeGPUsSeconds) / (LastUpdateUptimeGPUsSeconds - FirstUpdateUptimeGPUsSeconds)`, so if that somehow ends up with a wrong UpdateUptimeGPUsSeconds,
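
To make the guess concrete, here is a minimal Python sketch of the expression quoted above. It is not HTCondor's code: the function name `average_gpu_usage` and the choice to return `None` for an undefined result are my own assumptions, used only to show which inputs would produce 0 and which would produce no value at all.

```python
from typing import Optional


def average_gpu_usage(
    uptime_gpus_seconds: Optional[float],
    start_of_job_uptime_gpus_seconds: Optional[float],
    first_update_uptime_gpus_seconds: Optional[float],
    last_update_uptime_gpus_seconds: Optional[float],
) -> Optional[float]:
    """Evaluate the quoted expression; None stands in for 'undefined'.

    Hypothetical helper for illustration only, not part of HTCondor.
    """
    inputs = (
        uptime_gpus_seconds,
        start_of_job_uptime_gpus_seconds,
        first_update_uptime_gpus_seconds,
        last_update_uptime_gpus_seconds,
    )
    # Any missing attribute leaves the whole expression undefined.
    if any(value is None for value in inputs):
        return None

    window = last_update_uptime_gpus_seconds - first_update_uptime_gpus_seconds
    # A zero or negative window (first and last update stamps equal or
    # swapped) cannot yield a meaningful average.
    if window <= 0:
        return None

    used = uptime_gpus_seconds - start_of_job_uptime_gpus_seconds
    return used / window


# Normal case: 600 GPU-seconds used over a 1000-second window.
normal = average_gpu_usage(900.0, 300.0, 1000.0, 2000.0)

# Baseline reset to the current counter (e.g. after a restart): usage is 0.
reset_baseline = average_gpu_usage(900.0, 900.0, 1000.0, 2000.0)

# First and last update stamps identical: no value at all.
empty_window = average_gpu_usage(900.0, 300.0, 2000.0, 2000.0)
```

Under these assumptions, a wrongly reset start-of-job baseline would give the 0 value, while a missing or collapsed update window would give "undefined", which matches the two symptoms.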

    It may also be instructive to check, in the job event log (if any), what the reported GPU usage at the end of the run looks like.
I can check with the users to see what that reports.

Emily