
Re: [HTCondor-users] GPUsAverageUsage not set/wrong value



We are sometimes seeing strange behavior with the GPUsAverageUsage of a job. It ends up not being set and stays "undefined" when queried, while querying the slot itself returns the correct usage value.

Interesting. If I recall correctly, HTCondor detects GPU usage by device, and then accumulates that usage in the slot using that device. The last step is to assign the slot's usage -- from when the job began -- to the job. I can't presently imagine a reason for the last step to not happen.
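
As a rough illustration of the three steps described above (per-device usage, accumulated into the slot, then assigned to the job), here is a minimal Python sketch. It is a toy model only, not HTCondor's actual implementation: the function name, the device names, and the idea of a cumulative "busy seconds" counter per device are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def slot_average_usage(
    busy_at_job_start: Dict[str, float],
    busy_now: Dict[str, float],
    slot_devices: List[str],
    elapsed_seconds: float,
) -> Optional[float]:
    """Toy model: average GPU usage of a slot since the job began.

    busy_at_job_start / busy_now map a (hypothetical) device name to a
    cumulative busy-seconds counter sampled at job start and at present.
    Returns None, standing in for "undefined", when the value cannot be
    computed in this toy model.
    """
    if elapsed_seconds <= 0:
        return None
    total_busy = 0.0
    for device in slot_devices:
        # Step 1: usage is measured per device.
        if device not in busy_at_job_start or device not in busy_now:
            return None  # toy-model behavior only, not a diagnosis
        # Step 2: accumulate each device's usage into its slot.
        total_busy += busy_now[device] - busy_at_job_start[device]
    return total_busy / elapsed_seconds


# Hypothetical samples for a slot holding two devices.
start = {"GPU-a": 100.0, "GPU-b": 40.0}
now = {"GPU-a": 160.0, "GPU-b": 70.0}

# Step 3: the slot's usage since the job began is assigned to the job.
job_usage = slot_average_usage(start, now, ["GPU-a", "GPU-b"], 120.0)
print(job_usage)
```

In this model the job's value can only differ from the slot's if the final assignment step is skipped or uses a different start time, which is why comparing the per-slot and per-job numbers is a useful first check.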

Could you describe the jobs for which you're seeing this problem in some detail? (Do they use more than one GPU? Are they container-universe jobs? How long do they run for? Are they running on glide-ins or on EPs started with root privileges?)

In addition, the GPUsAverageUsage value sometimes seems to be completely off from the expected usage value.

	In these cases, do the per-slot numbers look sane?

I have the feeling that the calculations for this value are somehow not fully correct. Any ideas or pointers on where to look to debug this behavior further?

Unfortunately, the implementation is rather more complicated than has proved to be worthwhile; I don't know that a copy of a representative job ad would help, but it certainly wouldn't hurt.

It may also be instructive to check, in the job event log (if any), what the reported GPU usage at the end of the run looks like.

-- ToddM