Re: [HTCondor-users] Offline Docker Universe and faulty gpu



On 3/13/26 00:15, Dudu Handelman wrote:
Hi all,
We have observed several cases of Docker containers hanging while still occupying all GPUs on a server. Despite this, new containers are able to start and use the same GPUs. The scenario appears to be the following: a user removes a job, but for some reason the NVIDIA driver becomes stuck. As a result, the starter process times out while trying to remove or stop the container. Because of this timeout, the Docker universe is not marked as offline.


Thanks, David:

In this case, can a Docker job that does not need a GPU still run, or are all Docker jobs stuck?

-greg