[HTCondor-users] Offline Docker Universe and faulty gpu



Hi all,
We have observed several cases of Docker containers hanging while still occupying all GPUs on a server. Despite this, new containers are able to start and use the same GPUs.
The scenario appears to be the following: a user removes a job, but for some reason the NVIDIA driver becomes stuck. As a result, the starter process times out while trying to stop or remove the container. Because the failure shows up only as a timeout, the Docker universe is not marked as offline, so the machine keeps accepting new Docker jobs on GPUs that the hung container still holds.
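
To make the failure mode concrete, here is a minimal Python sketch of the kind of external check we have in mind. It is not how the starter works internally. The helper name `probe`, the 30-second limit, and the idea of running this from a periodic health check are all assumptions on our side. It only tells apart "the command finished", "the command failed" and "the command hung", which is the distinction that seems to be lost in our case.

```python
import subprocess
import sys


def probe(cmd, timeout_s):
    """Hypothetical helper: run cmd and classify the outcome.

    Returns "ok" (exit code 0), "failed" (non-zero exit code or
    command not found) or "timeout" (no exit within timeout_s seconds).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this is raised.
        return "timeout"
    except FileNotFoundError:
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Usage (assumed): python probe.py <container-id>
    container_id = sys.argv[1]
    # "docker stop -t 10" asks Docker to wait 10 s before killing the container.
    status = probe(["docker", "stop", "-t", "10", container_id], timeout_s=30)
    print(f"docker stop {container_id}: {status}")
    # Exit non-zero on a hang so a wrapper script can react to it.
    sys.exit(0 if status == "ok" else 1)
```

A "timeout" result here would correspond to the situation described above. One caveat: this only works for commands that can still be killed, such as the `docker` client. A process blocked inside the driver itself may not react to the kill, and the check could then hang as well.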

Thanks 
David