[HTCondor-users] Offline Docker Universe and faulty gpu
- Date: Fri, 13 Mar 2026 05:15:53 +0000
- From: Dudu Handelman <duduhandelman@xxxxxxxxxxx>
- Subject: [HTCondor-users] Offline Docker Universe and faulty gpu
Hi all,
We have observed several cases of Docker containers hanging while still occupying all GPUs on a server. Despite this, new containers are able to start and use the same GPUs.
The scenario appears to be the following: a user removes a job, but for some reason the NVIDIA driver becomes stuck. As a result, the starter process times out while trying to remove or stop the container. Because of this timeout, the Docker universe is not marked as offline.
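To illustrate the distinction involved, here is a minimal Python sketch of a timeout-guarded probe that reports a command that never returns ("hung") separately from one that exits with an error ("failed"). It is only a sketch under stated assumptions: the function name `probe` and the three result strings are hypothetical, and nothing here is taken from HTCondor itself or describes how the starter works.

```python
import subprocess


def probe(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'failed', or 'hung'.

    Hypothetical helper for illustration only; not an HTCondor API.
    """
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The command could not be started at all (e.g. binary missing).
        return "failed"

    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best-effort kill. We deliberately do not wait afterwards:
        # a process blocked inside a kernel driver may not exit even
        # after SIGKILL, and waiting on it would block the probe too.
        proc.kill()
        return "hung"

    return "ok" if rc == 0 else "failed"
```

On a GPU node this could be called as, for example, `probe(["nvidia-smi", "-L"], 30)`, where the 30-second limit is an arbitrary choice. A "hung" result would correspond to the stuck-driver case described above, as opposed to an ordinary error.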
Thanks
David