Re: [HTCondor-users] Offline Docker Universe and faulty gpu
- Date: Wed, 18 Mar 2026 15:12:05 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Offline Docker Universe and faulty gpu
On 3/13/26 00:15, Dudu Handelman wrote:
> Hi all,
>
> We have observed several cases of Docker containers hanging while
> still occupying all GPUs on a server. Despite this, new containers are
> able to start and use the same GPUs.
>
> The scenario appears to be the following: a user removes a job, but
> for some reason the NVIDIA driver becomes stuck. As a result, the
> starter process times out while trying to remove or stop the
> container. Because of this timeout, the Docker universe is not marked
> as offline.

Thanks, David:
In this case, can a Docker job that does not need the GPU still run? Or
are all Docker jobs stuck?
-greg