On 3/13/26 00:15, Dudu Handelman wrote:
> Hi all,
> We have observed several cases of Docker containers hanging while
> still occupying all GPUs on a server. Despite this, new containers are
> able to start and use the same GPUs.
> The scenario appears to be the following: a user removes a job, but
> for some reason the NVIDIA driver becomes stuck. As a result, the
> starter process times out while trying to remove or stop the
> container. Because of this timeout, the Docker universe is not marked
> as offline.
>
Thanks, David:
In this case, can a Docker job that does not need a GPU still run, or are
all Docker jobs stuck?
-greg
_______________________________________________
HTCondor-users mailing list