
Re: [HTCondor-users] Offline Docker Universe and faulty gpu



Thanks Greg. 
I haven't tried, but I'm fairly sure a job that doesn't need the GPU will work. I personally think we need to bring down the Docker universe on any unresponsive-driver problem, whether it's NFS, the GPU, or something else. It usually happens when the user removes the job. 
We could take a different approach and test the NVIDIA driver, perhaps using the same script that collects the performance metrics. 
But it's tricky, because nvidia-smi may never return. 

Basically, this is the issue with the driver. We need to make sure the server's uptime stays below 66 days. :-) 
 https://github.com/NVIDIA/open-gpu-kernel-modules/issues/971
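A minimal sketch of that uptime guard, reading `/proc/uptime` on Linux (the threshold and message are illustrative):

```shell
#!/bin/sh
# Hypothetical check sketch: warn when uptime approaches 66 days,
# the point where the driver issue linked above can trigger.
limit=$((66 * 24 * 3600))          # 66 days, in seconds
up=$(cut -d. -f1 /proc/uptime)     # whole seconds since boot
if [ "$up" -ge "$limit" ]; then
    echo "uptime ${up}s is past 66 days; schedule a reboot"
fi
```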

Thx
David






From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, March 18, 2026 10:12:50 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Offline Docker Universe and faulty gpu

On 3/13/26 00:15, Dudu Handelman wrote:
> Hi all,
> We have observed several cases of Docker containers hanging while
> still occupying all GPUs on a server. Despite this, new containers are
> able to start and use the same GPUs.
> The scenario appears to be the following: a user removes a job, but
> for some reason the NVIDIA driver becomes stuck. As a result, the
> starter process times out while trying to remove or stop the
> container. Because of this timeout, the Docker universe is not marked
> as offline.
>

Thanks, David:

In this case can a docker job that does not need the GPU work? Or are
all docker jobs stuck?

-greg

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/