
Re: [HTCondor-users] Offline Docker Universe and faulty gpu



Thanks Greg. 
I haven't tried, but I'm fairly sure a job that doesn't need the GPU will work. I personally think we need to bring down the Docker universe on any unresponsive-driver problem, whether it's NFS, the GPU, or something else. It usually happens when the user removes the job. 
We could take a different approach and test the NVIDIA driver, perhaps using the same script that collects the performance metrics. 
But it's tricky, because nvidia-smi may never return. 

Basically, this is the issue with the driver. We need to make sure the server's uptime stays below 66 days. :-) 
 https://github.com/NVIDIA/open-gpu-kernel-modules/issues/971
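A minimal sketch of that uptime guard, reading `/proc/uptime` on Linux (the threshold and message are illustrative):

```shell
#!/bin/sh
# Hypothetical check sketch: warn when uptime approaches 66 days,
# the point where the driver issue linked above can trigger.
limit=$((66 * 24 * 3600))          # 66 days, in seconds
up=$(cut -d. -f1 /proc/uptime)     # whole seconds since boot
if [ "$up" -ge "$limit" ]; then
    echo "uptime ${up}s is past 66 days; schedule a reboot"
fi
```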

Thx
David






From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, March 18, 2026 10:12:50 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Offline Docker Universe and faulty gpu

On 3/13/26 00:15, Dudu Handelman wrote:
> Hi all,
> We have observed several cases of Docker containers hanging while
> still occupying all GPUs on a server. Despite this, new containers are
> able to start and use the same GPUs.
> The scenario appears to be the following: a user removes a job, but
> for some reason the NVIDIA driver becomes stuck. As a result, the
> starter process times out while trying to remove or stop the
> container. Because of this timeout, the Docker universe is not marked
> as offline.
>

Thanks, David:

In this case can a docker job that does not need the GPU work? Or are
all docker jobs stuck?

-greg

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/