
Re: [HTCondor-users] GPUs not detected in 9.0.6 version



Hello all,

On the one hand, Stuart's and Steffen's tests do not reproduce my issue. On the other hand, our machines that use CUDA 11 with GeForce RTX 2080 Ti and Tesla V100 cards do not segfault. So I updated CUDA from 10.1 to 11 and, voilà, condor_gpu_discovery no longer segfaults. In conclusion, the condor_gpu_discovery segfault in version 9.0.6 seems to be related to CUDA versions older than 11.

Thank you very much.

Cheers,

Carles

On Thu, 30 Sept 2021 at 09:58, Steffen Grunewald <steffen.grunewald@xxxxxxxxxx> wrote:
On Wed, 2021-09-29 at 14:53:39 +0000, John M Knoeller wrote:
> well that's not good.
>
> Could you try running the 9.0.6 condor_gpu_discovery with
>
>   condor_gpu_discovery -verbose -diag
>
> and send me the results?
>
> thanks
> -tj
>
> ________________________________
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Carles Acosta <cacosta@xxxxxx>
> Sent: Tuesday, September 28, 2021 11:53 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] GPUs not detected in 9.0.6 version
>
> Hi Stuart, TJ,
>
> Thank you for your replies.
>
> Regarding the CUDA version, we did not update it. We are using old CUDA 10.1.243 for these GPUs.
>
> We did some testing as TJ suggested. We are currently running HTCondor 9.0.6 with the condor_gpu_discovery binary from HTCondor 9.0.5, and the GPU is discovered correctly:
>
> # condor_status slot2@xxxxxxxxxxxx -af Gpus DetectedGpus CondorVersion
> 1 GPU-c659279d $CondorVersion: 9.0.6 Sep 23 2021 BuildID: 557184 PackageID: 9.0.6-1 $
>
> Thus, it is related to the new condor_gpu_discovery binary in version 9.0.6. In fact:
>
> [root@gpu03 ~]# /usr/libexec/condor/condor_gpu_discovery-9.0.5
> DetectedGPUs="GPU-c659279d"
> [root@gpu03 ~]# /usr/libexec/condor/condor_gpu_discovery-9.0.6
> Segmentation fault

Hi John, all,

since I'm building my own set of packages, I extracted the condor_gpu_discovery binaries

-rwxr-xr-x 1 root root 60040 Oct 23 2020 condor_gpu_discovery-8.8.11
-rwxr-xr-x 1 root root 60040 Aug 2 15:54 condor_gpu_discovery-8.8.15
-rwxr-xr-x 1 root root 88800 Aug 2 17:02 condor_gpu_discovery-9.0.4
-rwxr-xr-x 1 root root 88800 Aug 20 13:10 condor_gpu_discovery-9.0.5
-rwxr-xr-x 1 root root 105216 Sep 24 11:01 condor_gpu_discovery-9.0.6
-rwxr-xr-x 1 root root 96992 Aug 20 14:02 condor_gpu_discovery-9.1.3
-rwxr-xr-x 1 root root 109312 Sep 24 11:36 condor_gpu_discovery-9.2.0

and ran them on a Debian Buster machine equipped with two Kepler K10s:

condor_gpu_discovery-8.8.11
DetectedGPUs="CUDA0, CUDA1"
condor_gpu_discovery-8.8.15
DetectedGPUs="CUDA0, CUDA1"
condor_gpu_discovery-9.0.4
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.0.5
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.0.6
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.1.3
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.2.0
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"

No segfault at all.

Now an error that cannot be reproduced wouldn't help you much...
so I took the prebuilt Buster packages and ran the same:

-rwxr-xr-x 1 root root 60040 Jul 29 19:36 condor_gpu_discovery-8.8.15
-rwxr-xr-x 1 root root 84680 Mar 30 2021 condor_gpu_discovery-8.9.13
-rwxr-xr-x 1 root root 88800 Jul 29 18:01 condor_gpu_discovery-9.0.4
-rwxr-xr-x 1 root root 88800 Aug 18 21:25 condor_gpu_discovery-9.0.5
-rwxr-xr-x 1 root root 105216 Sep 23 17:09 condor_gpu_discovery-9.0.6
-rwxr-xr-x 1 root root 96992 Aug 19 21:27 condor_gpu_discovery-9.1.3
-rwxr-xr-x 1 root root 109312 Sep 23 23:37 condor_gpu_discovery-9.2.0

condor_gpu_discovery-8.8.15
DetectedGPUs="CUDA0, CUDA1"
condor_gpu_discovery-8.9.13
DetectedGPUs="CUDA0, CUDA1"
condor_gpu_discovery-9.0.4
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.0.5
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.0.6
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.1.3
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.2.0
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"

- no segfault with Debian Buster. I'm suspecting a shared library issue...
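
The side-by-side comparison above could be scripted roughly as follows. This is only a sketch: the function name check_discovery_binaries is hypothetical, and it assumes the extracted binaries sit in one directory under the condor_gpu_discovery-<version> names used in this thread. It relies on the common shell convention that a process killed by SIGSEGV (signal 11) is reported with exit status 128 + 11 = 139.

```shell
#!/bin/sh
# Hypothetical helper (not part of HTCondor): run every
# condor_gpu_discovery-* executable found in the given directory and
# print one summary line per binary.
check_discovery_binaries() {
    dir="$1"
    for bin in "$dir"/condor_gpu_discovery-*; do
        # Skip anything that is not executable (also covers an unmatched glob).
        [ -x "$bin" ] || continue
        name=$(basename "$bin")
        # Capture stdout; discard stderr so only the ClassAd line is kept.
        out=$("$bin" 2>/dev/null)
        rc=$?
        if [ "$rc" -eq 139 ]; then
            # 139 = 128 + 11, i.e. the process was killed by SIGSEGV.
            echo "$name: SEGFAULT"
        elif [ "$rc" -ne 0 ]; then
            echo "$name: exit $rc"
        else
            echo "$name: $out"
        fi
    done
}

# Example call, assuming the binaries were copied next to the installed one:
#   check_discovery_binaries /usr/libexec/condor
```

Running it on both an affected and an unaffected host should make it easy to see which versions crash where.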

Curious to learn about the actual culprit ;)
- Steffen


--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~


--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es