On Wed, 2021-09-29 at 14:53:39 +0000, John M Knoeller wrote:
> well that's not good.
>
> Could you try running the 9.0.6 condor_gpu_discovery with
>
>   condor_gpu_discovery -verbose -diag
>
> and send me the results?
>
> thanks
> -tj
>
> ________________________________
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Carles Acosta <cacosta@xxxxxx>
> Sent: Tuesday, September 28, 2021 11:53 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] GPUs not detected in 9.0.6 version
>
> Hi Stuart, TJ,
>
> Thank you for your replies.
>
> Regarding the CUDA version, we did not update it. We are using old CUDA 10.1.243 for these GPUs.
>
> We did some testing as TJ suggests. We are running right now HTCondor 9.0.6 but using the condor_gpu_discovery from HTCondor 9.0.5 and the GPU is correctly discovered:
>
> # condor_status slot2@xxxxxxxxxxxx -af Gpus DetectedGpus CondorVersion
> 1 GPU-c659279d $CondorVersion: 9.0.6 Sep 23 2021 BuildID: 557184 PackageID: 9.0.6-1 $
>
> Thus, it is related to the new condor_gpu_discovery binary in version 9.0.6. In fact:
>
> [root@gpu03 ~]# /usr/libexec/condor/condor_gpu_discovery-9.0.5
> DetectedGPUs="GPU-c659279d"
> [root@gpu03 ~]# /usr/libexec/condor/condor_gpu_discovery-9.0.6
> Segmentation fault
Hi John, all,
Since I'm building my own set of packages, I extracted the condor_gpu_discovery binaries:
-rwxr-xr-x 1 root root 60040 Oct 23 2020 condor_gpu_discovery-8.8.11
-rwxr-xr-x 1 root root 60040 Aug 2 15:54 condor_gpu_discovery-8.8.15
-rwxr-xr-x 1 root root 88800 Aug 2 17:02 condor_gpu_discovery-9.0.4
-rwxr-xr-x 1 root root 88800 Aug 20 13:10 condor_gpu_discovery-9.0.5
-rwxr-xr-x 1 root root 105216 Sep 24 11:01 condor_gpu_discovery-9.0.6
-rwxr-xr-x 1 root root 96992 Aug 20 14:02 condor_gpu_discovery-9.1.3
-rwxr-xr-x 1 root root 109312 Sep 24 11:36 condor_gpu_discovery-9.2.0
and ran them on a Debian Buster machine equipped with two Kepler K10s:
condor_gpu_discovery-8.8.11
DetectedGPUs="CUDA0, CUDA1"
condor_gpu_discovery-8.8.15
DetectedGPUs="CUDA0, CUDA1"
condor_gpu_discovery-9.0.4
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.0.5
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.0.6
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.1.3
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.2.0
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
No segfault at all.
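For anyone who wants to repeat this comparison, the runs above amount to a loop like the sketch below. It assumes all binaries sit in one directory and follow the condor_gpu_discovery-<version> naming shown in the listing; the function name run_all and the directory argument are made up for illustration.

```shell
# Run every condor_gpu_discovery-<version> binary found in a directory and
# print its name, its output, and any non-zero exit status.
# A crash from SIGSEGV normally shows up as exit status 139 (128 + 11).
run_all() {
    dir=${1:-.}
    for bin in "$dir"/condor_gpu_discovery-*; do
        # Skip non-executables (and the unexpanded glob if nothing matches).
        [ -x "$bin" ] || continue
        basename "$bin"
        "$bin" 2>&1
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "  -> exit status $status"
        fi
    done
}
```

Calling run_all with the directory holding the extracted binaries prints one block per version, in the same layout as the output listed above.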
Since an error that cannot be reproduced won't help you much,
I also took the prebuilt Buster packages and ran the same test:
-rwxr-xr-x 1 root root 60040 Jul 29 19:36 condor_gpu_discovery-8.8.15
-rwxr-xr-x 1 root root 84680 Mar 30 2021 condor_gpu_discovery-8.9.13
-rwxr-xr-x 1 root root 88800 Jul 29 18:01 condor_gpu_discovery-9.0.4
-rwxr-xr-x 1 root root 88800 Aug 18 21:25 condor_gpu_discovery-9.0.5
-rwxr-xr-x 1 root root 105216 Sep 23 17:09 condor_gpu_discovery-9.0.6
-rwxr-xr-x 1 root root 96992 Aug 19 21:27 condor_gpu_discovery-9.1.3
-rwxr-xr-x 1 root root 109312 Sep 23 23:37 condor_gpu_discovery-9.2.0
condor_gpu_discovery-8.8.15
DetectedGPUs="CUDA0, CUDA1"
condor_gpu_discovery-8.9.13
DetectedGPUs="CUDA0, CUDA1"
condor_gpu_discovery-9.0.4
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.0.5
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.0.6
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.1.3
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
condor_gpu_discovery-9.2.0
DetectedGPUs="GPU-a2ac647a, GPU-18cd56a0"
Again, no segfault on Debian Buster. I suspect a shared library issue...
Curious to learn about the actual culprit ;)
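One way to follow up on the shared-library suspicion would be to compare what a working and a crashing binary link against. The sketch below uses ldd for that; the helper names linked_libs and diff_libs are hypothetical. Note that ldd only reports link-time dependencies. If the tool loads the CUDA or NVML libraries at run time via dlopen() (not verified here), they will not appear, and running the binary with glibc's LD_DEBUG=libs set would be the better check.

```shell
# Print the names of the shared libraries a binary is linked against,
# one per line, sorted and de-duplicated.
linked_libs() {
    ldd "$1" 2>/dev/null | awk '{print $1}' | sort -u
}

# Show which linked libraries differ between two binaries,
# e.g. the 9.0.5 and 9.0.6 builds. Returns diff's exit status
# (0 = identical library lists, 1 = they differ).
diff_libs() {
    a=$(mktemp) && b=$(mktemp) || return 1
    linked_libs "$1" > "$a"
    linked_libs "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

Running diff_libs on the 9.0.5 and 9.0.6 binaries on the affected host, and again on a host where both work, would show whether the newer build pulls in a library that is missing or different there.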
- Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~