well that's not good.
Could you try running the 9.0.6 condor_gpu_discovery with
condor_gpu_discovery -verbose -diag
and send me the results?
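
In case it helps with collecting the results, here is a minimal Python sketch that runs the command, captures stdout and stderr together, and reports whether the process exited normally or was killed by a signal. The helper name `run_and_report` and the default binary path are assumptions for illustration; adjust the path to wherever condor_gpu_discovery is installed on your node.

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd; return (combined stdout+stderr text, description of how it ended)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout so ordering is preserved
        text=True,
    )
    if proc.returncode < 0:
        # On POSIX a negative return code means the child was killed by that
        # signal, e.g. -11 for SIGSEGV.
        status = "killed by " + signal.Signals(-proc.returncode).name
    else:
        status = "exit code %d" % proc.returncode
    return proc.stdout, status


if __name__ == "__main__":
    # Assumed default path; pass a different command on the command line if needed.
    default_cmd = ["/usr/libexec/condor/condor_gpu_discovery", "-verbose", "-diag"]
    try:
        output, status = run_and_report(sys.argv[1:] or default_cmd)
    except FileNotFoundError as err:
        print("could not run command: %s" % err)
    else:
        print(output, end="")
        print("--- " + status)
```

Redirecting the script's output to a file gives a single log that includes whatever diagnostics were printed before a crash, plus the terminating signal if there was one.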
thanks
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Carles Acosta <cacosta@xxxxxx>
Sent: Tuesday, September 28, 2021 11:53 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] GPUs not detected in 9.0.6 version

Hi Stuart, TJ,
Thank you for your replies.
Regarding the CUDA version, we did not update it. We are still using the older CUDA 10.1.243 for these GPUs.
We did some testing as TJ suggested. We are currently running HTCondor 9.0.6, but with the condor_gpu_discovery binary from HTCondor 9.0.5, and the GPU is discovered correctly:
# condor_status slot2@xxxxxxxxxxxx -af Gpus DetectedGpus CondorVersion
1 GPU-c659279d $CondorVersion: 9.0.6 Sep 23 2021 BuildID: 557184 PackageID: 9.0.6-1 $

Thus, the problem is related to the new condor_gpu_discovery binary in version 9.0.6. In fact:
[root@gpu03 ~]# /usr/libexec/condor/condor_gpu_discovery-9.0.5
DetectedGPUs="GPU-c659279d"
[root@gpu03 ~]# /usr/libexec/condor/condor_gpu_discovery-9.0.6
Segmentation fault

Sep 29 06:47:09 gpu03 kernel: condor_gpu_disc[22684]: segfault at 0 ip (null) sp 00007ffda4fe0088 error 14 in condor_gpu_discovery-9.0.6[400000+17000]
Thank you very much.
Cheers,
Carles
On Tue, 28 Sept 2021 at 23:24, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://legal.ifae.es