
[HTCondor-users] condor_gpu_discovery failing



Hello Experts,

condor_gpu_discovery is failing, although all of its shared-library dependencies resolve:

# ldd /usr/libexec/condor/condor_gpu_discovery
    linux-vdso.so.1 (0x00007ffc7edd1000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fb71ee00000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fb71f0be000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fb71ea00000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fb71ed25000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fb71f111000)


# /usr/libexec/condor/condor_gpu_discovery
Error: cuInit returned 802
DetectedGPUs=0

This machine has 8 H200 SXM GPUs, and PyTorch detects all of them:

>>> import torch; print(torch.cuda.device_count())
8

nvidia-smi works without any issue.

NVIDIA-SMI 570.124.06    Driver Version: 570.124.06    CUDA Version: 12.8
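
To check whether the failure is specific to condor_gpu_discovery, the same cuInit call can be issued directly against the driver library, outside of HTCondor. Below is a minimal sketch, assuming libcuda.so.1 is on the loader path; the helper name probe_cuinit is hypothetical and not part of HTCondor or CUDA. It also asks the driver for the symbolic name of the result code via cuGetErrorName (as far as I can tell from cuda.h, 802 is CUDA_ERROR_SYSTEM_NOT_READY).

```python
import ctypes


def probe_cuinit(libname="libcuda.so.1"):
    """Call cuInit(0) directly and return (code, name).

    Returns None if the driver library cannot be loaded on this host.
    probe_cuinit is a hypothetical helper, used here for illustration only.
    """
    try:
        cuda = ctypes.CDLL(libname)
    except OSError:
        return None  # driver library not found on the loader path

    # CUresult cuInit(unsigned int Flags)
    cuda.cuInit.argtypes = [ctypes.c_uint]
    cuda.cuInit.restype = ctypes.c_int
    code = cuda.cuInit(0)

    # CUresult cuGetErrorName(CUresult error, const char **pStr)
    cuda.cuGetErrorName.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
    cuda.cuGetErrorName.restype = ctypes.c_int
    name = ctypes.c_char_p()
    if cuda.cuGetErrorName(code, ctypes.byref(name)) == 0 and name.value:
        return code, name.value.decode()
    return code, "UNKNOWN"


if __name__ == "__main__":
    print(probe_cuinit())
```

If this also reports 802 when run as the same user and in the same environment as the condor daemons, the problem would seem to lie below HTCondor, in the driver setup, rather than in condor_gpu_discovery itself.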


Any input is highly appreciated.


Thanks & Regards,
Vikrant Aggarwal