I think the easiest way to adapt the code to work with any CUDA version is to use the CUDA driver API instead of the runtime API. The driver API appears to be much more stable across versions. As such, I think you would be better off removing your symlink for cudart.dll and adding the registry key, so that gpu_discovery gets the runtime version from the registry. I am working on a fix for gpu_discovery and expect to have something I can send you to try out in a day or so. If it works for you, then we can roll the fix into a future stable release.
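
For illustration only, here is a minimal sketch of what querying the driver API directly could look like. It is written in Python with ctypes rather than in the C++ of condor_gpu_discovery, and it is not the planned fix. The helper names (decode_cuda_version, load_driver, query_devices) are made up for this example. It relies on the driver library being installed as nvcuda.dll on Windows or libcuda.so.1 on Linux, and it returns None when no driver or device is available. Note that cuDriverGetVersion reports the CUDA version supported by the driver, which is not necessarily the installed runtime version.

```python
import ctypes
import sys


def decode_cuda_version(v):
    """Split the integer from cuDriverGetVersion (1000*major + 10*minor)."""
    return v // 1000, (v % 1000) // 10


def load_driver():
    """Return a handle to the CUDA driver library, or None if it is missing."""
    try:
        if sys.platform == "win32":
            # The driver API uses the stdcall convention on Windows.
            return ctypes.WinDLL("nvcuda.dll")
        return ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None


def query_devices():
    """Report driver version and per-device memory without touching cudart."""
    cuda = load_driver()
    if cuda is None or cuda.cuInit(0) != 0:
        return None

    version = ctypes.c_int(0)
    if cuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return None

    count = ctypes.c_int(0)
    if cuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None

    devices = []
    for ordinal in range(count.value):
        dev = ctypes.c_int(0)        # CUdevice is a plain int handle
        total = ctypes.c_size_t(0)   # size in bytes, no struct layout involved
        if cuda.cuDeviceGet(ctypes.byref(dev), ordinal) != 0:
            continue
        if cuda.cuDeviceTotalMem_v2(ctypes.byref(total), dev) != 0:
            continue
        devices.append({
            "ordinal": ordinal,
            "global_memory_mb": total.value // (1024 * 1024),
        })

    return {
        "driver_version": decode_cuda_version(version.value),
        "devices": devices,
    }


if __name__ == "__main__":
    print(query_devices())
```

Because the memory size comes back through a single size_t rather than through a field of cudaDeviceProp, this approach does not depend on the layout of that structure, which is what changed between CUDA versions.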
-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Jens Schmaler

Ok, I dug a bit deeper into the code of condor_gpu_discovery. I believe the issue is that it tries to load “cudart.dll” specifically, while the real name of that DLL on my system is “cudart64_100.dll” (and will be different for each CUDA version). Creating a symlink solved this issue, so that condor_gpu_discovery now also reports the correct runtime version 10.0. This symlink was not created by the CUDA installer, for me at least.

The reported memory is now totally off (CUDA0GlobalMemoryMb=4977051853851), in line with your finding that CUDA 10 seems to expect a different cudaDeviceProp structure from the one that you have used. Indeed, when comparing your code to the latest CUDA headers, the struct has changed quite a bit. I am not sure how to properly adapt your code so that it would work with any CUDA version. Ideas, anyone?

Thanks and best regards,
Jens

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of John M Knoeller

This is what the key in question looks like for me since I updated to CUDA 10:

> reg query "HKLM\software\NVIDIA Corporation\GPU Computing Toolkit\CUDA"

HKEY_LOCAL_MACHINE\software\NVIDIA Corporation\GPU Computing Toolkit\CUDA
    FirstVersionInstalled    REG_SZ    v10.0

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Jens Schmaler

Thanks for clarifying! I have the DLL in the same location as you, but the installer does not seem to have set the registry key. Which version of Windows are you running? We have this issue on Win 10 and Win 2016. I will now try to set the key manually and check whether it works then.

Best,
Jens

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx]
On Behalf Of John M Knoeller

So I did some digging, and you were correct. condor_gpu_discovery on Windows does try to access this registry key

"SOFTWARE\\NVIDIA Corporation\\GPU Computing Toolkit\\CUDA"

when it cannot find cudart.dll. We don’t expect this code to execute most of the time, but it is there.

This key is created by the NVIDIA CUDA Toolkit installer for Windows. I upgraded my workstation to v10 from here:

https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10

and it updated the key

HKEY_LOCAL_MACHINE\software\NVIDIA Corporation\GPU Computing Toolkit\CUDA

so that it now says v10.0.

On my workstation, the CUDA runtime is installed in this directory:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin

I’m curious as to where it is installed on your Windows machines, if not there.
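
As a rough sketch of the lookup described above (not the actual condor_gpu_discovery code, which is C++), the Python below reads the FirstVersionInstalled value from the key, derives the toolkit bin directory from it, and searches for the runtime DLL by pattern instead of by the fixed name cudart.dll. The function names are hypothetical. The default root directory and the cudart64_*.dll pattern are assumptions based on the paths and file name mentioned in this thread, and may not hold for every install.

```python
import glob
import os
import re

CUDA_KEY = r"SOFTWARE\NVIDIA Corporation\GPU Computing Toolkit\CUDA"
DEFAULT_ROOT = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA"


def parse_toolkit_version(value):
    """Turn a registry string such as 'v10.0' into (10, 0); None if malformed."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)", value.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def read_first_version_installed():
    """Read FirstVersionInstalled from HKLM; None off Windows or if unset."""
    try:
        import winreg  # only available on Windows
    except ImportError:
        return None
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CUDA_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "FirstVersionInstalled")
            return value
    except OSError:
        # Key or value missing, e.g. the installer did not create it.
        return None


def toolkit_bin_dir(version, root=DEFAULT_ROOT):
    """Build the bin directory for a (major, minor) version tuple."""
    major, minor = version
    return "{}\\v{}.{}\\bin".format(root, major, minor)


def find_cudart(bin_dir):
    """List versioned runtime DLLs (e.g. cudart64_100.dll) in a directory."""
    return sorted(glob.glob(os.path.join(bin_dir, "cudart64_*.dll")))


if __name__ == "__main__":
    raw = read_first_version_installed()
    version = parse_toolkit_version(raw) if raw else None
    if version is None:
        print("CUDA toolkit registry key not found")
    else:
        print(version, find_cudart(toolkit_bin_dir(version)))
```

Matching on a pattern avoids the need for a cudart.dll symlink, but it only locates the file; it does not make the cudaDeviceProp layout compatible across versions.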
-tj