Hi all,
After upgrading some nodes to 24.02, jobs on those nodes no longer
see any CUDA devices. I used this Python script for testing:
####
import torch
print("CUDA available:", torch.cuda.is_available())
print("Number of GPUs:", torch.cuda.device_count())
print("CUDA version:", torch.version.cuda)
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No GPUs available.")
####
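To narrow this down, it may help to also record which GPU-related environment variables the job actually sees. The following is a minimal sketch, not part of the original test. The variable names in `GPU_ENV_VARS` are an assumption about what HTCondor and the NVIDIA stack typically export, and the helper names `gpu_environment` and `report` are hypothetical.

```python
import os

# Assumption: these are the variables that usually carry the GPU assignment
# into a job. Which ones are actually set depends on the pool configuration.
GPU_ENV_VARS = (
    "CUDA_VISIBLE_DEVICES",
    "NVIDIA_VISIBLE_DEVICES",
    "GPU_DEVICE_ORDINAL",
    "_CONDOR_AssignedGPUs",
)


def gpu_environment(environ=None):
    """Return a dict mapping each variable name to its value, or None if unset."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in GPU_ENV_VARS}


def report(environ=None):
    """Return one human-readable line per variable."""
    lines = []
    for name, value in gpu_environment(environ).items():
        if value is None:
            lines.append(f"{name} is not set")
        else:
            lines.append(f"{name} = {value!r}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Running this inside a job on both an upgraded and a non-upgraded node would show whether the difference is already visible in the environment, before PyTorch is involved.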
The job's output:
####
CUDA available: False
Number of GPUs: 0
CUDA version: 12.4
No GPUs available.
#####
On a node running 23.0.18, I get this:
####
CUDA available: True
Number of GPUs: 1
CUDA version: 12.4
GPU 0: NVIDIA L40S
#####
The GPU-related ClassAd attributes look correct:
####
condor_status -l node1.local |grep 'CUDA\|GPU'
CUDACapability = 8.9
CUDAClockMhz = 2175.0
CUDAComputeUnits = 48
CUDACoresPerCU = 128
CUDADeviceName = "NVIDIA RTX 4000 Ada Generation"
CUDADriverVersion = 12.7
CUDAECCEnabled = false
CUDAGlobalMemoryMb = 20044
CUDAMaxSupportedVersion = 12070
AssignedGPUs = "GPU-d845e39a,GPU-cac060cf,GPU-8307a4c3,GPU-15356487"
AvailableGPUs = { }
ChildGPUs = { 1,1,1,1 }
DetectedGPUs = "GPU-d845e39a, GPU-cac060cf, GPU-8307a4c3, GPU-15356487"
DeviceGPUsAverageUsage = 0.0
DeviceGPUsMemoryPeakUsage = 435.0
GPU_15356487DevicePciBusId = "0000:82:00.0"
GPU_15356487DeviceUuid = "15356487-cd63-2969-0588-8c5d0192e4c2"
GPU_8307a4c3DevicePciBusId = "0000:81:00.0"
GPU_8307a4c3DeviceUuid = "8307a4c3-719f-f745-9f30-7d16b4601038"
GPU_cac060cfDevicePciBusId = "0000:04:00.0"
GPU_cac060cfDeviceUuid = "cac060cf-9cae-9ed1-c308-50c0d20ed969"
GPU_d845e39aDevicePciBusId = "0000:03:00.0"
GPU_d845e39aDeviceUuid = "d845e39a-b663-7071-869a-d79a921b291a"
GPURAM = 20475
GPUs = 0
GPUsMemoryUsage = undefined
MachineResources = "Cpus Memory Disk Swap GPUs"
StartOfJobUptimeGPUsSeconds = 0.0
TotalGPUs = 4
TotalSlotGPUs = 4
UptimeGPUsSecondsAverageUsage = (UptimeGPUsSeconds - StartOfJobUptimeGPUsSeconds) / (LastUpdateUptimeGPUsSeconds - FirstUpdateUptimeGPUsSeconds)
WithinResourceLimits = (MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs =?= undefined || MY.GPUs >= TARGET.RequestGPUs))
#####
My configuration looks like this:
#####
use feature: GPUs
GPU_DISCOVERY_EXTRA = -extra -not-nested
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = void
#####
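Since `ENVIRONMENT_VALUE_FOR_UnAssignedGPUs` is set to `void` here, a job that was not assigned a GPU should see that placeholder instead of a device list. The sketch below classifies the value a job sees. It assumes the assignment is exported through `CUDA_VISIBLE_DEVICES`, which is the usual default but is not confirmed for this pool. The function name `classify_cuda_visible_devices` is hypothetical.

```python
import os

# Assumption: matches ENVIRONMENT_VALUE_FOR_UnAssignedGPUs in the configuration.
UNASSIGNED_SENTINEL = "void"


def classify_cuda_visible_devices(value, sentinel=UNASSIGNED_SENTINEL):
    """Classify the CUDA_VISIBLE_DEVICES value seen by a job.

    Returns one of:
      "unset"      - the variable is not in the environment at all
      "empty"      - the variable is set but blank
      "unassigned" - the variable holds the configured placeholder
      "assigned"   - the variable holds something else, presumably device IDs
    """
    if value is None:
        return "unset"
    stripped = value.strip()
    if stripped == "":
        return "empty"
    if stripped == sentinel:
        return "unassigned"
    return "assigned"


if __name__ == "__main__":
    value = os.environ.get("CUDA_VISIBLE_DEVICES")
    print("CUDA_VISIBLE_DEVICES:", repr(value), "->",
          classify_cuda_visible_devices(value))
```

A result of `unassigned` would suggest that the slot running the job had no GPU assigned, while `assigned` combined with `torch.cuda.is_available()` returning `False` would point more toward the driver or device visibility inside the job.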
Do you have any suggestions?
Thanks for your help and best regards,
Daniel