Re: [HTCondor-users] CUDA not available when using pytorch running at 24.x nodes

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Hi All,

After upgrading some nodes to 24.02, no CUDA Devices will be available. I used this python skript for testing:

####

import torch

print("CUDA available:", torch.cuda.is_available())
print("Number of GPUs:", torch.cuda.device_count())
print("CUDA version:", torch.version.cuda)

if torch.cuda.is_available():
ÂÂÂ for i in range(torch.cuda.device_count()):
ÂÂÂÂÂÂÂ print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
ÂÂÂ print("No GPUs available.")
####

The job's output:

####

CUDA available: False
Number of GPUs: 0
CUDA version: 12.4
No GPUs available.

#####

When using a node running 23.0.18 I got this:

####
CUDA available: True
Number of GPUs: 1
CUDA version: 12.4
GPU 0: NVIDIA L40S
#####

ClassAds are set correctly:

####
condor_status -l node1.local |grep 'CUDA\|GPU'

CUDACapability = 8.9
CUDAClockMhz = 2175.0
CUDAComputeUnits = 48
CUDACoresPerCU = 128
CUDADeviceName = "NVIDIA RTX 4000 Ada Generation"
CUDADriverVersion = 12.7
CUDAECCEnabled = false
CUDAGlobalMemoryMb = 20044
CUDAMaxSupportedVersion = 12070
AssignedGPUs = "GPU-d845e39a,GPU-cac060cf,GPU-8307a4c3,GPU-15356487"
AvailableGPUs = {Â }
ChildGPUs = { 1,1,1,1 }
DetectedGPUs = "GPU-d845e39a, GPU-cac060cf, GPU-8307a4c3, GPU-15356487"
DeviceGPUsAverageUsage = 0.0
DeviceGPUsMemoryPeakUsage = 435.0
GPU_15356487DevicePciBusId = "0000:82:00.0"
GPU_15356487DeviceUuid = "15356487-cd63-2969-0588-8c5d0192e4c2"
GPU_8307a4c3DevicePciBusId = "0000:81:00.0"
GPU_8307a4c3DeviceUuid = "8307a4c3-719f-f745-9f30-7d16b4601038"
GPU_cac060cfDevicePciBusId = "0000:04:00.0"
GPU_cac060cfDeviceUuid = "cac060cf-9cae-9ed1-c308-50c0d20ed969"
GPU_d845e39aDevicePciBusId = "0000:03:00.0"
GPU_d845e39aDeviceUuid = "d845e39a-b663-7071-869a-d79a921b291a"
GPURAM = 20475
GPUs = 0
GPUsMemoryUsage = undefined
MachineResources = "Cpus Memory Disk Swap GPUs"
StartOfJobUptimeGPUsSeconds = 0.0
TotalGPUs = 4
TotalSlotGPUs = 4
UptimeGPUsSecondsAverageUsage = (UptimeGPUsSeconds - StartOfJobUptimeGPUsSeconds) / (LastUpdateUptimeGPUsSeconds - FirstUpdateUptimeGPUsSeconds)
WithinResourceLimits = (MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs =?= undefined || MY.GPUs >= TARGET.RequestGPUs))

#####

My configuration looks like:

#####

use feature: GPUs
GPU_DISCOVERY_EXTRA = -extra -not-nested
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = void

#####

Do you have any suggestions?

Thanks for your help and best regards,

Daniel

Mailing List Archives

Authenticated access

Re: [HTCondor-users] CUDA not available when using pytorch running at 24.x nodes