
[HTCondor-users] CUDA not available when using pytorch running at 24.x nodes



Hi All,

After upgrading some nodes to 24.02, no CUDA devices are available to jobs running on them. I used this Python script for testing:


####

import torch

print("CUDA available:", torch.cuda.is_available())
print("Number of GPUs:", torch.cuda.device_count())
print("CUDA version:", torch.version.cuda)

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No GPUs available.")
####

The job's output:

####

CUDA available: False
Number of GPUs: 0
CUDA version: 12.4
No GPUs available.

#####

On a node still running 23.0.18, the same job prints this:

####
CUDA available: True
Number of GPUs: 1
CUDA version: 12.4
GPU 0: NVIDIA L40S
#####

ClassAds are set correctly:

####
condor_status -l node1.local |grep 'CUDA\|GPU'

CUDACapability = 8.9
CUDAClockMhz = 2175.0
CUDAComputeUnits = 48
CUDACoresPerCU = 128
CUDADeviceName = "NVIDIA RTX 4000 Ada Generation"
CUDADriverVersion = 12.7
CUDAECCEnabled = false
CUDAGlobalMemoryMb = 20044
CUDAMaxSupportedVersion = 12070
AssignedGPUs = "GPU-d845e39a,GPU-cac060cf,GPU-8307a4c3,GPU-15356487"
AvailableGPUs = { }
ChildGPUs = { 1,1,1,1 }
DetectedGPUs = "GPU-d845e39a, GPU-cac060cf, GPU-8307a4c3, GPU-15356487"
DeviceGPUsAverageUsage = 0.0
DeviceGPUsMemoryPeakUsage = 435.0
GPU_15356487DevicePciBusId = "0000:82:00.0"
GPU_15356487DeviceUuid = "15356487-cd63-2969-0588-8c5d0192e4c2"
GPU_8307a4c3DevicePciBusId = "0000:81:00.0"
GPU_8307a4c3DeviceUuid = "8307a4c3-719f-f745-9f30-7d16b4601038"
GPU_cac060cfDevicePciBusId = "0000:04:00.0"
GPU_cac060cfDeviceUuid = "cac060cf-9cae-9ed1-c308-50c0d20ed969"
GPU_d845e39aDevicePciBusId = "0000:03:00.0"
GPU_d845e39aDeviceUuid = "d845e39a-b663-7071-869a-d79a921b291a"
GPURAM = 20475
GPUs = 0
GPUsMemoryUsage = undefined
MachineResources = "Cpus Memory Disk Swap GPUs"
StartOfJobUptimeGPUsSeconds = 0.0
TotalGPUs = 4
TotalSlotGPUs = 4
UptimeGPUsSecondsAverageUsage = (UptimeGPUsSeconds - StartOfJobUptimeGPUsSeconds) / (LastUpdateUptimeGPUsSeconds - FirstUpdateUptimeGPUsSeconds)
WithinResourceLimits = (MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs =?= undefined || MY.GPUs >= TARGET.RequestGPUs))

#####

My configuration looks like this:

#####

use feature: GPUs
GPU_DISCOVERY_EXTRA = -extra -not-nested
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = void

#####
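
Since the GPUs assigned to a slot are normally passed to the job through environment variables, one further check would be to print what the job itself sees on the 24.x and 23.0.18 nodes. The following is only a minimal sketch: the list of variable names is an assumption about what may be relevant here (which of them a given HTCondor version actually sets should be verified), and the helper names gpu_environment and describe are made up for illustration.

```python
import os

# Variables that commonly influence GPU visibility inside a job.
# Assumption: which of these are set depends on the HTCondor version
# and configuration, so a missing entry is itself useful information.
GPU_ENV_VARS = (
    "CUDA_VISIBLE_DEVICES",
    "NVIDIA_VISIBLE_DEVICES",
    "GPU_DEVICE_ORDINAL",
    "_CONDOR_AssignedGPUs",
)


def gpu_environment(env=None):
    """Return a dict mapping each variable name to its value, or None if unset."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in GPU_ENV_VARS}


def describe(name, value):
    """Format one variable so that unset and empty values are distinguishable."""
    if value is None:
        return f"{name} is not set"
    if value == "":
        return f"{name} is set but empty"
    return f"{name} = {value}"


if __name__ == "__main__":
    for name, value in gpu_environment().items():
        print(describe(name, value))
```

Running this as the job executable on both node types and comparing the two outputs should show whether the job receives a different device list after the upgrade, for example a value that the CUDA runtime does not recognise as a device.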

Do you have any suggestions?

Thanks for your help and best regards,

Daniel



  
