Hi Cole,
Thanks for your fast reply.
In my jobfile I was requesting at least one GPU using
"request_gpus = 1". The same job works fine when it is negotiated
to a node running 23.0.18. I started this test script after all my
colleagues had problems with their pytorch environments.
My jobfile:
####
cmd = /work/scratch/condor/check_cuda.sh
args =
initialdir = /work/scratch/condor
output = check_cuda_out.log
error = check_cuda_err.log
log = check_cuda.log
request_memory = 1GB
request_gpus = 1
request_cpus = 1
should_transfer_files = ALWAYS
run_as_owner = True
load_profile = True
stream_error = true
stream_output = true
request_disk = 5GB
requirements = TARGET.Machine=="pc176"
queue 1
###
output using condor 23.0.18:
CUDA available: True
Number of GPUs: 1
CUDA version: 12.4
GPU 0: NVIDIA GeForce GTX 1660 SUPER
output using condor 24.0.2:
CUDA available: False
Number of GPUs: 0
CUDA version: 12.4
No GPUs available.
Btw., running "nvidia-smi" in the slot leads to
"No devices were found".
Are there any changes to how the node's configuration file should
provide GPUs?
Best,
Daniel
Hi Daniel,
If your test script is being run as an HTCondor job, you may not see any available GPUs if you did not request GPUs for the job. Greg added some Linux eBPF black magic so that the job only sees the GPUs that HTCondor provisions for it.
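One quick way to cross-check from inside the job is to print the GPU-related environment variables next to what PyTorch reports. Something along these lines (the variable names are the ones I'd expect HTCondor to set for GPU jobs, so treat this as a sketch and adjust for your pool):
####
import os
import torch

# Environment variables HTCondor typically sets for GPU jobs (assumed names;
# they may differ depending on the pool's configuration).
for var in ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES", "GPU_DEVICE_ORDINAL"):
    print(f"{var} = {os.environ.get(var)}")

# What PyTorch actually sees inside the slot.
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())
####
If those variables are set but the device count is still 0, the problem is more likely in how the devices end up being exposed to the job than in the request itself.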
Cheers,
Cole
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Daniel Brückner <daniel.brueckner@xxxxxxxxxxxxxxxxxx>
Sent: Monday, December 16, 2024 8:11 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] CUDA not available when using pytorch running at 24.x nodes
Hi All,
After upgrading some nodes to 24.0.2, no CUDA devices are available. I used this Python script for testing:
####
import torch
print("CUDA available:", torch.cuda.is_available())
print("Number of GPUs:", torch.cuda.device_count())
print("CUDA version:", torch.version.cuda)
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No GPUs available.")
####
The job's output:
####
CUDA available: False
Number of GPUs: 0
CUDA version: 12.4
No GPUs available.
#####
When using a node running 23.0.18 I got this:
####
CUDA available: True
Number of GPUs: 1
CUDA version: 12.4
GPU 0: NVIDIA L40S
#####
ClassAds are set correctly:
####
condor_status -l node1.local | grep 'CUDA\|GPU'
CUDACapability = 8.9
CUDAClockMhz = 2175.0
CUDAComputeUnits = 48
CUDACoresPerCU = 128
CUDADeviceName = "NVIDIA RTX 4000 Ada Generation"
CUDADriverVersion = 12.7
CUDAECCEnabled = false
CUDAGlobalMemoryMb = 20044
CUDAMaxSupportedVersion = 12070
AssignedGPUs = "GPU-d845e39a,GPU-cac060cf,GPU-8307a4c3,GPU-15356487"
AvailableGPUs = { }
ChildGPUs = { 1,1,1,1 }
DetectedGPUs = "GPU-d845e39a, GPU-cac060cf, GPU-8307a4c3, GPU-15356487"
DeviceGPUsAverageUsage = 0.0
DeviceGPUsMemoryPeakUsage = 435.0
GPU_15356487DevicePciBusId = "0000:82:00.0"
GPU_15356487DeviceUuid = "15356487-cd63-2969-0588-8c5d0192e4c2"
GPU_8307a4c3DevicePciBusId = "0000:81:00.0"
GPU_8307a4c3DeviceUuid = "8307a4c3-719f-f745-9f30-7d16b4601038"
GPU_cac060cfDevicePciBusId = "0000:04:00.0"
GPU_cac060cfDeviceUuid = "cac060cf-9cae-9ed1-c308-50c0d20ed969"
GPU_d845e39aDevicePciBusId = "0000:03:00.0"
GPU_d845e39aDeviceUuid = "d845e39a-b663-7071-869a-d79a921b291a"
GPURAM = 20475
GPUs = 0
GPUsMemoryUsage = undefined
MachineResources = "Cpus Memory Disk Swap GPUs"
StartOfJobUptimeGPUsSeconds = 0.0
TotalGPUs = 4
TotalSlotGPUs = 4
UptimeGPUsSecondsAverageUsage = (UptimeGPUsSeconds - StartOfJobUptimeGPUsSeconds) / (LastUpdateUptimeGPUsSeconds - FirstUpdateUptimeGPUsSeconds)
WithinResourceLimits = (MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs =?= undefined || MY.GPUs >= TARGET.RequestGPUs))
#####
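In case it helps, this is how I would cross-check the per-slot GPU attributes via the Python bindings (just a sketch, assuming the bindings are installed on the submit machine):
####
import htcondor

# Query the startd ads for the node and print the GPU-related attributes per slot,
# so the partitionable slot can be compared with the dynamic slots.
coll = htcondor.Collector()
ads = coll.query(
    htcondor.AdTypes.Startd,
    constraint='Machine == "node1.local"',
    projection=["Name", "SlotType", "GPUs", "TotalSlotGPUs", "AssignedGPUs", "AvailableGPUs"],
)
for ad in ads:
    print(ad.get("Name"), ad.get("SlotType"), ad.get("GPUs"),
          ad.get("AssignedGPUs"), ad.get("AvailableGPUs"))
####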
My configuration looks like:
#####
use feature: GPUs
GPU_DISCOVERY_EXTRA = -extra -not-nested
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = void
#####
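To rule out detection itself, I could also run the same discovery tool the startd uses directly on the node, with the options from GPU_DISCOVERY_EXTRA above (the libexec path below is an assumption for our install and may need adjusting):
####
import subprocess

# Run condor_gpu_discovery with the same extra options the startd is configured with.
# The binary path is assumed (typical location on RPM-based installs).
result = subprocess.run(
    ["/usr/libexec/condor/condor_gpu_discovery", "-extra", "-not-nested"],
    capture_output=True,
    text=True,
)
print(result.stdout)
print(result.stderr)
####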
Do you have any suggestions?
Thanks for your help and best regards,
Daniel