Glad we figured out the issue, and glad it was not the eBPF GPU hiding code.
Hi Cole.
On 16.12.2024 at 18:27, Cole Bollig via HTCondor-users wrote:
Hi Daniel,
HTCondor is 'correctly' setting the assigned GPUs into the job's environment (I put correctly in quotes because I am assuming that the assigned GPU is one that was discovered on that EP). I think the issue is either with how torch discovers GPUs, or with the eBPF black-magic functionality that Greg added to hide GPUs not assigned to the job, which may not be working correctly.
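A quick way to see what the Starter actually set is to dump the GPU-related variables from inside the job; a minimal sketch (the variable names are the ones HTCondor puts into the job environment):
####
import os

# Variables the Starter sets for the GPUs assigned to this job
for var in ("CUDA_VISIBLE_DEVICES", "GPU_DEVICE_ORDINAL", "_CONDOR_AssignedGPUs"):
    print(f"{var}: {os.environ.get(var, '<unset>')}")
####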
I don't think it's a problem with pytorch, as the command ‘nvidia-smi’ doesn't recognise an assigned graphics card either.
#####
condor_submit --interactive check_cuda.tbi
Submitting job(s).
1 job(s) submitted to cluster 121938.
Waiting for job to start...
Warning: No xauth data; using fake authentication data for X11 forwarding.
Welcome to slot1_1@node10!
bruckner@node10:/work/local/condor/dir_12598$ nvidia-smi
No devices were found
######
Do you have the ability to change the configuration of the v24.0.2 EP? If yes, try setting STARTER_HIDE_GPU_DEVICES = False on one and submit the test job to target that EP. No reconfig or restart is necessary because the new Starter will pick up the config value.
Please let me know if turning that off resolves the issue because then there may be a bug in the code.
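For example, in a local config file on that EP (the config.d file name here is just illustrative):
####
# e.g. /etc/condor/config.d/99-gpu-debug.conf
STARTER_HIDE_GPU_DEVICES = False
####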
I changed the configuration and the job starts. nvidia-smi now shows all GPU cards.
####
condor_submit --interactive check_cuda.tbi
Submitting job(s).
1 job(s) submitted to cluster 121936.
Waiting for job to start...
Warning: No xauth data; using fake authentication data for X11 forwarding.
Welcome to slot1_1@node10!
bruckner@node10:/work/local/condor/dir_8367$ nvidia-smi
Tue Dec 17 10:16:26 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:4A:00.0 Off |                    0 |
| N/A   23C    P8             32W /  350W |       4MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00000000:61:00.0 Off |                    0 |
| N/A   23C    P8             31W /  350W |       4MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L40S                    Off |   00000000:CA:00.0 Off |                    0 |
| N/A   26C    P8             33W /  350W |       4MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L40S                    Off |   00000000:E1:00.0 Off |                    0 |
| N/A   25C    P8             32W /  350W |       4MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
#####
But pytorch only sees the assigned GPUs (as expected, since CUDA_VISIBLE_DEVICES was set correctly):
####
bruckner@node10:/work/local/condor/dir_8367$ python /work/scratch/bruckner/cuda_test.py
CUDA available: True
Number of GPUs: 1
CUDA version: 12.4
GPU 0: NVIDIA L40S
####
-Cole Bollig
PS - I have been told to inform you that the -not-nested flag for condor_gpu_discovery is going to go away in the future and does not work with heterogeneous GPU hosts. We recommend switching away from that option.
I removed this flag and tested. What shall I say? This was the solution.
After removing "STARTER_HIDE_GPU_DEVICES = False" again, nvidia-smi no longer shows the hidden GPU cards:
####
bruckner@node10:/work/local/condor/dir_10697$ nvidia-smi
Tue Dec 17 10:22:56 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:4A:00.0 Off |                    0 |
| N/A   23C    P8             32W /  350W |       4MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
####
bruckner@node10:/work/local/condor/dir_8367$ python /work/scratch/bruckner/cuda_test.py
CUDA available: True
Number of GPUs: 1
CUDA version: 12.4
GPU 0: NVIDIA L40S
####
Conclusion:
The obsolete -not-nested flag for condor_gpu_discovery caused this problem. Removing this parameter solved my problem. Thank you for your help, Cole.
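For the record, the GPU section of my configuration now reads (assuming only the -not-nested flag had to go):
####
use feature: GPUs
GPU_DISCOVERY_EXTRA = -extra
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = void
####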
Best,
Daniel
Hi Cole,
Yes of course.
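I appended a dump of the full environment to the test script, roughly like this:
####
import os

# print every environment variable in "KEY: value" form
for key, value in os.environ.items():
    print(f"{key}: {value}")
####
Output from the 24.0.2 worker node: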
####
CUDA available: False
Number of GPUs: 0
CUDA version: 12.4
No GPUs available.
NVIDIA_VISIBLE_DEVICES: none
SINGULARITY_CACHEDIR: /work/local/condor//dir_25027
_CONDOR_CHIRP_CONFIG: /work/local/condor//dir_25027/.chirp.config
_CONDOR_JOB_IWD: /work/local/condor//dir_25027
_CONDOR_MACHINE_AD: /work/local/condor//dir_25027/.machine.ad
ROOT_MAX_THREADS: 1
NUMEXPR_NUM_THREADS: 1
CUBACORES: 1
_CHIRP_DELAYED_UPDATE_PREFIX: Chirp*
OMP_THREAD_LIMIT: 1
PWD: /work/local/condor/dir_25027
_CONDOR_ANCESTOR_25027: 25285:1734365992:986306514
_CONDOR_JOB_AD: /work/local/condor//dir_25027/.job.ad
PYTHON_CPU_COUNT: 1
OPENBLAS_NUM_THREADS: 1
_CONDOR_BIN: /usr/bin
TMPDIR: /tmp
_CONDOR_SCRATCH_DIR: /work/local/condor//dir_25027
CUDA_VISIBLE_DEVICES: GPU-5a1299ef
_CONDOR_ANCESTOR_4065110: 4065312:1734359176:2624350342
_CONDOR_JOB_PIDS:
GPU_DEVICE_ORDINAL: GPU-5a1299ef
TEMP: /tmp
GOMAXPROCS: 1
SHLVL: 1
TF_NUM_THREADS: 1
APPTAINER_CACHEDIR: /work/local/condor//dir_25027
BATCH_SYSTEM: HTCondor
_CONDOR_AssignedGPUs: GPU-5a1299ef
_CONDOR_SLOT: slot1_1
TF_LOOP_PARALLEL_ITERATIONS: 1
OMP_NUM_THREADS: 1
TMP: /tmp
JULIA_NUM_THREADS: 1
MKL_NUM_THREADS: 1
_CONDOR_ANCESTOR_4065312: 25027:1734365991:3121861327
_: /work/scratch/bruckner/miniconda3/bin/python
LC_CTYPE: C.UTF-8
####
Output from a 23.0.18 worker node (same job, same python environment):
####
CUDA available: True
Number of GPUs: 1
CUDA version: 12.4
GPU 0: NVIDIA TITAN X (Pascal)
NVIDIA_VISIBLE_DEVICES: none
_CONDOR_CHIRP_CONFIG: /var/lib/condor/execute/dir_1233120/.chirp.config
_CONDOR_JOB_IWD: /var/lib/condor/execute/dir_1233120
_CONDOR_MACHINE_AD: /var/lib/condor/execute/dir_1233120/.machine.ad
_CONDOR_ANCESTOR_4103540: 4103713:1734319037:1677449644
ROOT_MAX_THREADS: 1
NUMEXPR_NUM_THREADS: 1
CUBACORES: 1
_CHIRP_DELAYED_UPDATE_PREFIX: Chirp*
OMP_THREAD_LIMIT: 1
PWD: /var/lib/condor/execute/dir_1233120
_CONDOR_JOB_AD: /var/lib/condor/execute/dir_1233120/.job.ad
OPENBLAS_NUM_THREADS: 1
_CONDOR_BIN: /usr/bin
_CONDOR_ANCESTOR_4103713: 1233120:1734366314:1013061074
TMPDIR: /tmp
_CONDOR_SCRATCH_DIR: /var/lib/condor/execute/dir_1233120
CUDA_VISIBLE_DEVICES: GPU-9eb5c1b7
_CONDOR_JOB_PIDS:
GPU_DEVICE_ORDINAL: GPU-9eb5c1b7
_CONDOR_ANCESTOR_1233120: 1233161:1734366315:3534835719
TEMP: /tmp
GOMAXPROCS: 1
SHLVL: 1
TF_NUM_THREADS: 1
BATCH_SYSTEM: HTCondor
_CONDOR_AssignedGPUs: GPU-9eb5c1b7
_CONDOR_SLOT: slot1_1
TF_LOOP_PARALLEL_ITERATIONS: 1
OMP_NUM_THREADS: 1
TMP: /tmp
JULIA_NUM_THREADS: 1
MKL_NUM_THREADS: 1
_: /work/scratch/bruckner/miniconda3/bin/python
LC_CTYPE: C.UTF-8
CUDA_MODULE_LOADING: LAZY
#####
Best,
Daniel
On 16.12.2024 at 17:05, Cole Bollig via HTCondor-users wrote:
Hi Daniel,
Could you modify your test script to dump all of the environment variables and share the output?
-Cole Bollig
Hi Cole,
Thanks for your fast reply.
In my jobfile I was requesting one GPU using "request_gpus = 1". The same job, matched to a node running 23.0.18, works fine. I started this test script after all my colleagues had problems with their pytorch environments.
my jobfile:
####
cmd = /work/scratch/condor/check_cuda.sh
args =
initialdir = /work/scratch/condor
output = check_cuda_out.log
error = check_cuda_err.log
log = check_cuda.log
request_memory = 1GB
request_gpus = 1
request_cpus = 1
should_transfer_files = ALWAYS
run_as_owner = True
load_profile = True
stream_error = true
stream_output = true
request_disk = 5GB
requirements = TARGET.Machine=="pc176"
queue 1
###
output using condor 23.0.18:
CUDA available: True
Number of GPUs: 1
CUDA version: 12.4
GPU 0: NVIDIA GeForce GTX 1660 SUPER
output using condor 24.0.2:
CUDA available: False
Number of GPUs: 0
CUDA version: 12.4
No GPUs available.
btw.
running "nvidia-smi" in the slot leads to
"No devices were found"
Are there any changes in the node's configuration regarding how GPUs are provided?
Best,
Daniel
On 16.12.2024 at 15:21, Cole Bollig via HTCondor-users wrote:
Hi Daniel,
If your test script is being run as an HTCondor job, you may not see any available GPUs if you did not request GPUs for the job. Greg added some Linux eBPF black magic to make it so the job only sees the GPUs that HTCondor provisions for it.
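In other words, the submit file needs a GPU request before any device becomes visible; a minimal sketch (file names are illustrative):
####
executable = check_cuda.sh
request_gpus = 1
request_cpus = 1
request_memory = 1GB
queue
####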
Cheers,
Cole
Hi All,
After upgrading some nodes to 24.0.2, no CUDA devices are available. I used this Python script for testing:
####
import torch

print("CUDA available:", torch.cuda.is_available())
print("Number of GPUs:", torch.cuda.device_count())
print("CUDA version:", torch.version.cuda)
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No GPUs available.")
####
The job's output:
####
CUDA available: False
Number of GPUs: 0
CUDA version: 12.4
No GPUs available.
#####
When using a node running 23.0.18 I got this:
####
CUDA available: True
Number of GPUs: 1
CUDA version: 12.4
GPU 0: NVIDIA L40S
#####
ClassAds are set correctly:
####
condor_status -l node1.local |grep 'CUDA\|GPU'
CUDACapability = 8.9
CUDAClockMhz = 2175.0
CUDAComputeUnits = 48
CUDACoresPerCU = 128
CUDADeviceName = "NVIDIA RTX 4000 Ada Generation"
CUDADriverVersion = 12.7
CUDAECCEnabled = false
CUDAGlobalMemoryMb = 20044
CUDAMaxSupportedVersion = 12070
AssignedGPUs = "GPU-d845e39a,GPU-cac060cf,GPU-8307a4c3,GPU-15356487"
AvailableGPUs = { }
ChildGPUs = { 1,1,1,1 }
DetectedGPUs = "GPU-d845e39a, GPU-cac060cf, GPU-8307a4c3, GPU-15356487"
DeviceGPUsAverageUsage = 0.0
DeviceGPUsMemoryPeakUsage = 435.0
GPU_15356487DevicePciBusId = "0000:82:00.0"
GPU_15356487DeviceUuid = "15356487-cd63-2969-0588-8c5d0192e4c2"
GPU_8307a4c3DevicePciBusId = "0000:81:00.0"
GPU_8307a4c3DeviceUuid = "8307a4c3-719f-f745-9f30-7d16b4601038"
GPU_cac060cfDevicePciBusId = "0000:04:00.0"
GPU_cac060cfDeviceUuid = "cac060cf-9cae-9ed1-c308-50c0d20ed969"
GPU_d845e39aDevicePciBusId = "0000:03:00.0"
GPU_d845e39aDeviceUuid = "d845e39a-b663-7071-869a-d79a921b291a"
GPURAM = 20475
GPUs = 0
GPUsMemoryUsage = undefined
MachineResources = "Cpus Memory Disk Swap GPUs"
StartOfJobUptimeGPUsSeconds = 0.0
TotalGPUs = 4
TotalSlotGPUs = 4
UptimeGPUsSecondsAverageUsage = (UptimeGPUsSeconds - StartOfJobUptimeGPUsSeconds) / (LastUpdateUptimeGPUsSeconds - FirstUpdateUptimeGPUsSeconds)
WithinResourceLimits = (MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus && MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory && MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk && (TARGET.RequestGPUs =?= undefined || MY.GPUs >= TARGET.RequestGPUs))
#####
My configuration looks like:
#####
use feature: GPUs
GPU_DISCOVERY_EXTRA = -extra -not-nested
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = void
#####
Do you have any suggestions?
Thanks for your help and best regards,
Daniel
--
Daniel Brückner
IT Business Manager
LFB - Lehrstuhl für Bildverarbeitung
RWTH Aachen University
Kopernikusstr. 16
D-52074 Aachen
Tel: +49 241 80-27850
Fax: +49 241 80-627850
daniel.brueckner@xxxxxxxxxxxxxxxxxx
www.lfb.rwth-aachen.de