Hi Daniel,
HTCondor is 'correctly' setting the assigned GPUs into the job's environment (I put correctly in quotes because I am assuming the assigned GPU is one that was discovered on that EP). I think the issue is either with how torch discovers GPUs or with the
black-magic eBPF functionality that Greg added to disable GPUs not assigned to the job, which may not be functioning correctly. Do you have the ability to change the configuration of the v24.0.2 EP? If so, try setting STARTER_HIDE_GPU_DEVICES = False on one and submit the
test job to target that EP. No reconfig or restart is necessary because the new Starter will pick up the config value. Please let me know whether turning that off resolves the issue, because if it does there may be a bug in the code.
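For reference, a quick way to compare what the Starter handed to the job with what torch actually sees is a small diagnostic script along these lines. This is only a sketch under a couple of assumptions: that HTCondor sets CUDA_VISIBLE_DEVICES (and possibly other GPU-related variables, depending on your configuration) in the job environment, and that torch is importable in the job's Python environment.

#!/usr/bin/env python3
# Diagnostic sketch: compare the GPU environment HTCondor set up
# with what torch actually reports inside the job.
import os

# Print the GPU-related environment variables the Starter handed to the job.
# (Assumption: CUDA_VISIBLE_DEVICES and similar GPU/CUDA variables are present.)
for key, value in sorted(os.environ.items()):
    if "GPU" in key.upper() or "CUDA" in key.upper():
        print(f"{key}={value}")

# Now ask torch what it can see.
import torch

print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"device {i}: {torch.cuda.get_device_name(i)}")

If the environment shows an assigned device but torch reports CUDA as unavailable only while STARTER_HIDE_GPU_DEVICES is left at its default, that would point at the device-hiding path rather than at torch itself.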
-Cole Bollig
PS - I have been told to inform you that the -no-nested flag for condor_gpu_discovery is going to go away in a future release and does not work with heterogeneous GPU hosts. We recommend switching away from that option.
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Daniel Brückner <daniel.brueckner@xxxxxxxxxxxxxxxxxx>
Sent: Monday, December 16, 2024 10:27 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] CUDA not available when using pytorch running at 24.x nodes

Hi Cole,
Yes of course.
Output from a 23.0.18 worker node (same job, same python environment):
On 16.12.2024 at 17:05, Cole Bollig via HTCondor-users wrote:
--
Daniel Brückner
IT Business Manager
LFB - Lehrstuhl für Bildverarbeitung
RWTH Aachen University
Kopernikusstr. 16
D-52074 Aachen
Tel: +49 241 80-27850
Fax: +49 241 80-627850
daniel.brueckner@xxxxxxxxxxxxxxxxxx
www.lfb.rwth-aachen.de