Hi Vikrant,
I can't say for certain whether the discrepancy in the memory size is due to condor or the CUDA libraries. I am more inclined to think that the CUDA libraries are at fault, since condor_gpu_discovery just takes the memory in bytes and converts it
into MB.
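To illustrate why a small difference in the reported byte count would show up as a different MB value, here is a minimal sketch. It assumes the conversion is an integer division by 1024*1024; I have not confirmed that this is exactly what condor_gpu_discovery does. The function name and the byte counts are hypothetical, chosen only to reproduce the 9971 and 9969 values from your output.

```python
# Sketch only: assumes bytes -> MB is floor division by 1024 * 1024.
BYTES_PER_MB = 1024 * 1024

def bytes_to_mb(total_bytes: int) -> int:
    """Convert a byte count to whole MB, discarding any remainder."""
    return total_bytes // BYTES_PER_MB

# Hypothetical byte counts a CUDA library might report for two MIGs.
mig_a_bytes = 10455351296  # exactly 9971 * 1024 * 1024
mig_b_bytes = 10453254144  # exactly 9969 * 1024 * 1024

print(bytes_to_mb(mig_a_bytes))  # 9971
print(bytes_to_mb(mig_b_bytes))  # 9969
```

Under that assumption, the MB value only differs if the library itself reports a different byte count for that MIG.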
As for running the command, note that using both -repeat and -divide is pointless, since the last one specified dictates the behavior. In addition, -divide is just -repeat with the GlobalMemoryMb divided by the number of repeats. This repetition of GPU
devices is shown in the DetectedGpus list. Your grep may be hiding information from you. Finally, you should be able to divide or repeat by any integer value greater than one.
-Cole Bollig
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Thursday, January 11, 2024 2:40 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Condor version 9.0.17 GPU repeat/divide with GPU MIGs

Hello Experts,
We have an NVIDIA H100 GPU on a machine and created 7 MIGs on this GPU.
- The following value is the same for 6 MIGs, but one MIG has a different value, 9969MB:
GlobalMemoryMb=9971
The nvidia-smi command shows the same value, 9984 MiB, for all MIGs.
Is this a condor or CUDA library issue?
- I am using the following command to divide the MIG further. It shows a GlobalMemoryMb smaller than DeviceMemoryMb. Shouldn't I expect to see two devices, each with 4985MB of memory? Also, I can't increase the value of repeat and divide (to something like 4). I don't need
it, but is there a reason behind it?
# `condor_config_val LIBEXEC`/condor_gpu_discovery -extra -repeat 2 -divide 2 | grep MIG_ea8bzzz
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceMemoryMb=9971
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceUuid="MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxGlobalMemoryMb=4985

It only shows the DeviceMemoryMb; GlobalMemoryMb disappeared:
# `condor_config_val LIBEXEC`/condor_gpu_discovery -extra -repeat 4 -divide 4 | grep MIG_ea8bzzz
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceMemoryMb=9971
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceUuid="MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"

Without divide and repeat:
# `condor_config_val LIBEXEC`/condor_gpu_discovery -extra | grep MIG_ea8bzzz
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceUuid="MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxGlobalMemoryMb=9971

Thanks & Regards,
Vikrant Aggarwal