Hello Experts,
We have an NVIDIA H100 GPU on a machine and have created 7 MIG instances on it.
- The following value is the same for 6 of the MIG instances, but one instance reports a different value, 9969 MB:
GlobalMemoryMb=9971
The nvidia-smi command shows the same value, 9984 MiB, for all MIG instances.
Is this a condor or CUDA library issue?
- I am using the following command to divide a MIG instance further. It shows GlobalMemoryMb smaller than DeviceMemoryMb. Shouldn't I expect to see two devices, each with 4985 MB of memory? Also, I can't increase the values of -repeat and -divide (to something like 4). I don't need that, but is there a reason behind it?
# `condor_config_val LIBEXEC`/condor_gpu_discovery -extra -repeat 2 -divide 2 | grep MIG_ea8bzzz
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceMemoryMb=9971
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceUuid="MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxGlobalMemoryMb=4985
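For reference, the arithmetic behind my expectation is sketched below in Python. It assumes (my guess, not something I have confirmed in the HTCondor source) that -divide N reports each slice's memory as the device memory divided by N, rounded down. The helper name `expected_share_mb` is hypothetical.

```python
def expected_share_mb(device_memory_mb: int, divide: int) -> int:
    """Assumed rule: each slice gets floor(device_memory_mb / divide) MB."""
    if divide < 1:
        raise ValueError("divide must be >= 1")
    return device_memory_mb // divide


# 9971 is the DeviceMemoryMb reported for this MIG instance.
print(expected_share_mb(9971, 2))  # 4985, the GlobalMemoryMb reported above
print(expected_share_mb(9971, 4))  # 2492
```

If that assumption holds, the 4985 figure is simply half of 9971 rounded down, so the remaining question is only why I do not see two separate devices listed.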
With -repeat 4 -divide 4, only DeviceMemoryMb is shown; GlobalMemoryMb disappears:
# `condor_config_val LIBEXEC`/condor_gpu_discovery -extra -repeat 4 -divide 4 | grep MIG_ea8bzzz
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceMemoryMb=9971
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceUuid="MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"
Without divide and repeat:
# `condor_config_val LIBEXEC`/condor_gpu_discovery -extra | grep MIG_ea8bzzz
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceUuid="MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxGlobalMemoryMb=9971
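In case it helps anyone reproduce the first observation, here is a rough Python sketch for spotting the MIG instance whose GlobalMemoryMb differs from the rest. It assumes the attribute lines have the form `<device prefix><AttrName>=<value>` as in the output above; the function names and the device names in `sample` are placeholders I made up, and only the two memory values (9971 and 9969) come from my machine.

```python
import re
from collections import Counter

# Assumed line shape, based on the output above: "<device prefix><AttrName>=<value>"
LINE_RE = re.compile(
    r"^(?P<dev>.+?)(?P<attr>GlobalMemoryMb|DeviceMemoryMb)=(?P<val>\d+)\s*$"
)


def memory_by_device(text, attr="GlobalMemoryMb"):
    """Return {device prefix: value in MB} for the chosen attribute."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group("attr") == attr:
            result[m.group("dev")] = int(m.group("val"))
    return result


def outliers(values):
    """Return the devices whose value differs from the most common one."""
    if not values:
        return {}
    common, _ = Counter(values.values()).most_common(1)[0]
    return {dev: mb for dev, mb in values.items() if mb != common}


# Placeholder device names (hypothetical); the values are the two I observe.
sample = """\
MIG_aaaaGlobalMemoryMb=9971
MIG_bbbbGlobalMemoryMb=9971
MIG_ccccGlobalMemoryMb=9969
MIG_ccccDeviceUuid="MIG_cccc"
"""

print(outliers(memory_by_device(sample)))  # {'MIG_cccc': 9969}
```

To use it on a real machine, the text passed to `memory_by_device` would be the saved output of the condor_gpu_discovery -extra command shown above.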
Thanks & Regards,
Vikrant Aggarwal