What does running
condor_gpu_discovery -properties -extra
show on that node? What about
condor_gpu_discovery -properties -extra -not-nested
I notice you are using the -not-nested argument. The new submit keywords for GPU matchmaking, like
gpus_minimum_memory = 0.1, require that the GPU properties be nested. Those new submit keywords
also have a known bug in the version of HTCondor you are using, and should not be used before 23.7.
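For reference, a minimal sketch of the configuration change this implies, assuming the rest of the GPU settings quoted below stay as they are: drop -not-nested from GPU_DISCOVERY_EXTRA so that discovery reports nested GPU properties. The startd would most likely need a restart to re-run discovery. This alone may not make gpus_minimum_memory match on a version before 23.7, given the known bug.

```
# Assumption: only this knob changes; it previously read "-extra -not-nested"
GPU_DISCOVERY_EXTRA = -extra
```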
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Tuesday, April 16, 2024 8:12 AM
To: HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] HTCondor not picking up GPU memory?
Our system does not seem to pick up GPU memory, e.g.

Name                        ST  User  GPUs  GPU-Memory  GPU-Name
slot1@xxxxxxxxxxxxxxxxxxx   Ui  _     1                 Tesla T4
slot1@xxxxxxxxxxxxxxxx      Ui  _     1                 Tesla T4
slot1@xxxxxxxxxxxxxxxxxx    Ui  _     1                 Tesla T4
slot1@xxxxxxxxxxxxxxxxxx    Ui  _     4                 NVIDIA A100-SXM4-40GB
slot1@xxxxxxxxxxxxxxxxxxxx  Ui  _     1                 Tesla T4
slot1@xxxxxxxxxxxxxxxxxx    Ui  _     1                 Tesla T4
slot1@xxxxxxxxxxxxxxxxxxx   Ui  _     1                 Tesla T4
slot1@xxxxxxxxxxxxxxxxxxx   Ui  _     1                 Tesla T4
slot1@xxxxxxxxxxxxxxxxx     Ui  _     1                 Tesla T4

and adding a gpus_minimum_memory = 0.1 results in no matches.

ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = 10000
GPU_DISCOVERY_EXTRA = -extra -not-nested
GPU_MONITOR = $(LIBEXEC)/condor_gpu_utilization
MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties $(GPU_DISCOVERY_EXTRA)
STARTD_CRON_GPUs_MONITOR_CONDITION = TotalGPUs > 0
STARTD_CRON_GPUs_MONITOR_EXECUTABLE = $(GPU_MONITOR)
STARTD_CRON_GPUs_MONITOR_METRICS = SUM:GPUs, PEAK:GPUsMemory
STARTD_CRON_GPUs_MONITOR_MODE = WaitForExit
STARTD_CRON_GPUs_MONITOR_PERIOD = 300
STARTD_CRON_JOBLIST = GPUs_MONITOR
STARTD_DETECT_GPUS = -properties $(GPU_DISCOVERY_EXTRA)
STARTD_JOB_ATTRS = GPUsUsage GPUsMemoryUsage
STARTER_HIDE_GPU_DEVICES = true