What does
condor_gpu_discovery -properties -extra
show on that node? What about
condor_gpu_discovery -properties -extra -not-nested
I notice you are using the -not-nested argument. The new submit keywords for GPU matchmaking, like
gpus_minimum_memory = 0.1, require that the GPU properties be nested. Note, however, that those new submit keywords have a known
bug in the version of HTCondor you are using and should not be used before 23.7.
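As a rough sketch of the nesting point above, and assuming nothing else in your configuration depends on the flat (non-nested) attributes, the discovery options from your dump would simply lose the -not-nested flag:

```
# Sketch only: drop -not-nested so GPU properties are advertised in nested form.
# GPU_DISCOVERY_EXTRA is the knob already shown in the condor_config_val dump.
GPU_DISCOVERY_EXTRA = -extra
```

The startd would most likely need a restart, not just a reconfig, before the change shows up in the slot ads.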
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of
Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Tuesday, April 16, 2024 8:12 AM
To: HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] HTCondor not picking up GPU memory?
Our system does not seem to pick up GPU memory, e.g.
condor_status --gpus
Name                       ST User GPUs GPU-Memory GPU-Name
slot1@xxxxxxxxxxxxxxxxxxx  Ui _       1            Tesla T4
slot1@xxxxxxxxxxxxxxxx     Ui _       1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxx   Ui _       1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxx   Ui _       4            NVIDIA A100-SXM4-40GB
slot1@xxxxxxxxxxxxxxxxxxxx Ui _       1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxx   Ui _       1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxxx  Ui _       1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxxx  Ui _       1            Tesla T4
slot1@xxxxxxxxxxxxxxxxx    Ui _       1            Tesla T4
and adding a gpus_minimum_memory = 0.1 results in no matches.
We're using "use feature : GPUs", and condor_config_val -dump | grep GPU shows
ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)// CUDA_VISIBLE_DEVICES=/CUDA//
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = 10000
GPU_DISCOVERY_EXTRA = -extra -not-nested
GPU_MONITOR = $(LIBEXEC)/condor_gpu_utilization
MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties $(GPU_DISCOVERY_EXTRA)
STARTD_CRON_GPUs_MONITOR_CONDITION = TotalGPUs > 0
STARTD_CRON_GPUs_MONITOR_EXECUTABLE = $(GPU_MONITOR)
STARTD_CRON_GPUs_MONITOR_METRICS = SUM:GPUs, PEAK:GPUsMemory
STARTD_CRON_GPUs_MONITOR_MODE = WaitForExit
STARTD_CRON_GPUs_MONITOR_PERIOD = 300
STARTD_CRON_JOBLIST = GPUs_MONITOR
STARTD_DETECT_GPUS = -properties $(GPU_DISCOVERY_EXTRA)
STARTD_JOB_ATTRS = GPUsUsage GPUsMemoryUsage
STARTER_HIDE_GPU_DEVICES = true
HTCondor version 23.5.2-1 running on Ubuntu 20.04 servers.