
Re: [HTCondor-users] HTCondor not picking up GPU memory?



What does running

   condor_gpu_discovery  -properties -extra

show on that node? And what about

   condor_gpu_discovery  -properties -extra -not-nested


I notice you are using the -not-nested argument. The new submit keywords for GPU matchmaking, like gpus_minimum_memory = 0.1, require that the GPU properties be nested. Note, though, that those new submit keywords have a known bug in the version of HTCondor you are using, and should not be used before 23.7.
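
If nested GPU properties are wanted, one way to get them is to drop -not-nested from the discovery arguments. A minimal sketch, assuming the override goes in a local config file on the execute node (the location is site-specific) and that nothing else in the setup depends on the flat attribute names:

   # Assumption: this overrides the current setting,
   #   GPU_DISCOVERY_EXTRA = -extra -not-nested
   # so that condor_gpu_discovery emits nested GPU properties.
   GPU_DISCOVERY_EXTRA = -extra

GPU discovery runs when the startd starts, so the startd probably needs a restart, not just a reconfig, before the slot ads change. Afterwards, condor_status -gpus should show whether GPU-Memory is populated.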

-tj



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Weatherby,Gerard <gweatherby@xxxxxxxx>
Sent: Tuesday, April 16, 2024 8:12 AM
To: HTCondor Users <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] HTCondor not picking up GPU memory?
 

Our system does not seem to pick up GPU memory. e.g.

condor_status --gpus

Name                        ST User                GPUs GPU-Memory GPU-Name
slot1@xxxxxxxxxxxxxxxxxxx   Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxx      Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxx    Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxx    Ui _                      4            NVIDIA A100-SXM4-40GB
slot1@xxxxxxxxxxxxxxxxxxxx  Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxx    Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxxx   Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxxxxx   Ui _                      1            Tesla T4
slot1@xxxxxxxxxxxxxxxxx     Ui _                      1            Tesla T4

 

and adding gpus_minimum_memory = 0.1 to a submit file results in no matches.


We’re using use feature :GPUs, and condor_config_val -dump | grep GPU shows

ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)//  CUDA_VISIBLE_DEVICES=/CUDA//
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = 10000
GPU_DISCOVERY_EXTRA = -extra -not-nested
GPU_MONITOR = $(LIBEXEC)/condor_gpu_utilization
MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery  -properties $(GPU_DISCOVERY_EXTRA)
STARTD_CRON_GPUs_MONITOR_CONDITION = TotalGPUs > 0
STARTD_CRON_GPUs_MONITOR_EXECUTABLE = $(GPU_MONITOR)
STARTD_CRON_GPUs_MONITOR_METRICS = SUM:GPUs, PEAK:GPUsMemory
STARTD_CRON_GPUs_MONITOR_MODE = WaitForExit
STARTD_CRON_GPUs_MONITOR_PERIOD = 300
STARTD_CRON_JOBLIST =  GPUs_MONITOR
STARTD_DETECT_GPUS = -properties $(GPU_DISCOVERY_EXTRA)
STARTD_JOB_ATTRS =  GPUsUsage GPUsMemoryUsage
STARTER_HIDE_GPU_DEVICES = true

This is HTCondor version 23.5.2-1 running on Ubuntu 20.04 servers.