
[HTCondor-users] dynamic slots with gpus



Hello.

I've enabled dynamic slots on 2 machines:

use feature : GPUs
GPU_DISCOVERY_EXTRA = -extra
# dynamic slot config
cpu = 24
# 24 * 662
memory = 15888
disk = BIG
NUM_SLOTS = 1
NUM_SLOTS_TYPE = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
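
For comparison, a minimal and untested sketch of the same partitionable-slot setup using the knob names documented in the HTCondor manual (NUM_CPUS, MEMORY, and NUM_SLOTS_TYPE_1 with the _1 suffix). The values are copied from the config above; that the naming differences have any bearing on the behavior described here is only an assumption, and disk is left out so that the default applies.

```
use feature : GPUs
GPU_DISCOVERY_EXTRA = -extra

# Assumed equivalents of "cpu" and "memory" above (memory is in MiB)
NUM_CPUS = 24
MEMORY = 15888

# One partitionable slot that owns all of the machine's resources
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```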


One machine has 1 GPU, the other has 2, and condor_status -long bench5 shows the correct GPU info.

I submit with:

[…]
request_cpus = 1
request_memory = 800 MB
request_disk = 1 GB
request_gpus = 1
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
transfer_input_files =  enable_gpus.py, blender-3.5-splash.blend
queue 10


Output of condor_q -better-analyze 111.002:

The Requirements expression for job 111.002 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    (TARGET.GPUs >= RequestGPUs) && (TARGET.HasFileTransfer)

Job 111.002 defines the following attributes:

    RequestDisk = 1048576
    RequestGPUs = 1
    RequestMemory = 800

The Requirements expression for job 111.002 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           6  TARGET.Arch == "X86_64"
[1]           6  TARGET.OpSys == "LINUX"
[3]           6  TARGET.Disk >= RequestDisk
[5]           6  TARGET.Memory >= RequestMemory
[7]           2  TARGET.GPUs >= RequestGPUs


111.002:  Job is running.

Last successful match: Fri Oct  6 12:19:16 2023

111.002:  Run analysis summary ignoring user priority.  Of 6 machines,
      4 are rejected by your job's requirements
      0 reject your job because of their own requirements
      1 match and are already running your jobs
      0 match but are serving other users
      1 are able to run your job


Only one machine (the one with 1 GPU) matches and runs all the jobs, but I expected the machine with 2 GPUs to be split into 2 partitions and run 2 jobs with 1 GPU each.

Is there additional configuration needed for the 2-GPU machine? Why doesn't it run at least 1 job?

I tried request_gpus >= 1 in the submit file, but that's a syntax error.


Thanks,
JK