
Re: [HTCondor-users] repeat/divide GPU options not working on condor 23.8.1//23.9.6?



Hi Todd,

Thank you very much for checking this.

Here is the condor_status output for gpu03, currently running condor 23.10.1-1.el9:

# condor_status -cons 'Machine=="gpu03.pic.es"' -af:h Name SlotType GPUs DetectedGpus AssignedGpus JobStatus
Name                 SlotType      GPUs DetectedGpus               AssignedGpus JobStatus
slot1@xxxxxxxxxxxx   Partitionable 0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_1@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_2@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_3@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_4@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_5@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_6@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_7@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_8@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot2@xxxxxxxxxxxx   Partitionable 0    GPU-c659279d, GPU-c659279d GPU-c659279d undefined
slot2_1@xxxxxxxxxxxx Dynamic       1    GPU-c659279d, GPU-c659279d GPU-c659279d undefined

For comparison, here is a sibling machine, gpu02, still running condor 23.0.10-1.el9:

# condor_status -cons 'Machine=="gpu02.pic.es"' -af:h Name SlotType GPUs DetectedGpus AssignedGpus JobStatus
Name                 SlotType      GPUs DetectedGpus               AssignedGpus              JobStatus
slot1@xxxxxxxxxxxx   Partitionable 0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_1@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_2@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_3@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_4@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_5@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_6@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_7@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_8@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot2@xxxxxxxxxxxx   Partitionable 0    GPU-0f8a8574, GPU-0f8a8574 GPU-0f8a8574,GPU-0f8a8574 undefined
slot2_1@xxxxxxxxxxxx Dynamic       1    GPU-0f8a8574, GPU-0f8a8574 GPU-0f8a8574              undefined
slot2_2@xxxxxxxxxxxx Dynamic       1    GPU-0f8a8574, GPU-0f8a8574 GPU-0f8a8574              undefined

So, even though DetectedGpus lists two entries on both machines, slot2 on gpu03 shows only one GPU in AssignedGpus, while slot2 on gpu02 shows both. The slot definitions are:

[root@gpu03 ~]# condor_config_val SLOT_TYPE_1 SLOT_TYPE_2
cpus=8, gpus=0, auto
cpus=4, gpus=100%, auto
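
In case it helps when repeating the test, the comparison above can be done mechanically. This is only a small sketch in plain Python, not an HTCondor API: the helper names (count_gpu_ids, slots_with_mismatch) and the row labels "slot2@gpu03" / "slot2@gpu02" are made up for illustration, and the values are copied by hand from the slot2 rows of the two listings. It assumes a partitionable slot with gpus=100% is expected to have as many AssignedGpus entries as DetectedGpus entries, which is what gpu02 shows.

```python
def count_gpu_ids(value: str) -> int:
    """Count the comma-separated GPU ids in one attribute value."""
    value = value.strip()
    # condor_status prints "undefined" when the attribute is not set.
    if not value or value == "undefined":
        return 0
    return sum(1 for item in value.split(",") if item.strip())


def slots_with_mismatch(rows):
    """Return names of slots whose AssignedGpus count differs from DetectedGpus."""
    return [
        name
        for name, detected, assigned in rows
        if count_gpu_ids(detected) != count_gpu_ids(assigned)
    ]


# (label, DetectedGpus, AssignedGpus) for the partitionable slot2 on each host.
# Labels are illustrative; the hostnames are redacted in the listings.
ROWS = [
    ("slot2@gpu03", "GPU-c659279d, GPU-c659279d", "GPU-c659279d"),
    ("slot2@gpu02", "GPU-0f8a8574, GPU-0f8a8574", "GPU-0f8a8574,GPU-0f8a8574"),
]

print(slots_with_mismatch(ROWS))
```

With the values above, only the gpu03 row is reported. The check is not meaningful for slot1, where gpus=0 and AssignedGpus is undefined by design.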

I'm draining gpu03 so I can do more testing if needed.

Thank you again.

Cheers,

Carles

On Tue, 8 Oct 2024 at 23:49, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
Hi Carles,

Regarding the below:

While trying to reproduce the problem, we found and fixed a bug involving the "-divide" option to condor_gpu_discovery. Details on this bug fix are here:
   https://opensciencegrid.atlassian.net/browse/HTCONDOR-2669

However, we were not able to reproduce the core issue you describe below. Could you please re-run your test below, but this time use the following condor_status command which will give us more information:

   condor_status -cons 'Machine=="gpu03.pic.es"' -af:h Name SlotType GPUs DetectedGpus AssignedGpus JobStatus

Specifically, the above command will give us additional information about all of the slots on the server (not just slot2), which will allow us to better judge if there is a problem and/or how to reproduce it.

Thank you,
Todd


On 10/4/2024 12:44 AM, Carles Acosta wrote:
Hi again,

I have checked it with version 23.10.1 and the problem persists.

Cheers,

Carles

On Wed, 2 Oct 2024 at 13:52, Carles Acosta <cacosta@xxxxxx> wrote:
Hi,

I have a machine with 1 GPU, but we added the -repeat 2 -divide 2 options in GPU_DISCOVERY_EXTRA to offer 2 GPUs. This was running fine on 23.0.12 and up to 23.7.2.

# condor_status slot2@xxxxxxxxxxxx -af CondorVersion Gpus DetectedGpus
$CondorVersion: 23.0.12 2024-06-13 BuildID: 739441 PackageID: 23.0.12-1 $ 2 GPU-c659279d, GPU-c659279d
# condor_config_val GPU_DISCOVERY_EXTRA MACHINE_RESOURCE_INVENTORY_GPUs
-repeat 2 -divide 2
/usr/libexec/condor/condor_gpu_discovery -properties -repeat 2 -divide 2

However, after updating to 23.8.1 or 23.9.6, this no longer works.

# condor_status slot2@xxxxxxxxxxxx -af CondorVersion Gpus DetectedGpus
$CondorVersion: 23.8.1 2024-06-27 BuildID: 742100 PackageID: 23.8.1-1 GitSHA: 8cf018d1 $ 1 GPU-c659279d, GPU-c659279d
# condor_config_val GPU_DISCOVERY_EXTRA MACHINE_RESOURCE_INVENTORY_GPUs
-repeat 2 -divide 2
/usr/libexec/condor/condor_gpu_discovery -properties -repeat 2 -divide 2

DetectedGpus lists two GPUs, but the Gpus attribute reported by condor_status is only 1. I searched for information about the 23.8.1 release, but I could not find any change related to condor_gpu_discovery:


Is this a bug or does something new have to be added in the config for divide/repeat options to work again?
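
For anyone wanting to run the same check on other nodes, the two condor_status lines quoted in this message can be compared with a few lines of Python. This is a rough sketch, not part of HTCondor: parse_status_line is a hypothetical helper, and the regular expression assumes the line layout produced by "-af CondorVersion Gpus DetectedGpus" exactly as it appears in the two outputs quoted in this message (a "$CondorVersion: ... $" string, then the Gpus integer, then the DetectedGpus list).

```python
import re

# "$CondorVersion: <version> ... $ <Gpus> <DetectedGpus list>"
LINE_RE = re.compile(r"^\$CondorVersion:\s+(\S+)\s.*?\$\s+(\d+)\s+(.*)$")


def parse_status_line(line: str):
    """Split one output line into (version, Gpus, list of DetectedGpus ids)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected line format: {line!r}")
    version, gpus, detected = match.groups()
    if detected.strip() == "undefined":
        ids = []
    else:
        ids = [item.strip() for item in detected.split(",") if item.strip()]
    return version, int(gpus), ids


# The two lines quoted in this message.
LINES = [
    "$CondorVersion: 23.0.12 2024-06-13 BuildID: 739441 PackageID: 23.0.12-1 $ "
    "2 GPU-c659279d, GPU-c659279d",
    "$CondorVersion: 23.8.1 2024-06-27 BuildID: 742100 PackageID: 23.8.1-1 "
    "GitSHA: 8cf018d1 $ 1 GPU-c659279d, GPU-c659279d",
]

for line in LINES:
    version, gpus, ids = parse_status_line(line)
    status = "ok" if gpus == len(ids) else "MISMATCH"
    print(f"{version}: Gpus={gpus} DetectedGpus={len(ids)} {status}")
```

On these two lines, 23.0.12 reports matching counts and 23.8.1 reports Gpus=1 against two DetectedGpus entries.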

Thank you in advance.

Cheers,

Carles

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://legal.ifae.es


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685 




--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es