Thank you for the update, Todd! :) If you need any extra tests, just let me know.
Cheers,
Carles
On Wed, 9 Oct 2024 at 20:02, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
Hi Carles,
Just an update: I have now successfully reproduced your problem below on my laptop :) This is at least half the battle! Thank you for the detailed report. Hopefully I will have news soon regarding a fix; I will keep you posted here.
regards,
Todd
On 10/9/2024 12:11 AM, Carles Acosta wrote:
Hi Todd,
Thank you very much for checking this.
Here is the condor_status output for gpu03, which is currently running condor 23.10.1-1.el9:
# condor_status -cons 'Machine=="gpu03.pic.es"' -af:h Name SlotType GPUs DetectedGpus AssignedGpus JobStatus
Name                 SlotType      GPUs DetectedGpus               AssignedGpus JobStatus
slot1@xxxxxxxxxxxx   Partitionable 0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_1@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_2@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_3@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_4@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_5@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_6@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_7@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot1_8@xxxxxxxxxxxx Dynamic       0    GPU-c659279d, GPU-c659279d undefined    undefined
slot2@xxxxxxxxxxxx   Partitionable 0    GPU-c659279d, GPU-c659279d GPU-c659279d undefined
slot2_1@xxxxxxxxxxxx Dynamic       1    GPU-c659279d, GPU-c659279d GPU-c659279d undefined
For comparison, here is a sibling machine, gpu02, which is still running condor 23.0.10-1.el9:
# condor_status -cons 'Machine=="gpu02.pic.es"' -af:h Name SlotType GPUs DetectedGpus AssignedGpus JobStatus
Name                 SlotType      GPUs DetectedGpus               AssignedGpus              JobStatus
slot1@xxxxxxxxxxxx   Partitionable 0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_1@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_2@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_3@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_4@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_5@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_6@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_7@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot1_8@xxxxxxxxxxxx Dynamic       0    GPU-0f8a8574, GPU-0f8a8574 undefined                 undefined
slot2@xxxxxxxxxxxx   Partitionable 0    GPU-0f8a8574, GPU-0f8a8574 GPU-0f8a8574,GPU-0f8a8574 undefined
slot2_1@xxxxxxxxxxxx Dynamic       1    GPU-0f8a8574, GPU-0f8a8574 GPU-0f8a8574              undefined
slot2_2@xxxxxxxxxxxx Dynamic       1    GPU-0f8a8574, GPU-0f8a8574 GPU-0f8a8574              undefined
So, even though DetectedGpus lists two GPUs on both machines, slot2 on gpu03 has only one GPU in AssignedGpus, whereas slot2 on gpu02 has two. The slot definitions are:
[root@gpu03 ~]# condor_config_val SLOT_TYPE_1 SLOT_TYPE_2
cpus=8, gpus=0, auto
cpus=4, gpus=100%, auto
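
The comparison above can be automated with a minimal Python sketch. It assumes the DetectedGpus and AssignedGpus values have already been extracted as strings; the helper names count_ids and missing_gpus are hypothetical and are not part of HTCondor. The sample values are copied from the slot2 rows of the two condor_status outputs above.

```python
def count_ids(value: str) -> int:
    """Count comma-separated GPU IDs; 'undefined' or an empty string counts as zero."""
    value = value.strip()
    if not value or value == "undefined":
        return 0
    return len([item for item in value.split(",") if item.strip()])


def missing_gpus(detected: str, assigned: str) -> int:
    """Number of detected GPU entries that are absent from the slot's assignment."""
    return count_ids(detected) - count_ids(assigned)


# (DetectedGpus, AssignedGpus) for the partitionable slot2 on each machine,
# copied from the condor_status output in this message.
slots = {
    "slot2 on gpu03 (23.10.1)": ("GPU-c659279d, GPU-c659279d", "GPU-c659279d"),
    "slot2 on gpu02 (23.0.10)": ("GPU-0f8a8574, GPU-0f8a8574", "GPU-0f8a8574,GPU-0f8a8574"),
}

for name, (detected, assigned) in slots.items():
    print(f"{name}: {missing_gpus(detected, assigned)} GPU(s) missing from AssignedGpus")
```

A nonzero result for a partitionable slot defined with gpus=100% flags the symptom reported here; the check says nothing about the cause.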
I'm draining gpu03 so I can do more testing if needed.
Thank you again.
Cheers,
Carles
On Tue, 8 Oct 2024 at 23:49, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
Hi Carles,
Regarding the below:
While trying to reproduce the problem, we found and fixed a bug involving the "-divide" option to condor_gpu_discovery. Details on this bug fix are here:
https://opensciencegrid.atlassian.net/browse/HTCONDOR-2669
However, we were not able to reproduce the core issue you describe below. Could you please re-run your test below, but this time use the following condor_status command which will give us more information:
condor_status -cons 'Machine=="gpu03.pic.es"' -af:h Name SlotType GPUs DetectedGpus AssignedGpus JobStatus
Specifically, the above command will give us additional information about all of the slots on the server (not just slot2), which will allow us to better judge if there is a problem and/or how to reproduce it.
Thank you,
Todd
On 10/4/2024 12:44 AM, Carles Acosta wrote:
Hi again,
I have checked it with version 23.10.1 and the problem persists.
Cheers,
Carles
On Wed, 2 Oct 2024 at 13:52, Carles Acosta <cacosta@xxxxxx> wrote:
Hi,
I have a machine with 1 GPU, but we added the -repeat 2 -divide 2 options to GPU_DISCOVERY_EXTRA so that it offers 2 GPUs. This was running fine on 23.0.12 and on versions up to 23.7.2.
# condor_status slot2@xxxxxxxxxxxx -af CondorVersion Gpus DetectedGpus
$CondorVersion: 23.0.12 2024-06-13 BuildID: 739441 PackageID: 23.0.12-1 $ 2 GPU-c659279d, GPU-c659279d
# condor_config_val GPU_DISCOVERY_EXTRA MACHINE_RESOURCE_INVENTORY_GPUs
-repeat 2 -divide 2
/usr/libexec/condor/condor_gpu_discovery -properties -repeat 2 -divide 2
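
As a rough illustration of the intended result, the Python sketch below models only the visible effect seen in DetectedGpus: one physical GPU ID advertised twice. It is not how condor_gpu_discovery is implemented, the function name advertised_gpus is hypothetical, and the ordering of entries for machines with more than one physical GPU is an assumption.

```python
def advertised_gpus(physical_ids, copies):
    """Return the GPU ID list with every physical ID advertised `copies` times.

    Toy model of the outcome only; the ordering for several physical
    GPUs is an assumption, not taken from HTCondor.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return list(physical_ids) * copies


# One physical GPU offered as two, matching the DetectedGpus value above.
detected = advertised_gpus(["GPU-c659279d"], 2)
print(", ".join(detected))
```

With this configuration, the Gpus attribute of the slot is expected to equal the length of that list, which is 2 on 23.0.12.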
However, after updating to 23.8.1 or 23.9.6, this no longer works.
# condor_status slot2@xxxxxxxxxxxx -af CondorVersion Gpus DetectedGpus
$CondorVersion: 23.8.1 2024-06-27 BuildID: 742100 PackageID: 23.8.1-1 GitSHA: 8cf018d1 $ 1 GPU-c659279d, GPU-c659279d
# condor_config_val GPU_DISCOVERY_EXTRA MACHINE_RESOURCE_INVENTORY_GPUs
-repeat 2 -divide 2
/usr/libexec/condor/condor_gpu_discovery -properties -repeat 2 -divide 2
Two GPUs are detected, but condor_status reports only one in the Gpus attribute. I searched the information about the 23.8.1 release, but I could not find any change related to condor_gpu_discovery.
Is this a bug, or does something new need to be added to the configuration for the -divide/-repeat options to work again?
Thank you in advance.
Cheers,
Carles
--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://legal.ifae.es
_______________________________________________ HTCondor-users mailing list To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe You can also unsubscribe by visiting https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users The archives can be found at: https://lists.cs.wisc.edu/archive/htcondor-users/
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx>
University of Wisconsin-Madison
Center for High Throughput Computing
Department of Computer Sciences
1210 W. Dayton St. Rm #4257
Madison, WI 53706-1685
Phone: (608) 263-7132
Calendar: https://tinyurl.com/yd55mtgd