
Re: [HTCondor-users] repeat/divide GPU options not working on condor 23.8.1//23.9.6?



Hi Carles,

Regarding the below:

While trying to reproduce the problem, we found and fixed a bug involving the "-divide" option to condor_gpu_discovery.  Details on this bug fix are here:
   https://opensciencegrid.atlassian.net/browse/HTCONDOR-2669

However, we were not able to reproduce the core issue you describe below.  Could you please re-run the test below, this time with the following condor_status command, which will give us more information:

   condor_status -cons 'Machine=="gpu03.pic.es"' -af:h Name SlotType GPUs DetectedGpus AssignedGpus JobStatus

Specifically, the above command will give us additional information about all of the slots on the server (not just slot2), which will allow us to better judge if there is a problem and/or how to reproduce it. 

Thank you,
Todd


On 10/4/2024 12:44 AM, Carles Acosta wrote:
Hi again,

I have checked it with version 23.10.1 and the problem persists.

Cheers,

Carles

On Wed, 2 Oct 2024 at 13:52, Carles Acosta <cacosta@xxxxxx> wrote:
Hi,

I have a machine with 1 GPU, but we added the -repeat 2 -divide 2 options in GPU_DISCOVERY_EXTRA to offer 2 GPUs. This was running fine on 23.0.12 and up to 23.7.2.

# condor_status slot2@xxxxxxxxxxxx -af CondorVersion Gpus DetectedGpus
$CondorVersion: 23.0.12 2024-06-13 BuildID: 739441 PackageID: 23.0.12-1 $ 2 GPU-c659279d, GPU-c659279d
# condor_config_val GPU_DISCOVERY_EXTRA MACHINE_RESOURCE_INVENTORY_GPUs
-repeat 2 -divide 2
/usr/libexec/condor/condor_gpu_discovery  -properties -repeat 2 -divide 2

However, after updating to 23.8.1 or 23.9.6, this no longer works.

# condor_status slot2@xxxxxxxxxxxx -af CondorVersion Gpus DetectedGpus
$CondorVersion: 23.8.1 2024-06-27 BuildID: 742100 PackageID: 23.8.1-1 GitSHA: 8cf018d1 $ 1 GPU-c659279d, GPU-c659279d
# condor_config_val GPU_DISCOVERY_EXTRA MACHINE_RESOURCE_INVENTORY_GPUs
-repeat 2 -divide 2
/usr/libexec/condor/condor_gpu_discovery  -properties -repeat 2 -divide 2

Two GPUs are detected (DetectedGpus), but condor_status reports only one (Gpus = 1). I have been searching for information about the 23.8.1 release, but I could not find any change related to condor_gpu_discovery:


Is this a bug or does something new have to be added in the config for divide/repeat options to work again?
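
The comparison between the two condor_status outputs above can be sketched as a small check. This is only an illustration: the function name gpu_count_mismatch is hypothetical, and it assumes the Gpus and DetectedGpus values have already been extracted as strings from "condor_status -af Gpus DetectedGpus". It also assumes that on an otherwise idle slot the two should agree, as they did on 23.0.12.

```python
def gpu_count_mismatch(gpus_attr: str, detected_gpus_attr: str) -> bool:
    """Return True when Gpus differs from the number of DetectedGpus entries.

    Hypothetical helper: both arguments are the raw attribute values as
    printed by condor_status -af, e.g. "2" and "GPU-c659279d, GPU-c659279d".
    """
    advertised = int(gpus_attr.strip())
    # DetectedGpus is a comma-separated list of GPU identifiers.
    detected = [g.strip() for g in detected_gpus_attr.split(",") if g.strip()]
    return advertised != len(detected)


# Values copied from the two outputs in this message.
print(gpu_count_mismatch("2", "GPU-c659279d, GPU-c659279d"))  # 23.0.12 output
print(gpu_count_mismatch("1", "GPU-c659279d, GPU-c659279d"))  # 23.8.1 output
```

A mismatch on a single slot is only a hint, since GPUs assigned to other slots on the same machine could presumably also lower the advertised count.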

Thank you in advance.

Cheers,

Carles

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice:  http://legal.ifae.es





-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685 

