
Re: [HTCondor-users] Strange behavior with GPU match making in 24.0.1



It was added to condor_submit in 23.5, and it exhibited this exact partial failure mode until 23.8, when the STARTD d-slot (dynamic slot) creation code was changed to handle require_gpus expressions that reference job attributes.
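
As a rough illustration of the version boundary described above, the check below classifies an execute node by its STARTD version. It is only a sketch: the function name and return labels are hypothetical, and the 23.8 threshold is taken from this message alone, not from the release notes.

```python
def gpu_submit_support(startd_version: str) -> str:
    """Classify a STARTD version string such as "23.5.2" (hypothetical helper).

    Returns "ok" if dynamic slot creation is expected to handle the
    first-class GPU submit commands, and "partial" if jobs may match
    existing slots but fail when a new dynamic slot must be created.
    """
    # Only the major and minor components matter for this threshold.
    major, minor = (int(part) for part in startd_version.split(".")[:2])
    # Assumption from this thread: the STARTD-side fix landed in 23.8.
    if (major, minor) >= (23, 8):
        return "ok"
    return "partial"
```

For example, gpu_submit_support("23.5.2") gives "partial", while gpu_submit_support("24.0.1") gives "ok".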

-tj

From: Anderson, Stuart B. <sba@xxxxxxxxxxx>
Sent: Friday, November 8, 2024 2:51 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Strange behavior with GPU match making in 24.0.1
 

> On Nov 8, 2024, at 11:46 AM, John M Knoeller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
> A STARTD that is older than 24.0 does not handle the new first-class
> GPU submit commands like gpus_minimum_memory correctly, so jobs will
> match to slots that exist, but the STARTD will fail to create a new
> dynamic slot when the job is using one of those commands.
>
> The fix is to upgrade your execute nodes.

I thought this was added back in 23.5.x?

Thanks.

--
Stuart Anderson
sba@xxxxxxxxxxx