
Re: [HTCondor-users] refreshGSIProxy stopped working



Let me add:

back in December we updated our APs from 24.0.6 to 25.0.3
because, all of a sudden, all remote submissions were failing
with an error in the initial proxy delegation [1] whenever an AP was rebooted.
Most likely some system library involved in delegation had been updated.
We could not figure out the details, but Marco Mascheroni (rightly)
suggested that I reproduce the problem on the latest HTC before asking for help.
Since v25.0.3 fixed the submission problem, we updated all APs,
but at the time we did not check proxy renewal :-(

So rolling back is not an option: I tried it on an empty test AP
and got the same error as in [1].

Stefano

[1] details in https://github.com/dmwm/CRABServer/issues/9245

On 14/01/2026 15:07, Stefano Belforte wrote:
Dear HTCondor developers,

we have a critical problem with CRAB. We just noticed that all calls to the
python API refreshGSIProxy(), which of course were working "before", now fail,
raising htcondor.HTCondorIOError.

Because of the holidays, we only noticed today.

The call is made from a Docker image built with HTCondor 24.7.3, using the v1 API, which
had been working fine for us for years.
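A minimal sketch of the kind of renewal loop that makes this call, assuming the documented v1 signature Schedd.refreshGSIProxy(cluster, proc, proxy_filename, lifetime=-1). The helper name refresh_proxies and the list of (cluster, proc) pairs are hypothetical and are not CRAB's actual code; the schedd is passed in so the sketch can run against any object that has that method.

```python
from typing import Iterable, List, Tuple


def refresh_proxies(schedd, jobs: Iterable[Tuple[int, int]],
                    proxy_filename: str, lifetime: int = -1) -> List[Tuple[int, int, str]]:
    """Refresh the delegated proxy of every (cluster, proc) job in `jobs`.

    Hypothetical helper. `schedd` is any object with a v1-style
    refreshGSIProxy(cluster, proc, proxy_filename, lifetime) method, e.g. an
    htcondor.Schedd. Failures are collected instead of raised, so one bad job
    does not stop renewal for the rest.
    """
    failures = []
    for cluster, proc in jobs:
        try:
            # On success v1 returns the new proxy lifetime in seconds; unused here.
            schedd.refreshGSIProxy(cluster, proc, proxy_filename, lifetime)
        except Exception as exc:  # in the failure reported here: htcondor.HTCondorIOError
            failures.append((cluster, proc, f"{type(exc).__name__}: {exc}"))
    return failures
```

With the real bindings one would pass htcondor.Schedd() and the path of the renewed proxy; an empty return list means every refresh succeeded, while a non-empty one shows which jobs failed and with which exception type.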

We do not have logs going back more than a month on that server, but the error has been there since at least Dec 12, and we suspect it is due to updating the APs
from 24.7 to 25.0.3, which we did around that date.

Does this sound plausible to you?

We tried to use the v2 API in the code that calls refreshGSIProxy(), but

1. there appear to be bugs in the code [1]
2. after fixing those, the call raises an htcondor2_impl.HTCondorException with no further details

This means that our server is not able to refresh proxies on the APs. The initial proxy is valid for 7 days, and during the holidays most (all?) tasks completed in less than that, so
the problem went unnoticed. But clearly we cannot operate like this.

What suggestions do you have to get out of this situation?
IIUC, we cannot roll back the HTC version on the APs live, as we do for updates; we would need to fully drain the scheduler and re-install. Fully draining usually takes 1 month, but only 1 week now if we can't renew the X509 proxies. Still, we would rather avoid that!

Please... help!

Stefano

[1] https://github.com/htcondor/htcondor/blob/3ef80065b75a4a70a09e274c37cc1d27e5fb1e50/bindings/python/htcondor2/_schedd.py#L914-L926 : the initial self is missing from the argument list on line 914,
and on line 926 int(proxy) should be int(proc)
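To make the two fixes in [1] concrete, here is a minimal sketch of the corrected method shape. SketchSchedd is a hypothetical stand-in, not the actual htcondor2 code, and native_refresh is an assumed placeholder for whatever compiled call the real binding forwards to; only the parameter list and the int(proc) conversion reflect the bug report.

```python
from typing import Callable


class SketchSchedd:
    """Hypothetical stand-in for htcondor2.Schedd; NOT the real binding code.

    It only shows refreshGSIProxy() with the two fixes from the bug report
    applied. `native_refresh` is an assumed placeholder for the underlying
    compiled call.
    """

    def __init__(self, native_refresh: Callable[[int, int, str, int], int]):
        self._native_refresh = native_refresh

    # Fix 1: 'self' is the first parameter. Without it, Python binds the
    # instance to 'cluster' and shifts every other argument by one position.
    def refreshGSIProxy(self, cluster: int, proc: int, proxy_filename: str,
                        lifetime: int = -1) -> int:
        # Fix 2: the proc ID is converted with int(proc), not int(proxy).
        return self._native_refresh(int(cluster), int(proc),
                                    str(proxy_filename), int(lifetime))
```

With both changes the cluster, proc, proxy file and lifetime reach the underlying call in the right positions; whether the remaining htcondor2_impl.HTCondorException shares a root cause with the v1 HTCondorIOError is a separate question this sketch does not address.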