One common cause of schedd overload is many rapid condor_q queries. Do you see any user processes running "watch condor_q"?
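One way to look for such processes on a submit node is a plain ps/grep scan (a sketch assuming a standard Linux userland; the output format is illustrative):

```shell
# List any user processes running "watch condor_q".
# The bracketed pattern keeps grep from matching its own command line.
procs=$(ps -eo user,pid,etime,cmd | grep '[w]atch condor_q' || true)
if [ -n "$procs" ]; then
    echo "$procs"
else
    echo "no watch condor_q processes found"
fi
```

The same scan would also catch shell loops that poll condor_q tightly; adjust the grep pattern accordingly.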
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Ram Ban <ramban046@xxxxxxxxx>
Sent: Thursday, February 12, 2026 12:29 PM
To: Jaime Frey <jfrey@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Getting Authentication issues on Execute or random Jobs
Yes, I am seeing this number very high (above 0.95) on a lot of executors, but load and RAM are fine on those executors, and jobs from other submitters are running fine on the same executors.
Is there any way to fix this? If any submitter is stuck, it degrades my whole pipeline time.
Thanks and Regards
Raman
The errors are detected by the client at authentication time because that's the first thing that happens on a new connection. But if you totally disable the authentication code, you would see the same error at the next
step in the communication.
This looks like the condor_startds are overloaded and falling way behind on accepting new connections from the schedds/shadows.
You can check whether the daemons are overloaded with this command:
condor_status -af Machine RecentDaemonCoreDutyCycle
In the output, if the number for a machine is close to 1.0 (0.95 or higher), then it's near or at overload.
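To list only the machines near overload, the same query can be narrowed with a constraint (a sketch; it needs a running pool to query, and the 0.95 cutoff is just the rule of thumb above):

```shell
# Show only machines whose recent daemon-core duty cycle exceeds 0.95.
condor_status -constraint 'RecentDaemonCoreDutyCycle > 0.95' \
    -af Machine RecentDaemonCoreDutyCycle
```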
- Jaime
Thanks Jaime for the response.
What kind of issue could this be? Recently my scale has grown to about 2000 executor machines, each with partitionable slots, plus 20 submitters and 1 master. I am experiencing this issue on a random submitter: TCP connections
are dropped and Condor socket buffer reads or writes fail.
I am not able to debug this, so I have to stop Condor on that machine and clear the log and spool paths; then it works fine again, but my jobs are lost and I have to resubmit them.
I have tried increasing schedd and collector workers and file descriptors. With this some things got better, but the issue was not resolved; in the logs I was only seeing authentication issues between that specific submitter and the executors/master/other submitters (authentication issue attached in screenshot).
Currently I use FS, PASSWORD authentication on submitters and PASSWORD authentication on Executors.
I can also change the authentication method if another would be better.
Thanks and regards
Raman
<1000083790.png>
Increasing SHARED_PORT_MAX_FILE_DESCRIPTORS may help (the default in later versions was increased to 20000). But if the default of 4096 is causing these errors, it suggests there's some deeper problem.
These errors are not related to authentication, but to HTCondor's machinery that allows multiple daemons (i.e. condor_master, condor_startd) to be contactable via a single TCP port. You could configure HTCondor so that each daemon binds to its own dynamic TCP
port (set USE_SHARED_PORT=False), but that has drawbacks (primarily navigating firewalls).
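As a config sketch, the two knobs discussed look like this (values illustrative; run condor_reconfig after editing):

```
# Raise the shared port daemon's file descriptor limit
# (later HTCondor versions default to 20000; older default is 4096).
SHARED_PORT_MAX_FILE_DESCRIPTORS = 20000

# Fallback with firewall drawbacks: give each daemon its own dynamic TCP port.
# USE_SHARED_PORT = False
```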
- Jaime
> On Feb 11, 2026, at 10:26 AM, Ram Ban <ramban046@xxxxxxxxx> wrote:
>
> Hi,
>
> I am seeing random jobs getting restarted due to hitting the lease time, even though other jobs from the same submitter are running fine on the executor. On investigating the Condor logs, I found these authentication errors (attached in screenshot).
>
> Will these be fixed by increasing MAX_FILE_DESCRIPTORS?
>
> Also, I run all my machines on my own private network; can I remove this authentication on Execute, at least to eliminate these random issues?
>
> I am running Condor version 10.2.0.
> Thanks and regards
> Raman
>
> <1000083751.png>
>
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
The archives can be found at:
https://www-auth.cs.wisc.edu/lists/htcondor-users/