One common cause of schedd overload is many rapid condor_q queries. Do you see any user processes running "watch condor_q"?
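One way to look for such processes on a submit node is a plain ps/grep scan (a sketch assuming a standard Linux userland; the output format is illustrative):

```shell
# List any user processes running "watch condor_q".
# The bracketed pattern keeps grep from matching its own command line.
procs=$(ps -eo user,pid,etime,cmd | grep '[w]atch condor_q' || true)
if [ -n "$procs" ]; then
    echo "$procs"
else
    echo "no watch condor_q processes found"
fi
```

The same scan would also catch shell loops that poll condor_q tightly; adjust the grep pattern accordingly.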
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Ram Ban <ramban046@xxxxxxxxx>
Sent: Thursday, February 12, 2026 12:29 PM
To: Jaime Frey <jfrey@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Getting Authentication issues on Execute or random Jobs
Yes, I am seeing this number very high (above 0.95) on a lot of executors, but load and RAM are fine on those executors, and jobs from other submitters are running fine on the same executors.
Is there any way to fix this? If any submitter is stuck, it degrades my whole pipeline time.
Thanks and Regards
Raman
The errors are detected by the client at authentication time because that's the first thing that happens on a new connection. But if you totally disable the authentication code, you would see the same error at the next
step in the communication.
This looks like the condor_startds are overloaded and falling way behind on accepting new connections from the schedds/shadows.
You can check whether the daemons are overloaded with this command:
condor_status -af Machine RecentDaemonCoreDutyCycle
In the output, if the number for a machine is close to 1.0 (0.95 or higher), then it's near or at overload.
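To list only the machines near overload, the same query can be narrowed with a constraint (a sketch; it needs a running pool to query, and the 0.95 cutoff is just the rule of thumb above):

```shell
# Show only machines whose recent daemon-core duty cycle exceeds 0.95.
condor_status -constraint 'RecentDaemonCoreDutyCycle > 0.95' \
    -af Machine RecentDaemonCoreDutyCycle
```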
- Jaime
Thanks Jaime for the response.
What kind of issue could this be? Recently my scale has grown to about 2000 executor machines, each with partitionable slots, plus 20 submitters and 1 master. I am experiencing this issue on a random submitter: TCP connections
are dropped and Condor socket buffer reads or writes fail.
I am not able to debug this, so I have to stop Condor on that machine and clear the log and spool paths; then it works fine again, but my jobs are lost and I have to resubmit them.
I have tried increasing schedd and collector workers and file descriptors. With this some things got better, but the issue was not resolved; in the logs I was only seeing authentication issues between that specific submitter and the executors/master/other submitters (authentication issue attached in screenshot).
Currently I use FS, PASSWORD authentication on submitters and PASSWORD authentication on Executors.
I can also change the authentication method if another would be better.
Thanks and regards
Raman
<1000083790.png>
Increasing SHARED_PORT_MAX_FILE_DESCRIPTORS may help (the default in later versions was increased to 20000). But if the default of 4096 is causing these errors, it suggests there's some deeper problem.
These errors are not related to authentication, but to HTCondor's machinery that allows multiple daemons (i.e. condor_master, condor_startd) to be contactable via a single TCP port. You could configure HTCondor so that each daemon binds to its own dynamic TCP
port (set USE_SHARED_PORT=False), but that has drawbacks (primarily navigating firewalls).
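As a config sketch, the two knobs discussed look like this (values illustrative; run condor_reconfig after editing):

```
# Raise the shared port daemon's file descriptor limit
# (later HTCondor versions default to 20000; older default is 4096).
SHARED_PORT_MAX_FILE_DESCRIPTORS = 20000

# Fallback with firewall drawbacks: give each daemon its own dynamic TCP port.
# USE_SHARED_PORT = False
```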
- Jaime
> On Feb 11, 2026, at 10:26 AM, Ram Ban <ramban046@xxxxxxxxx> wrote:
>
> Hi,
>
> I am seeing random jobs getting restarted due to hitting the lease time, even though other jobs from the same submitter are running fine on the executor. On investigating the Condor logs, I found these authentication errors (attached in screenshot).
>
> Will these be fixed by increasing MAX_FILE_DESCRIPTORS?
>
> Also, I run all my machines on my own private network; can I remove this authentication on Execute, at least to eliminate these random issues?
>
> I am running Condor version 10.2.0.
> Thanks and regards
> Raman
>
> <1000083751.png>
>
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
The archives can be found at:
https://www-auth.cs.wisc.edu/lists/htcondor-users/