
Re: [HTCondor-users] Delay for republishing (worker) nodes when collector restarts?



Hi Christoph,

Thanks for the pointers, but I see no indication that the Collector is actually overloaded.

After the restart there's a sizeable surplus of available workers [0] even while all the nodes report back. I'm not sure the updates count as QUERY, though; workers only get forked for query commands such as QUERY_MULTIPLE_PVT_ADS and QUERY_STARTD_PVT_ADS.
The collector ClassAd also lists all `UpdatesLost` etc. as 0, even after quite a while.
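
In case it helps, this is roughly how we pull those stats (a minimal sketch using the v1 bindings; the pool host is a placeholder and the exact Updates* attribute names may vary by version):

import htcondor

# Fetch the collector's own ad and print everything that looks like
# an update statistic (UpdatesTotal, UpdatesLost, ...).
coll = htcondor.Collector("collector.example.org")  # placeholder pool host
for ad in coll.query(htcondor.AdTypes.Collector):
    for attr in sorted(ad.keys()):
        if attr.startswith("Updates"):
            print(attr, "=", ad[attr])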

Is there some specific error I might look for?

Cheers,
Max

[0]
07/23/25 21:29:27 (pid:491495) (D_ALWAYS) Got QUERY_SCHEDD_ADS
07/23/25 21:29:27 (pid:491495) (D_ALWAYS) QueryWorker: forked new worker with id 492006 ( max 32 active 1 pending 0 )

> On 23. Jul 2025, at 14:42, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
> 
> Hi Max,
> 
> that sounds a bit laggy to me ;) 
> 
> Is your collector overly busy at the time?
> 
> Have you checked the number of workers spawned while all the StartDs are coming back (CollectorLog)?
> 
> COLLECTOR_QUERY_WORKERS is the knob to start more of those if needed ...
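> 
> For example, something like this in the collector's local config (illustrative sketch only; the value is a guess, size it to your pool):
> 
> COLLECTOR_QUERY_WORKERS = 64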
> 
> Best
> christoph
> 
> 
> -- 
> Christoph Beyer
> DESY Hamburg
> IT-Department
> 
> Notkestr. 85
> Building 02b, Room 009
> 22607 Hamburg
> 
> phone: +49-(0)40-8998-2317
> mail: christoph.beyer@xxxxxxx
> 
> ----- Original Message -----
> From: "Max Fischer, SCC" <max.fischer@xxxxxxx>
> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
> Sent: Wednesday, 23 July 2025 14:23:20
> Subject: [HTCondor-users] Delay for republishing (worker) nodes when collector restarts?
> 
> Hi all,
> 
> TLDR: When a collector restarts, it takes roughly ten times longer than we expected for `condor_status` to report the entire cluster again. Is there a way for us to tweak that?
> 
> We've recently changed our update policy in our cluster and as a side-effect looked more closely at how nodes behave on restart. Specifically for restarting the Collector, we have seen a much bigger effect than expected and are wondering how to mitigate that.
> 
> We run two Collectors plus Negotiators in an HA setup, with roughly 250 StartDs and half a dozen SchedDs. In the configs, all UPDATE_INTERVAL and UPDATE_OFFSET variants are the default (i.e. 5 minutes for StartD and SchedD). Thus, when restarting a Collector we would expect a delay of at most 10 minutes before it knows the entire cluster again.
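> 
> For reference, the knobs we mean at their documented defaults (a sketch from memory, worth double-checking against the manual):
> 
> UPDATE_INTERVAL = 300   # StartD -> Collector update period, in seconds
> SCHEDD_INTERVAL = 300   # SchedD update period, in seconds
> UPDATE_OFFSET   = 0     # random delay added before each update
> 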
> However, what we recently observed was that after about 10 minutes only half the StartDs and SchedDs had reported back to the Collector. Most StartDs were known again after about 20 minutes, but the last 10% took about 2 hours. The SchedDs similarly had a tail of about an hour until all were known again.
> At the same time, the other Collector was kept running and stayed aware of all nodes and especially their status changes. We've seen the same behaviour with roles swapped, so it does not depend on the machine.
> 
> We do use the v1 `htcondor` bindings for getting this data. Manually checking via `condor_status` looks similar.
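> 
> Roughly what our check boils down to (a minimal v1-bindings sketch; the pool name is a placeholder):
> 
> import htcondor
> 
> # Count the StartD and SchedD ads the collector currently knows about.
> coll = htcondor.Collector("collector.example.org")  # placeholder pool host
> startds = coll.query(htcondor.AdTypes.Startd, projection=["Machine"])
> schedds = coll.query(htcondor.AdTypes.Schedd, projection=["Name"])
> print(len(startds), "StartD ads,", len(schedds), "SchedD ads known")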
> 
> In the logs of the StartDs [0], we see only that a) the node fails to update the Collector around the time of the restart and b) it tries (and fails) to reconnect after 10-60 seconds. The StartDs that perform their update a bit later only fail that update and then seem to reconnect successfully right afterwards. There seem to be no further connection attempts or failures logged on any of our nodes.
> 
> Is there anything that could explain such a tail of delayed node discovery? Can we tweak this somehow? Failing that, is there a way to estimate when the data is reliable again?
> 
> Cheers,
> Max
> 
> [0] MasterLog
> 07/21/25 10:05:53 (pid:26064) (D_ALWAYS) condor_write(): Socket closed when trying to write 2978 bytes to collector a01-001-110.gridka.de, fd is 3
> 07/21/25 10:06:14 (pid:26064) (D_ALWAYS) ERROR: SECMAN:2003:TCP connection to collector a01-001-110.gridka.de failed.
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> 
> The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/
