
[HTCondor-users] Delay for republishing (worker) nodes when collector restarts?



Hi all,

TLDR: When a collector restarts, it takes roughly ten times longer than we expected for `condor_status` to report the entire cluster again. Is there a way for us to tweak that?

We've recently changed the update policy in our cluster and, as a side effect, looked more closely at how nodes behave on restart. For restarting a Collector specifically, we have seen a much larger effect than expected and are wondering how to mitigate it.

We run two Collectors plus Negotiators in an HA setup, with roughly 250 StartDs and half a dozen SchedDs. In the configs, all UPDATE_INTERVAL and UPDATE_OFFSET variants are at their defaults (i.e. 5 minutes for StartD and SchedD). Thus, when restarting a Collector we would expect a delay of at most 10 minutes before it knows the entire cluster again.
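
(For completeness, this is roughly how we confirm the effective values on a node with the v1 bindings; `htcondor.param` only reads the local configuration of the machine the script runs on, so this is just a sanity check, not a survey of the whole cluster.)

import htcondor

# Effective values from the local configuration of this node; both should
# still be at their defaults in our setup.
for knob in ("UPDATE_INTERVAL", "UPDATE_OFFSET"):
    try:
        print(knob, "=", htcondor.param[knob])
    except KeyError:
        print(knob, "is not set locally")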
However, what we recently observed was that after about 10 minutes only half the StartDs and SchedDs were known to the Collector again. Most StartDs were known again after about 20 minutes, but the last 10% took about 2 hours. The SchedDs similarly had a tail of about an hour until all of them were known again.
At the same time, the other Collector kept running and stayed aware of all nodes and, in particular, their status changes. We've seen the same behaviour with the roles swapped, so it does not depend on the machine.

We use the v1 `htcondor` Python bindings to gather this data; checking manually via `condor_status` shows the same picture.
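
For context, the checks boil down to collector queries along these lines (a minimal sketch with the v1 bindings; the projection is just what we happen to look at):

import htcondor

# Query one specific collector rather than the HA pair, so that we see
# exactly what the restarted daemon currently knows.
coll = htcondor.Collector("a01-001-110.gridka.de")
startds = coll.query(htcondor.AdTypes.Startd,
                     projection=["Machine", "State", "LastHeardFrom"])
schedds = coll.query(htcondor.AdTypes.Schedd,
                     projection=["Name", "LastHeardFrom"])
print(len(startds), "StartD ads,", len(schedds), "SchedD ads known")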

In the logs on the StartD nodes [0], we only see that (a) the node fails to update the Collector around the time of the restart and (b) it tries (and fails) to reconnect after 10-60 seconds. StartDs whose update happens a bit later only fail that one update and then seem to reconnect successfully right afterwards. No further connection attempts or failures seem to be logged on any of our nodes.

Is there anything that could explain such a tail of delayed node discovery? Can we tweak this somehow? Failing that, is there a way to estimate when the data is reliable again?
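
For now, the only proxy we can think of is to diff the restarted Collector against the one that kept running, roughly like this (a sketch with the v1 bindings; the hostname of the second collector is made up here):

import htcondor

restarted = htcondor.Collector("a01-001-110.gridka.de")
surviving = htcondor.Collector("a01-001-111.gridka.de")  # hypothetical name for the HA partner that stayed up

def known_startds(coll):
    # Names of all StartD ads a given collector currently has.
    return {ad["Name"] for ad in coll.query(htcondor.AdTypes.Startd,
                                            projection=["Name"])}

missing = known_startds(surviving) - known_startds(restarted)
print("%d StartDs not yet re-registered with the restarted collector" % len(missing))
for name in sorted(missing):
    print("  ", name)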

Cheers,
Max

[0] MasterLog
07/21/25 10:05:53 (pid:26064) (D_ALWAYS) condor_write(): Socket closed when trying to write 2978 bytes to collector a01-001-110.gridka.de, fd is 3
07/21/25 10:06:14 (pid:26064) (D_ALWAYS) ERROR: SECMAN:2003:TCP connection to collector a01-001-110.gridka.de failed.
