
Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes



> On Jul 30, 2016, at 2:13 PM, Iain Bradford Steers <iain.steers@xxxxxxx> wrote:
> 
> Hi Brian,
> 
> We noticed this on our 8.5.5 CC7 infra nodes (cm, schedds) as well, primarily on start-up.
> 
>> On Jul 30, 2016, at 21:04, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
>> 
>> The two things I can think of that block in the master are:
>> - DNS lookups (the one Andrew originally quoted appears to be inside the security subsystem).
>> - Updating the collector.  Kinda: I suspect that most updates are nonblocking because they buffer in the outgoing TCP socket.  However, when you have to establish a new security session...
> 
> Primarily the second one for us. Whilst conversing with the htcondor team, it was noticed that most of our kills occurred with HTCondor inside relisock doing the initial authentications on start-up.

Indeed: client side authentication is often blocking.  Only the server-side has been made non-blocking.

Something for the TODO list, I suppose.  If we get to the point where only DNS lookups are blocking in the master, then maybe it's time to take another look at c-ares.

:/  DNS is hard.
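
To make the blocking-lookup problem concrete, here is a minimal Python sketch (not HTCondor code, and not how c-ares works internally) of one generic way to keep a caller from stalling on a resolver: hand the lookup to a worker thread and wait only up to a deadline. The names resolve_with_deadline and fake_slow_resolver are hypothetical, chosen for illustration; only socket.gethostbyname and concurrent.futures come from the standard library.

```python
import concurrent.futures
import socket
import time

# Small shared pool; the size is an arbitrary assumption for this sketch.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def resolve_with_deadline(host, timeout_s, resolver=socket.gethostbyname):
    """Return the resolver's answer, or None if it is not ready in time.

    The lookup itself still blocks, but it blocks a worker thread,
    so the caller regains control after at most timeout_s seconds.
    """
    future = _POOL.submit(resolver, host)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The worker keeps running in the background; we just stop waiting.
        return None


def fake_slow_resolver(host):
    """Hypothetical stand-in for a DNS server that is slow to answer."""
    time.sleep(0.5)
    return "192.0.2.1"  # address from the documentation range (RFC 5737)
```

Calling resolve_with_deadline("some.host", 0.05, resolver=fake_slow_resolver) gives up after roughly 50 ms instead of waiting out the full delay. A truly asynchronous resolver avoids tying up a thread per lookup, which is the appeal of a library like c-ares.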

> 
> We also had kills in schedds opening an initial security session with execute nodes across a fairly saturated WAN.
> 

Nah, this shouldn't affect the master.  That could be the master killing off the schedd (the latter also has a ton of other blocking behavior).

Brian