Re: [HTCondor-devel] Need for dynamic slots in top-level collector?


Date: Sat, 11 Mar 2017 10:39:56 -0600
From: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] Need for dynamic slots in top-level collector?
On 03/11/2017 10:11 AM, Todd Tannenbaum wrote:

At first blush, it looks to me like the network between the CERN central manager and the clients doing queries occasionally becomes congested/slow. When this happens, forked query workers take a really long time to do their thing, as evidenced by the massive numbers for the MissedForkRuntime* stats ("MissedFork" = queries that we did in-process because we would have exceeded the max number of forked workers).

regards,
Todd

Thanks, Todd. So, "HandleQueryForked" and "HandleQueryMissedFork" are counts, right? So, out of 20234 requests in this snippet, we only handled 7 queries inline because there were too many forked children. Of those 7, though, one took 2124 seconds to complete, while of the 20227 forked queries, the slowest took about 20 seconds? Perhaps the collector should never handle a query inline, and should just return an EAGAIN error to the client.
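
The idea of refusing a query rather than running it inline could be sketched roughly as below. This is a toy model only, not the collector's actual code: the class name ForkLimitedDispatcher, its methods, and the counter that stands in for real fork() calls are all assumptions made for illustration.

```python
import errno


class ForkLimitedDispatcher:
    """Toy model of a query dispatcher with a cap on forked workers.

    Hypothetical: this does not mirror HTCondor's collector internals.
    """

    def __init__(self, max_workers):
        self.max_workers = max_workers
        self.active_workers = 0

    def dispatch(self, query):
        # At the cap: refuse the query instead of running it in-process,
        # so a slow client cannot stall the main loop.
        if self.active_workers >= self.max_workers:
            return (errno.EAGAIN, None)
        # Stand-in for a successful fork(); a real daemon would fork here.
        self.active_workers += 1
        return (0, query)

    def worker_exited(self):
        # Called when a (simulated) forked worker has been reaped.
        if self.active_workers > 0:
            self.active_workers -= 1
```

With a cap of 2, the third concurrent query would get EAGAIN back and the client would be expected to retry later, once worker_exited() has freed a slot.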

Stats from CERN central manager (after running about 2 days):


HandleQueryForked = 20227
HandleQueryForkedRuntimeMax = 19.9297046919819
HandleQueryForkedRuntimeMin = 0.005120144924148917
HandleQueryMissedFork = 7
HandleQueryMissedForkRuntimeMax = 2124.699060306884
HandleQueryMissedForkRuntimeMin = 0.001575499074533582
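
As a quick sanity check on the numbers above, a small sketch like the following parses the attribute lines and works out what share of queries were handled inline and how the worst-case runtimes compare. The helper name parse_stats is hypothetical; the values are copied from the snippet.

```python
STATS_TEXT = """\
HandleQueryForked = 20227
HandleQueryForkedRuntimeMax = 19.9297046919819
HandleQueryForkedRuntimeMin = 0.005120144924148917
HandleQueryMissedFork = 7
HandleQueryMissedForkRuntimeMax = 2124.699060306884
HandleQueryMissedForkRuntimeMin = 0.001575499074533582
"""


def parse_stats(text):
    """Parse 'Name = value' lines into a dict of floats (hypothetical helper)."""
    stats = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        stats[name.strip()] = float(value)
    return stats


stats = parse_stats(STATS_TEXT)
forked = stats["HandleQueryForked"]
missed = stats["HandleQueryMissedFork"]
total = forked + missed

# Fraction of all queries that were handled inline (fork limit reached).
missed_share = missed / total

# How much slower the worst inline query was than the worst forked one.
max_ratio = (stats["HandleQueryMissedForkRuntimeMax"]
             / stats["HandleQueryForkedRuntimeMax"])

print(f"total queries: {total:.0f}")
print(f"handled inline: {missed_share:.4%}")
print(f"slowest inline / slowest forked: {max_ratio:.1f}x")
```

The inline queries are a tiny fraction of the total, yet the slowest of them ran roughly two orders of magnitude longer than the slowest forked query.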
