On 03/11/2017 10:11 AM, Todd Tannenbaum wrote:
At first blush, looks to me like the network between the CERN central
manager and the clients doing queries occasionally becomes
congested/slow. When this happens, forked query workers take a really
long time to do their thing as evidenced by the massive numbers for
the MissedForkRuntime* stats ("MissedFork" = queries that we did
in-process because we would have exceeded the max number of forked workers).
regards,
Todd
Thanks, Todd. So, "HandleQueryForked" and "HandleQueryMissedFork" are
counts, right? So, we only handled 7 queries inline because of too many
forked children out of 20234 requests in this snippet. Of those 7,
though, one took 2124 seconds to complete, but of the 20227 forked
queries, the slowest was 19 seconds? Perhaps the collector should never
handle a query inline, and instead just return an EAGAIN error to the client.
Stats from CERN central manager (after running about 2 days):
HandleQueryForked = 20227
HandleQueryForkedRuntimeMax = 19.9297046919819
HandleQueryForkedRuntimeMin = 0.005120144924148917
HandleQueryMissedFork = 7
HandleQueryMissedForkRuntimeMax = 2124.699060306884
HandleQueryMissedForkRuntimeMin = 0.001575499074533582
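
To double-check the arithmetic above, here is a rough Python sketch that parses the "Name = value" lines and compares the two query paths. The helper names (parse_stats, summarize) are made up for illustration and are not part of HTCondor; the only inputs are the six values pasted in this message.

```python
# Stats copied verbatim from the collector output quoted above.
STATS_TEXT = """\
HandleQueryForked = 20227
HandleQueryForkedRuntimeMax = 19.9297046919819
HandleQueryForkedRuntimeMin = 0.005120144924148917
HandleQueryMissedFork = 7
HandleQueryMissedForkRuntimeMax = 2124.699060306884
HandleQueryMissedForkRuntimeMin = 0.001575499074533582
"""


def parse_stats(text):
    """Turn 'Name = value' lines into a dict of floats (hypothetical helper)."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip anything that is not a 'Name = value' line
        stats[name.strip()] = float(value)
    return stats


def summarize(stats):
    """Compare forked queries against those handled inline (hypothetical helper)."""
    forked = int(stats["HandleQueryForked"])
    missed = int(stats["HandleQueryMissedFork"])
    total = forked + missed
    return {
        "total": total,
        "missed_fraction": missed / total,
        # How much slower the worst inline query was than the worst forked one.
        "max_runtime_ratio": stats["HandleQueryMissedForkRuntimeMax"]
        / stats["HandleQueryForkedRuntimeMax"],
    }


if __name__ == "__main__":
    for key, value in summarize(parse_stats(STATS_TEXT)).items():
        print(f"{key} = {value}")
```

Under those assumptions, the inline path accounts for well under 0.1% of requests, yet its worst case is over a hundred times slower than the worst forked query.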