Re: [HTCondor-devel] Need for dynamic slots in top-level collector?


Date: Sat, 11 Mar 2017 10:11:33 -0600
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] Need for dynamic slots in top-level collector?

FYI, below are some statistics on how long it took to respond to various types of queries, courtesy of the new statistics monitoring added in v8.7.0.

For comparison, I put stats below from both the CERN global CMS pool central manager and UW CHTC's central manager. I got these stats via a command like:

  condor_status -pool central-manager.host.edu -collector -statistics all:2 -l | egrep '(^Handle.*|ForkWorkers)'

At first blush, it looks to me like the network between the CERN central manager and the clients doing queries occasionally becomes congested/slow. When this happens, forked query workers take a really long time to do their thing, as evidenced by the massive numbers for the MissedForkRuntime* stats ("MissedFork" = queries that we did in-process because forking would have exceeded the max number of forked workers).

regards,
Todd

Stats from CERN central manager (after running about 2 days):

CurrentForkWorkers = 2
HandleLocate = 948
HandleLocateForked = 0
HandleLocateForkedRuntime = 0.0
HandleLocateMissedFork = 0
HandleLocateMissedForkRuntime = 0.0
HandleLocateRuntime = 4.252691565779969
HandleLocateRuntimeAvg = 0.004485961567278448
HandleLocateRuntimeMax = 0.5942266429774463
HandleLocateRuntimeMin = 0.000327280955389142
HandleLocateRuntimeStd = 0.02825324607232301
HandleQuery = 72690
HandleQueryForked = 20227
HandleQueryForkedRuntime = 8065.561725556618
HandleQueryForkedRuntimeAvg = 0.398752248260079
HandleQueryForkedRuntimeMax = 19.9297046919819
HandleQueryForkedRuntimeMin = 0.005120144924148917
HandleQueryForkedRuntimeStd = 0.2165687619633886
HandleQueryMissedFork = 7
HandleQueryMissedForkRuntime = 3807.289392268052
HandleQueryMissedForkRuntimeAvg = 543.8984846097218
HandleQueryMissedForkRuntimeMax = 2124.699060306884
HandleQueryMissedForkRuntimeMin = 0.001575499074533582
HandleQueryMissedForkRuntimeStd = 862.8426533178198
HandleQueryRuntime = 5011.952447317541
HandleQueryRuntimeAvg = 0.06894968286308352
HandleQueryRuntimeMax = 251.0224419250153
HandleQueryRuntimeMin = 0.0001686080358922482
HandleQueryRuntimeStd = 1.689945545346726
PeakForkWorkers = 16

Stats from CHTC central manager (after running about 3 weeks):

CurrentForkWorkers = 0
HandleLocate = 77408
HandleLocateForked = 2615
HandleLocateForkedRuntime = 51.6975245885551
HandleLocateForkedRuntimeAvg = 0.01976960787325243
HandleLocateForkedRuntimeMax = 0.04788178578019142
HandleLocateForkedRuntimeMin = 0.007770448923110962
HandleLocateForkedRuntimeStd = 0.00348926807103674
HandleLocateMissedFork = 0
HandleLocateMissedForkRuntime = 0.0
HandleLocateRuntime = 88.02384298294783
HandleLocateRuntimeAvg = 0.001137141419271236
HandleLocateRuntimeMax = 0.1072812303900719
HandleLocateRuntimeMin = 0.0004654340445995331
HandleLocateRuntimeStd = 0.002678180300024122
HandleQuery = 249819
HandleQueryForked = 439004
HandleQueryForkedRuntime = 9062.483967624605
HandleQueryForkedRuntimeAvg = 0.02064328335874982
HandleQueryForkedRuntimeMax = 0.4229466877877712
HandleQueryForkedRuntimeMin = 0.000716477632522583
HandleQueryForkedRuntimeStd = 0.003701340726264902
HandleQueryMissedFork = 9
HandleQueryMissedForkRuntime = 1.984908144921064
HandleQueryMissedForkRuntimeAvg = 0.2205453494356738
HandleQueryMissedForkRuntimeMax = 0.5626467131078243
HandleQueryMissedForkRuntimeMin = 0.0854371152818203
HandleQueryMissedForkRuntimeStd = 0.1904528704960855
HandleQueryRuntime = 862.5680961571634
HandleQueryRuntimeAvg = 0.003452772191695441
HandleQueryRuntimeMax = 0.5754754431545734
HandleQueryRuntimeMin = 0.0002174414694309235
HandleQueryRuntimeStd = 0.009835888478160599
PeakForkWorkers = 10
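
As a rough cross-check of the numbers above, the per-query averages can be recomputed by dividing each *Runtime total by its matching counter. The sketch below is only an illustration: the helper names parse_stats and mean_runtime are made up here and are not part of HTCondor, and the input is just a few "Name = value" lines copied from the two listings.

```python
def parse_stats(text):
    """Parse 'Name = value' lines into a dict of floats, skipping anything else."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue
        try:
            stats[name.strip()] = float(value)
        except ValueError:
            continue  # ignore non-numeric attributes
    return stats


def mean_runtime(stats, counter):
    """Average seconds per request: <counter>Runtime divided by <counter>."""
    count = stats.get(counter, 0)
    if count == 0:
        return None  # no requests of this kind were recorded
    return stats[counter + "Runtime"] / count


# Values copied from the CERN listing above.
cern = parse_stats("""
HandleQueryForked = 20227
HandleQueryForkedRuntime = 8065.561725556618
HandleQueryMissedFork = 7
HandleQueryMissedForkRuntime = 3807.289392268052
""")

# Values copied from the CHTC listing above.
chtc = parse_stats("""
HandleQueryMissedFork = 9
HandleQueryMissedForkRuntime = 1.984908144921064
""")

for pool, stats in (("CERN", cern), ("CHTC", chtc)):
    avg = mean_runtime(stats, "HandleQueryMissedFork")
    print(f"{pool}: {avg:.3f} s per in-process (missed fork) query")
```

The recomputed averages should agree with the reported *RuntimeAvg values, which makes the gap in HandleQueryMissedForkRuntimeAvg between the two pools easy to see side by side.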


