Re: [HTCondor-devel] Need for dynamic slots in top-level collector?


Date: Fri, 10 Mar 2017 11:13:15 -0600
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] Need for dynamic slots in top-level collector?
On 3/10/2017 10:44 AM, Greg Thain wrote:
> On 03/10/2017 10:33 AM, Brian Bockelman wrote:
>> Is it safe to assume that 25% of the updates are d-slot private ads?
>> Can we drop those?
>
> We can drop all the private ads for claimed slots (dynamic or static),
> but the private ads are very small compared to the public ads, so I
> don't know how much of a difference that would make.
>
> -greg

We can put the information contained in the private ads into the corresponding public ads as private attributes. But as Greg says, not sure that it would help much. And if preemption is disabled, as Greg noted above, we don't need any private ads for dynamic slots (as they are all claimed).

In v8.7.0 we have additional statistics in the collector ad about where the top-level collector is spending time and stats on the number of fork workers.

I think we should pursue the following:

1. Once the collector has reached its forked child worker limit, it should queue up additional query requests and service them as child workers exit. We could do this for v8.7.1 (in fact, I am hoping to do this next week, assuming the new stats in v8.7.0 show that this is going to help).

2. The collector should only respond to queries in-process if we "know" the response will be faster than forking. Right now we make the decision to fork or not based on the table being queried. I propose we make the following changes: (a) the collector should respond in-process only if the query is for a small table AND the query has a projection of less than X attributes, and (b) collector in-process results need to be sent back to the client using non-blocking I/O. Item (a) is trivial and could happen for v8.7.1; item (b) is a bit more involved, but not too bad. Happily, the collector only does one end-of-message after sending all the responses, so a non-blocking relisock can buffer the response (at the cost of RAM) without needing to deal with moving to shared or weak pointers to ads in the collector.

3. If and only if preemption is disabled, then: (1) the accountant could get accounting information out of pslot roll-up information, so child collectors could avoid sending dslots to the parent, and (2) no need for children to forward private ads for slots in Claimed state.

4. Make the "collector tree" central manager setup a first-class configuration that only requires the admin to state something simple like the max size of their pool and/or the number of child collectors. If HTCondor always configures the collector tree in a specific manner, we can leverage that to our benefit instead of trying to make things better given any possible way folks could set things up. We could, for instance, always set up two top-level collectors, one just for operations (the negotiator) and one or more for monitoring (condor_status). (Yes, this is trading off RAM for performance.) We could have the shared_port forward updates to specific child collectors (removing the complexity of a collector-tree config at the startd/schedd machines). We could have CCB always in a separate set of processes. And so on.
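
To make items 1 and 2(a) concrete, here is a minimal Python sketch of the dispatch logic. It is not HTCondor code: the names (Query, Dispatcher, answer_in_process) and both thresholds are hypothetical, and "small table" is approximated by an ad count, which is an assumption rather than how the collector classifies tables today.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional

# Hypothetical thresholds; the proposal only says "small table" and
# "a projection of less than X attributes".
SMALL_TABLE_MAX_ADS = 100
MAX_PROJECTION_ATTRS = 10


@dataclass
class Query:
    table_size: int                    # number of ads in the queried table (assumed metric)
    projection: Optional[List[str]]    # None means no projection (full ads returned)


def answer_in_process(q: Query) -> bool:
    """Item 2(a): in-process only for a small table AND a narrow projection."""
    if q.projection is None:
        return False
    return (q.table_size <= SMALL_TABLE_MAX_ADS
            and len(q.projection) < MAX_PROJECTION_ATTRS)


@dataclass
class Dispatcher:
    max_workers: int
    active_workers: int = 0
    pending: Deque[Query] = field(default_factory=deque)

    def submit(self, q: Query) -> str:
        """Decide how a query is handled; returns a label for illustration."""
        if answer_in_process(q):
            return "in-process"
        if self.active_workers < self.max_workers:
            self.active_workers += 1
            return "forked"
        # Item 1: at the worker limit, queue the query instead of forking.
        self.pending.append(q)
        return "queued"

    def worker_exited(self) -> Optional[Query]:
        """Called when a forked child exits; starts the oldest queued query, if any."""
        self.active_workers -= 1
        if self.pending:
            self.active_workers += 1
            return self.pending.popleft()
        return None
```

With max_workers=1, a narrow query against a small table is answered in-process, the first heavy query is forked, and the next heavy query waits in the queue until worker_exited() hands it to a new worker.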

regards,
Todd


--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685