HTCondor Project List Archives




Re: [Condor-devel] [Condor-team] RFC: GCB optimization for local network communication



At 11:23 PM 12/11/2006, Derek Wright wrote:
a condor pool that is GCB enabled will currently always communicate
via the GCB broker, even between 2 machines on the same private
network.  this creates needless performance bottlenecks and potential
failure points.

to fix it, we need daemons to advertise not only their GCB-provided
public IP/port, but also a) their real, local IP/port and b) some
unique network identifier.  lots of machines could be at IP
192.168.2.*, even if they're in totally different networks and have
no way to contact each other directly, so just knowing the real IP
and "i'm 192.168.2.3, and i'm trying to talk to 192.168.2.4" doesn't
tell you if you need GCB or not.


------------------
proposal part 1:
------------------

instead of jumping through lots of hoops to try to uniquely identify
machines or networks in the code, we just punt to the admins.  just
like they have to specify a unique UID_DOMAIN for that stuff to work,
if they're setting up GCB, they have to set up a unique NETWORK_NAME
(exact name TBD) and we just use whatever they say.  if 2 machines
are in the same NETWORK_NAME, they can assume direct communication
and avoid GCB (provided they have the real local IP/port, not just
the public IP/port in the canonical sinful string).

So, let me make sure that I understand what is going on - A "network address" in Condor will consist of three elements - IP/port, NETWORK_NAME, and GCB IP/port. The NETWORK_NAME will be used only when the address has a non-null GCB element, right? In other words, if a NETWORK_NAME is not provided, Condor will assume that the IP/port is in a "GLOBAL NETWORK NAME SPACE". We also assume that the GCB is in this GLOBAL space. If a GCB is not included, should Condor check that both parties are in the same NETWORK_SPACE before attempting to establish a connection? Can we envision cases where NETWORK_NAME will be used in a requirement expression?
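
To make the three-element address and the questions above concrete, here is a minimal sketch in Python (not Condor code). The names `NetAddress` and `pick_endpoint` are hypothetical, and the handling of a peer that names a private network but advertises no GCB address is only one possible answer to the open question, not a decision from this thread.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


@dataclass(frozen=True)
class NetAddress:
    """Hypothetical three-element address: real IP/port, NETWORK_NAME, GCB IP/port."""
    local: Endpoint                     # real, local IP/port
    network_name: Optional[str] = None  # None means the global name space (assumption)
    gcb: Optional[Endpoint] = None      # GCB-provided public IP/port, if any


def pick_endpoint(me: NetAddress, peer: NetAddress) -> Optional[Endpoint]:
    """Choose where to connect, or return None if no route is known."""
    # Same admin-declared private network: talk directly and skip the broker.
    if me.network_name is not None and me.network_name == peer.network_name:
        return peer.local
    # Different (or unknown) networks: use the peer's GCB address if it has one.
    if peer.gcb is not None:
        return peer.gcb
    # No GCB and no NETWORK_NAME: treat the real IP/port as globally reachable.
    if peer.network_name is None:
        return peer.local
    # Peer is in some other private network with no broker: no usable route.
    return None
```

With this reading, two machines that both advertise 192.168.2.* addresses connect directly only when their NETWORK_NAME values match; otherwise the choice falls back to the GCB address, which is today's behavior.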



------------------
proposal part 2:
------------------

how will daemons know the net_name + real ip/port?  one avenue i've
been investigating is to modify the format of sinful strings, and
include all this additional info.  something like:

StartdIpAddr = "<public_ip:port><network_name:local_ip:port>"

sadly, there are 698 call sites that reference "sinful" in our
source, and an additional 137 that use one of the sinful-string
related helper functions (sin_to_string, string_to_port, etc, etc).
after spending quite a bit of time looking at this code, it's clear
we're basically doomed if we change the format of the strings like
this.  old daemons *will* seg fault (static buffers in DaemonCore,
among many), if the size of the sinful string more than doubles.  a
lot of code will do utterly wrong things. :(

so, we have 2 real options:

a) have our "network incompatibility flag day", declare that 6.9.x is

I see many (good) reasons to wait until 7.0 with such a change.

utterly incompatible with everything before it, and change the format
of the sinful string however we want.  while we're at it, we'd
probably change the names of the classad attributes, so we just use
"MyIpAddr" everywhere, instead of "StartdIpAddr" vs. "ScheddIpAddr",
etc, etc.  we could also rip out at least 1000 lines of code, maybe
more, of cruft/bloat from our varied attempts to maintain backwards
compatibility.

We should do it once we have experience with how b) works.


b) forget about changing the existing sinful-string related
attributes and functions, and handle this GCB optimization with a
brand new classad attr, something like:

RealNetworkId = "<network_name:ip:port>"

Why not just add an attribute for NETWORK_NAME and let the Condor software do the rest internally?


this would be the admin-specified network_name, and the real local
IP/port.  then, we'd just have to incrementally change parts of the
Condor code to make use of this new attribute and do the

Yes, this is the way to go. Moreover, it will help us understand how people use this capability and what kind of "creative" ideas they have.

optimization.  it seems like with relatively small changes (mostly to
DaemonCore and DaemonClient) we could handle a major portion of the

Good!

network communications.  we might miss some outlying cases in the
first pass, but we could fix those in stages.  everything would

I believe that our main GCB users at this point are doing it with Glide-ins, right?

continue to be compatible, and would work... it's just a question of
if a given connection could use this optimization to skip talking to
the GCB broker or not.
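
As a rough illustration of option (b), the following Python sketch (not Condor code) reads the proposed RealNetworkId attribute from an ad and falls back to the existing sinful-string attribute whenever the new one is missing, malformed, or names a different network. The helper names, the dict standing in for a classad, and the assumption that a network name contains no ':' are all hypothetical; the attribute name and format themselves were still tentative in the proposal.

```python
import re
from typing import Mapping, Optional, Tuple

# Matches the proposed "<network_name:ip:port>" form (IPv4 only in this sketch).
_REAL_ID = re.compile(r"^<([^:<>]+):(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>$")


def parse_real_network_id(value: str) -> Optional[Tuple[str, str, int]]:
    """Return (network_name, ip, port), or None if the value is not well formed."""
    m = _REAL_ID.match(value.strip())
    if m is None:
        return None
    return m.group(1), m.group(2), int(m.group(3))


def contact_string(ad: Mapping[str, str], addr_attr: str,
                   my_network: Optional[str]) -> str:
    """Return an ordinary "<ip:port>" sinful string to connect to.

    The direct address is used only when the ad carries a RealNetworkId whose
    network name equals ours; otherwise the existing attribute is returned
    unchanged, so ads from older daemons behave exactly as before.
    """
    parsed = parse_real_network_id(ad.get("RealNetworkId", ""))
    if parsed is not None and my_network is not None and parsed[0] == my_network:
        _, ip, port = parsed
        return f"<{ip}:{port}>"
    return ad[addr_attr]
```

For example, an ad with `StartdIpAddr = "<198.51.100.7:40001>"` and `RealNetworkId = "<lab-a:192.168.2.4:9618>"` would yield the direct address for a client in "lab-a" and the GCB address for everyone else. The existing sinful string is never changed or lengthened, which is the compatibility property option (b) relies on.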

Should this be the question or should we ask whether a connection requires a GCB in order to be established?



option (a) certainly has a lot of appeal, but it's a rather huge
change for what is ultimately a pretty small subset of our users.
i'd still *love* to purge as much compatibility cruft and bloat as
possible, but this might not be the best time/reason to do so.

given all the facts, i'm voting for b.  i'd like a decision ASAP so
we can try to get as much of this done this week while i'm in town.

I vote (b).

Miron




thanks,
-derek