HTCondor Project List Archives




[Condor-devel] alternatives to #define for GCB support



disclaimer: this is long, somewhat complicated, and probably
uninteresting to most of you.  however, i figured i should err on the
side of including more people so that folks with bright ideas could
chime in, or interested parties could learn something.  if you don't
care, don't know, or are too busy, please ignore this message. don't
feel the need to reply, since todd and sonny and i are already working
on it.  but, if anyone a) has a chance to read this and b) has any
comments, please let me know.  thanks,

-d

p.s. 10,000 foot view of the problem: to make condor work with GCB, we
need to convert all syscalls that touch network sockets to use
gcb-aware versions of those methods.

p.p.s. i'll include sonny's original message on why not use #define at
the very bottom of this for the interested reader...


------- Forwarded Message

To: Se-Chang Son <sschang@xxxxxxxxxxx>
Subject: alternatives to #define for GCB support
cc: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Date: Tue, 03 May 2005 17:21:00 -0500
From: Derek Wright <wright@xxxxxxxxxxxxxxxxx>


i sent todd your email about why you don't think #define will work for
converting condor to use GCB.  we agree.  however, we brainstormed
some other possible solutions that we wanted to talk with you about.
are you free on friday (5/6) at 1pm CDT to talk about this stuff with
todd and myself?  you could just go to todd's office and y'all can
call me.  let me know if that time/day works ok for you (that's the
first time todd was available).

in the meantime, here's some stuff to think about...

there are 2 giant problems with "porting" condor to use GCB by the
method currently checked into the latest GCB branch:

1) it's a huge code-maintenance pain.  we'll have tons of cvs
   conflicts.  it will be easy for people to forget to use the
   "Generic_*" version of a method and break GCB support in a subtle
   way that would be very difficult to diagnose and solve, etc, etc.

2) it doesn't work at all with externals we link against (e.g. globus,
   kerberos, etc).  for example, we can't claim condor is "GCBized" if
   once you turn on kerberos authentication, there are socket calls
   that don't know anything about GCB. :(  having a GCB that only
   works if you don't use globus or kerberos is worse than a GCB that
   only works on a few platforms.

todd and i came up with 2 possible alternatives.  they both address
these problems, though each has problems of its own. ;) i wrote up a
summary of our discussion and the proposed alternatives, including a
list of known pros + cons (as i see them).  please read this over and
let me know what you think.  if you agree that these seem reasonable,
i'm hoping you can spend some time this week "playing" with the
proposals and see in practice how difficult or easy they would be to
implement.  sound good?  thanks!

-d

------------------------------------------------------------

1) make libsocketstubs.a that we put on the link line right before we
   include libc.a, libsocket.a, etc.  this would basically be what
   libgeneric.a does now, except all the functions would be named with
   the "real" function names.  we wouldn't change any code in condor,
   we'd just let the linker resolve the functions correctly.  in
   libsocketstubs.a, we'd have separate .o files for each function,
   and those would look something like:

int
send( int s, const void *msg, size_t len, int flags )
{
    if( is_gcb() ) {
        return gcb_send( s, msg, len, flags );
    } else {
        return SEND( s, msg, len, flags );
    }
}


the SEND() method would be "extracted" from the real libc.a or
libsocket.a and we'd use ToUpper to convert the symbol to the all-caps
version.  so, we'd also have a libsocketreal.a (or something) that
contained all the .o files for all the all-caps versions of each
method.  this is exactly the trick we use in the standard universe for
places in the ckpting code where we need to locally call a given
system call, instead of having it go remotely.

in the few cases (like the checkpointing code) where we don't always
want the GCB versions, we wouldn't link those things using
libsocketstubs.a and libsocketreal.a, we'd modify the code to directly
call the Generic_* version in the appropriate places.  


pros:
- works with the externals
- means we only have to modify link-lines, not condor source, except
  in the few special cases where we do *not* want GCB calls.
- works fine with c++ methods that have the same names, since the
  linker can obviously tell the difference between "read()" and
  "ReliSock::read()", etc, etc.

cons:
- major headache getting libsocketstubs.a to work on all platforms.  for
  example, if a given vendor's send.o file in libc.a defines send()
  and _send(), we'd need our stub to do the same thing.  getting this
  right for a given platform is about 1/4 of the total work of making
  a full vs. clipped port of condor. :(
- wouldn't work on solaris, where libsocket only comes as a dynamic
  library and we couldn't extract .o files from that like this.



2) GNU linker tricks

the GNU linker (ld) supports a "--wrap" option that provides a nice
way to have the linker itself basically handle the hard work of option
#1.  basically, if you do this:

"ld -o foo foo.o bar.o baz.o --wrap send -lc -lsocket"

then the linker will replace all unresolved references to "send()"
with calls to "__wrap_send()" instead.  the "real" version of send()
will be made available as "__real_send()".  so, in our case, the
proposed libsocketstubs.a library would be full of "__wrap" methods
that looked something like this:


int
__wrap_send( int s, const void *msg, size_t len, int flags )
{
    if( is_gcb() ) {
        return gcb_send( s, msg, len, flags );
    } else {
        return __real_send( s, msg, len, flags );
    }
}

in this case, we wouldn't need libsocketreal.a at all (and all the
headache it would require to maintain such a thing), and it wouldn't
matter where on the link line we put libsocketstubs.a, since the
linker itself would convert all references to "send()" to call
"__wrap_send()" instead, and we can be sure that no condor, system, or
external library is going to define that.

again, in the few source files where we specifically do *NOT* want to
use GCB, we could either modify the source to directly call
__real_send() or gcb_send() as appropriate, or we could define our own
version of __wrap_send() that only called __real_send(), etc, etc.

you can read the GNU ld man page for details on all of this.
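a minimal sketch of what one such stub file might look like, so the idea can be tried by hand.  the names gcb_active, gcb_send_calls and the USE_LD_WRAP guard are made up for illustration, and gcb_send() here is only a stand-in for the real GCB call.  the return type is ssize_t to match the POSIX prototype of send().  without --wrap nothing provides __real_send(), so the guard supplies a fallback that lets the file link on its own:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Resolved by GNU ld to the original send() when linking with --wrap=send. */
ssize_t __real_send(int s, const void *msg, size_t len, int flags);

#ifndef USE_LD_WRAP
/* Fallback for builds without --wrap, so this sketch still links. */
ssize_t __real_send(int s, const void *msg, size_t len, int flags)
{
    return send(s, msg, len, flags);
}
#endif

/* Hypothetical switch and call counter, for illustration only. */
int gcb_active = 0;
int gcb_send_calls = 0;

/* Hypothetical predicate; the real check might look at the environment. */
int is_gcb(void)
{
    return gcb_active;
}

/* Stand-in for the GCB library call: count the call, then send normally. */
ssize_t gcb_send(int s, const void *msg, size_t len, int flags)
{
    gcb_send_calls++;
    return __real_send(s, msg, len, flags);
}

/* With --wrap=send, every undefined reference to send() lands here. */
ssize_t __wrap_send(int s, const void *msg, size_t len, int flags)
{
    assert(msg != NULL || len == 0);
    if (is_gcb()) {
        return gcb_send(s, msg, len, flags);
    }
    return __real_send(s, msg, len, flags);
}
```

when driving the link through gcc, the flag would be passed as "-Wl,--wrap=send", with -DUSE_LD_WRAP given when compiling the stub so the fallback is left out and the linker supplies __real_send() itself.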


pros:
- works with the externals
- means we only have to modify link-lines, not condor source (except
  for the exceptions). ;)
- avoids all the trouble of extracting .o files, ToUpper, maintaining
  all these stubs, etc. 
- autoconf could determine if we're using the GNU linker on a given
  platform and the build system could enable all this automatically
  for the link line.

cons:
- only works on platforms where we can use the GNU linker to build
  condor.  this is unfortunate, but we think it's better to have
  maintainable GCB support on a subset of platforms than to have a
  nightmare on all platforms. ;)  if GCB becomes widely used and it
  becomes a problem that we don't support it on a platform where we
  can't use GNU ld, we can look into a hybrid approach of #1 and #2
  where we mostly use #2, but manually maintain a #1-style solution on
  the few platforms where #2 is impossible.


------------------------------------------------------------

------- End of Forwarded Message


------- Forwarded Message

Date:    Thu, 11 Mar 2004 12:08:22 CST
To:      Derek Wright <wright@xxxxxxxxxxx>
cc:      Todd Tannenbaum <tannenba@xxxxxxxxxxx>
From:    Se-Chang Son <sschang@xxxxxxxxxxx>
Subject: Re: getting GCB into 6.7.x


Hi Derek,

Since Todd is out of touch at the moment, let me bring this issue to you.

Todd and I decided to use the macro approach to make Condor GCBnized. 
The basic idea is as follows:
  1) create a header file that looks like:
     #define socket Generic_socket
     #define bind Generic_bind
     ...
  2) include the header file of 1) in "condor_common.h"
  3) Generic_* functions check environment variables and call either
     GCB_* or regular socket functions
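
As a rough illustration of step 3), a Generic_* function might look like the sketch below. The environment variable name GCB_ENABLE, the helper gcb_enabled(), and the counter are assumptions made up for this example, and GCB_send() here is only a stand-in that forwards to the regular send().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Counts calls into the stand-in below so the dispatch can be observed. */
int gcb_send_count = 0;

/* Hypothetical stand-in for the GCB library's send. */
ssize_t GCB_send(int s, const void *msg, size_t len, int flags)
{
    gcb_send_count++;
    return send(s, msg, len, flags);
}

/* Hypothetical switch: GCB is used only when GCB_ENABLE is set to "1". */
int gcb_enabled(void)
{
    const char *v = getenv("GCB_ENABLE");
    return v != NULL && strcmp(v, "1") == 0;
}

/* Checks the environment, then calls either the GCB or the regular send. */
ssize_t Generic_send(int s, const void *msg, size_t len, int flags)
{
    assert(msg != NULL || len == 0);
    if (gcb_enabled()) {
        return GCB_send(s, msg, len, flags);
    }
    return send(s, msg, len, flags);
}
```

Code that had `send(...)` before would reach Generic_send() through the `#define` in the header from step 1).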

With this approach, we believed that we could make Condor GCB-aware by
changing a relatively small number of places, and that this would help
you merge my changes to 6.7.0 easily.

However, I believe that it is becoming clear that the decision was 
wrong. Some reasons are:

1) Pete says that some platforms implement socket calls as macros. If
this is true, our approach may not work, or may be extremely difficult
to make work.
2) We have many classes whose member functions have the same names as
socket calls. A few examples are cedar, c++_util/memory_file,
ckpt/CondorFile (and all inherited classes), and ckpt/CondorFileTable.
To handle this, I have to rename those methods. Unfortunately, some of
these classes are used by the stub generator.
3) Some parts of Condor use the 'ifstream' and 'ofstream' classes, which
happen to have member functions with the same names as the calls I am
trying to redefine.
4) Some parts of Condor must not use GCB. For example, the checkpoint
routine must not use GCB functions after calculating the image size.
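
Reasons 2) and 3) come from the preprocessor renaming every matching identifier, not only calls to the socket function. The sketch below shows the effect in plain C, with a struct field standing in for a C++ member function. The struct memory_file_ops and the helpers are hypothetical names for this example.

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification, so macros are expanded before quoting. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* Before the redefinition, the identifier is still spelled "read". */
const char *name_before = STR(read);

/* What the GCB header from the macro approach would do. */
#define read Generic_read

/* Hypothetical struct: the field is written "read" in the source, but the
 * preprocessor silently turns it into a field named Generic_read. */
struct memory_file_ops {
    int (*read)(int fd);
};

/* After the redefinition, the same spelling expands to "Generic_read". */
const char *name_after = STR(read);

/* Trivial callback used to fill the field. */
int echo_fd(int fd)
{
    return fd;
}

/* Compiles only because the field really was renamed to Generic_read. */
int call_renamed_member(void)
{
    struct memory_file_ops ops = { .Generic_read = echo_fd };
    assert(ops.Generic_read != NULL);
    return ops.Generic_read(7);
}

/* Undo the redefinition so later code sees the normal name again. */
#undef read
```

In C this renaming is at least consistent within one file. The trouble starts when a declaration is seen before the `#define` and its uses after it, which is hard to rule out across many headers.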

Considering these cases and maybe others I have not found yet, I believe 
that this macro approach is no simpler for me, you, and team members 
than changing each socket call into a Generic call.

Do you think I should still wait for Todd's decision?

How hard will it be to merge my changes to 6.7 if I take the alternative 
way that replaces socket calls with Generic calls case by case?

Thank you.


------- End of Forwarded Message