HTCondor Project List Archives




Re: [Condor-devel] localhost on linux



I think #3 is the solution.

#1 isn't quite accurate because the installation is not really broken, it just is not assuming DNS is set up to give the machine a name -- a reasonable expectation

#2 this assumes the machine has a name in DNS and none of the domains in /etc/resolv.conf uses wildcard matching, so it's a questionable solution

#3 properly identifies that a collector essentially defines a pool, i.e. the IP that should be relevant is the one that is visible to the collector

#4 doesn't solve this problem, does it?

#5 is not easily portable, and makes a bad assumption that a pool is not on a private network (don't forget 10/8 and 172.16/12!)

A more fundamental question to ask is, why does Condor rely on DNS so much? Host-based authentication? What else?

Best,


matt

Nick LeRoy wrote:
On Tue September 11 2007, Douglas L Thain wrote:
Howdy all -

Following up on this, as I've been tasked with solving it.

As I'm sure everyone knows by now, Condor gets confused when
Linux has an /etc/hosts with the following configuration:

 	127.0.0.1 hedwig localhost localhost.localdomain

Why does Condor get confused?  It attempts to determine
the local IP address and full DNS name by the following procedure:

1 - Use uname to determine the short hostname. (hedwig)
2 - Use gethostbyname(shortname) to find the IP address. (129.74.162.69)
3 - Use gethostbyaddr(address) to find the full name. (hedwig.cse.nd.edu)

The problem is, step 2 fails if /etc/hosts has the above configuration,
because the IP address is found to be 127.0.0.1, and the full name
is found to be localhost.localdomain.  Hilarity ensues.
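
For illustration, the three-step lookup above can be sketched in Python. This is only a sketch of the procedure as described, not Condor's actual code; the function name `discover_identity` and the injectable resolver parameters are assumptions added so the failure can be reproduced without touching /etc/hosts.

```python
import os
import socket


def discover_identity(uname=lambda: os.uname().nodename,
                      by_name=socket.gethostbyname,
                      by_addr=socket.gethostbyaddr):
    """Return (short_name, ip, full_name) via the three-step lookup.

    The resolver arguments are hypothetical hooks; by default they call
    the real system functions.
    """
    short_name = uname()          # step 1: short hostname from uname
    ip = by_name(short_name)      # step 2: forward lookup of that name
    full_name = by_addr(ip)[0]    # step 3: reverse lookup of the address
    return short_name, ip, full_name


if __name__ == "__main__":
    # Simulate the problematic /etc/hosts entry: the short name maps to
    # the loopback address, so the reverse lookup yields localhost.
    print(discover_identity(
        uname=lambda: "hedwig",
        by_name=lambda name: "127.0.0.1",
        by_addr=lambda addr: ("localhost.localdomain", [], [addr]),
    ))
```

With the simulated resolvers, step 2 returns the loopback address and step 3 then reports localhost.localdomain as the full name, which is the confusion described above.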

The solution (which is suggested in the masterlog) is to
change /etc/hosts to be the following:

 	127.0.0.1 localhost localhost.localdomain

Now, if this was a rare problem, then that's probably enough said.

But, it's not rare any more.  EVERY Linux installation of Condor
that I have seen in the last three years has suffered from this
problem, causing long debugging sessions, frustrated users,
and people giving up because Condor "doesn't work".

May I suggest that it would be a good PR move to find a better
solution to this problem?

Some ideas for discussion:

1 - Instead of logging the problem in an obscure file, modify the Condor
tools to report an obnoxious error message that states your Linux
installation is BROKEN, and here is how to fix it.

Comments:  *Very* easy to implement.
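
As a rough sketch of what such a check might look like, the following Python scans hosts-file text for the machine's own name on a loopback line. The function name `loopback_alias_warning` and the wording of the message are hypothetical; this is not what the Condor tools print.

```python
def loopback_alias_warning(hosts_text, hostname):
    """Return a warning string if hostname is listed on a 127.* line.

    hosts_text is the content of an /etc/hosts style file. Returns None
    when no loopback line names the host.
    """
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()   # drop comments, tokenize
        if len(fields) < 2:
            continue                            # blank or malformed line
        addr, names = fields[0], fields[1:]
        if addr.startswith("127.") and hostname in names:
            return ("WARNING: /etc/hosts maps '%s' to loopback address %s; "
                    "remove the host name from that line." % (hostname, addr))
    return None


if __name__ == "__main__":
    bad = "127.0.0.1 hedwig localhost localhost.localdomain\n"
    print(loopback_alias_warning(bad, "hedwig"))
```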

2 - Extract more local information.  If the local uname is not a full
dns name, then infer the containing domain from /etc/resolv.conf,
and try to resolve that combination of names.

I don't think that this is viable... I don't think that we can rely on /etc/resolv.conf -- the host may not have a DNS record that matches its view of its own host name. Also, I don't think that there's an API for walking through the file. (I could be wrong on this point, however).


3 - Take the naming system out of the loop.  Make a TCP connection to
the collector as if doing a condor_status, then a getsockname to
get the local IP address with respect to the collector.

Not sure what I think about this one. Like Greg's #4 below, this probably doesn't correctly handle the case in which the CM is also running a startd and/or schedd.
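
To make idea #3 concrete, here is a minimal Python sketch: open a TCP connection to the collector and ask the socket which local address the kernel chose. The function name `local_address_toward` is hypothetical, and the host and port are whatever the pool's collector actually uses; nothing here is taken from Condor's code.

```python
import socket


def local_address_toward(host, port, timeout=5.0):
    """Return the local IP address used for a TCP connection to host:port."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # getsockname() returns the local end of the connection; the
        # first element of the tuple is the local IP address.
        return sock.getsockname()[0]
```

Note that when the collector runs on the same machine and is contacted over loopback, this returns a loopback address, which is the central-manager case raised above.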

Thoughts?  Ideas?

To these, we could add:

4 - (Greg T) Change BIND_ALL_INTERFACES to default as true. This does solve the problem in many cases.


5 - Modify the code to walk through all available interfaces, and rank them based on the "publicness" of each one's IP address. 127.* interfaces would get the lowest score, 192.168.* (and related addresses) would get a middle score, and truly public addresses would get the highest score. Condor would then bind to the interface with the highest score. For Linux, we can gather this info via the netdevice(7) ioctls -- I don't know yet how we could gather this info on other platforms (other than, I suppose, running /sbin/ifconfig and parsing its output). This behavior could be turned off via a configuration knob (yeah, another one), perhaps something like BIND_BEST_INTERFACE = true/false.
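
A minimal sketch of the scoring part of #5, in Python and for IPv4 only. It assumes the list of interface addresses has already been gathered by some other means; the names `publicness` and `best_address` and the 0/1/2 scores are illustrative choices, not an existing Condor interface.

```python
import ipaddress

LOOPBACK_NET = ipaddress.ip_network("127.0.0.0/8")
# RFC 1918 private ranges: the "192.168.* and related" middle tier.
PRIVATE_NETS = [ipaddress.ip_network(n)
                for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def publicness(addr):
    """Score an IPv4 address: 0 = loopback, 1 = private, 2 = public."""
    ip = ipaddress.ip_address(addr)
    if ip in LOOPBACK_NET:
        return 0
    if any(ip in net for net in PRIVATE_NETS):
        return 1
    return 2


def best_address(addrs):
    """Return the address with the highest publicness score."""
    return max(addrs, key=publicness)


if __name__ == "__main__":
    print(best_address(["127.0.0.1", "192.168.1.5", "129.74.162.69"]))
```

On a pool that lives entirely on a private network, every candidate scores 1 at best, so the ranking alone cannot tell which private address the collector can actually reach.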


#4 doesn't handle the case in which the machine in question is the central manager and is running a startd and/or schedd. The ads from these would have the address of the loopback (over which they would contact the collector).


Thoughts?

Also, I've now modified daemon core and Sock and a couple of other places to handle loopbacks other than 127.0.0.1 (the code now masks the address with 255.0.0.0 to check whether it is a loopback). This is to handle systems like openSUSE which now add another address of 127.0.0.2 -- in /etc/hosts, the .1 address is "localhost", and the .2 represents the host name (i.e. "127.0.0.2 myhost"). I haven't committed this code yet, but it does appear to work.
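
The mask check described above could look roughly like this in Python. It is only a sketch of the technique (AND the address with 255.0.0.0 and compare against 127.0.0.0), not the uncommitted daemon core change itself; the name `is_loopback` is an assumption.

```python
import socket
import struct

LOOPBACK_NET = 0x7F000000    # 127.0.0.0
LOOPBACK_MASK = 0xFF000000   # 255.0.0.0


def is_loopback(addr):
    """Return True if the dotted-quad IPv4 address is in 127.0.0.0/8."""
    # inet_aton gives 4 bytes in network order; unpack as one unsigned int.
    (value,) = struct.unpack("!I", socket.inet_aton(addr))
    return (value & LOOPBACK_MASK) == LOOPBACK_NET


if __name__ == "__main__":
    for a in ("127.0.0.1", "127.0.0.2", "129.74.162.69"):
        print(a, is_loopback(a))
```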

Thanks

-Nick