HTCondor Project List Archives


Re: [Condor-devel] localhost on linux

Date: Thu, 04 Oct 2007 11:22:15 -0500
From: Matthew Farrellee <matt@xxxxxxxxxxx>
Subject: Re: [Condor-devel] localhost on linux

I think #3 is the solution.

#1 isn't quite accurate because the installation is not really broken; it just is not assuming DNS is set up to give the machine a name -- a reasonable expectation

#2 this assumes the machine has a name in DNS and none of the domains in /etc/resolv.conf uses wildcard matching, so it's a questionable solution

#3 properly identifies that a collector essentially defines a pool, i.e. the IP that should be relevant is the one that is visible to the collector


#4 doesn't solve this problem, does it?

#5 is not easily portable, and makes a bad assumption that a pool is not on a private network (don't forget 10/8 and 172.16/12!)

A more fundamental question to ask is, why does Condor rely on DNS so much? Host-based authentication? What else?


Best,


matt

Nick LeRoy wrote:

On Tue September 11 2007, Douglas L Thain wrote:
Howdy all -
Following up on this, as I've been tasked with solving it.
As I'm sure everyone knows by now, Condor gets confused when
Linux has an /etc/hosts with the following configuration:

 	127.0.0.1 hedwig localhost localhost.localdomain

Why does Condor get confused?  It attempts to determine
the local IP address and full DNS name by the following procedure:

1 - Use uname to determine the short hostname. (hedwig)
2 - Use gethostbyname(shortname) to find the IP address. (129.74.162.69)
3 - Use gethostbyaddr(address) to find the full name. (hedwig.cse.nd.edu)

The problem is, step 2 fails if /etc/hosts has the above configuration,
because the IP address is found to be 127.0.0.1, and the full name
is found to be localhost.localdomain.  Hilarity ensues.
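
For readers following along, the three-step lookup described above can be sketched in Python. This is an illustration only, not Condor's code: socket.gethostname() stands in for the uname call, and the function name guess_identity and its injectable resolver parameters are made up for the example.

```python
import socket


def guess_identity(get_short_name=socket.gethostname,
                   by_name=socket.gethostbyname,
                   by_addr=socket.gethostbyaddr):
    """Return (short_name, ip, full_name) using the three lookups in order.

    The resolver arguments are hypothetical hooks so the behaviour can be
    shown without depending on the local /etc/hosts or DNS.
    """
    short_name = get_short_name()      # step 1: short hostname
    ip = by_name(short_name)           # step 2: name -> IPv4 address
    full_name = by_addr(ip)[0]         # step 3: address -> primary name
    return short_name, ip, full_name


# Simulate an /etc/hosts that maps the short name to 127.0.0.1.
broken = guess_identity(
    get_short_name=lambda: "hedwig",
    by_name=lambda name: "127.0.0.1",
    by_addr=lambda addr: ("localhost.localdomain", [], [addr]),
)
print(broken)
```

With the simulated resolvers, step 2 yields the loopback address and step 3 yields localhost.localdomain, which is the confusion described above.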

The solution (which is suggested in the masterlog) is to
change /etc/hosts to be the following:

 	127.0.0.1 localhost localhost.localdomain

Now, if this was a rare problem, then that's probably enough said.

But, it's not rare any more.  EVERY Linux installation of Condor
that I have seen in the last three years has suffered from this
problem, causing long debugging sessions, frustrated users,
and people giving up because Condor "doesn't work".

May I suggest that it would be a good PR move to find a better
solution to this problem?

Some ideas for discussion:

1 - Instead of logging the problem in an obscure file, modify the Condor
tools to report an obnoxious error message that states your Linux
installation is BROKEN, and here is how to fix it.
Comments:  *Very* easy to implement.
2 - Extract more local information.  If the local uname is not a full
dns name, then infer the containing domain from /etc/resolv.conf,
and try to resolve that combination of names.
I don't think that this is viable... I don't think that we can rely on /etc/resolv.conf -- the host may not have a DNS record that matches its view of its own host name. Also, I don't think that there's an API for walking through the file. (I could be wrong on this point, however).
3 - Take the naming system out of the loop.  Make a TCP connection to
the collector as if doing a condor_status, then a getsockname to
get the local IP address with respect to the collector.
Not sure what I think about this one. Like Greg's #4 below, this probably doesn't correctly handle the case in which the CM is also running a startd and/or schedd.
Thoughts?  Ideas?
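
A minimal Python sketch of idea 3, assuming the collector's host and port are already known from configuration. The function name local_ip_toward is hypothetical, and this is not how Condor itself is implemented; it only shows the connect-then-getsockname technique.

```python
import socket


def local_ip_toward(collector_host, collector_port, timeout=5.0):
    """Open a TCP connection to the collector and return the local address
    the kernel chose for it, without consulting DNS for our own name."""
    with socket.create_connection((collector_host, collector_port),
                                  timeout=timeout) as conn:
        return conn.getsockname()[0]
```

Note that when the collector runs on the same machine and is reached over the loopback, this returns a 127.* address, which is the central manager case raised in the comment on idea 3.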
To these, we could add:
4 - (Greg T) Change BIND_ALL_INTERFACES to default to true. This does solve the problem in many cases.
5 - Modify the code to walk through all available interfaces, and rank them based on the "publicness" of each IP address. 127.* interfaces would get the lowest score, 192.168.* (and related addresses) would get a middle score, and truly public addresses would get the highest score. Condor would then bind to the interface with the highest score. For Linux, we can gather this info via the ioctls described in netdevice(7) -- I don't know yet how we could gather this info on other platforms (other than, I suppose, running /sbin/ifconfig and parsing its output). This behavior could be turned off via a configuration knob (yeah, another one), perhaps something like BIND_BEST_INTERFACE = true/false.
#4 doesn't handle the case in which the machine in question is the central manager and is running a startd and/or schedd; the ads from these would have the address of the loopback (over which they would contact the collector).
Thoughts?
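
The scoring in idea 5 might look roughly like the Python sketch below. It assumes IPv4 only and that the candidate addresses have already been collected by some platform-specific means; the names publicness and best_address and the 0/1/2 scores are invented for illustration. The private ranges used are 10/8, 172.16/12, and 192.168/16.

```python
import ipaddress

# Private IPv4 ranges treated as "middle score" addresses.
PRIVATE_NETS = [ipaddress.ip_network(n)
                for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def publicness(addr):
    """Score an IPv4 address: 0 = loopback, 1 = private, 2 = anything else."""
    ip = ipaddress.ip_address(addr)
    if ip.is_loopback:
        return 0
    if any(ip in net for net in PRIVATE_NETS):
        return 1
    return 2


def best_address(candidates):
    """Pick the highest-scoring address; ties go to the earliest candidate."""
    return max(candidates, key=publicness)


print(best_address(["127.0.0.1", "192.168.1.5", "129.74.162.69"]))
```

A pool that lives entirely on a private network would still get an answer here, since a private address outranks the loopback when no public one exists.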
Also, I've now modified daemon core and Sock and a couple of other places to handle loopbacks other than 127.0.0.1 (it now masks it with 255.0.0.0 to check whether the given address is a loopback). This is to handle systems like OpenSuse which now add another address of 127.0.0.2 -- in /etc/hosts, the .1 address is "localhost", and the .2 represents the host name (i.e. "127.0.0.2 myhost"). I haven't committed this code yet, but it does appear to work.
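
The mask check described above can be illustrated in a few lines of Python. This is a sketch of the technique only, not the daemon core change itself, and the function name is_loopback is an assumption.

```python
import socket
import struct


def is_loopback(addr):
    """True if the dotted-quad IPv4 address falls in 127.0.0.0/8."""
    # Convert to a 32-bit integer in network byte order.
    (value,) = struct.unpack("!I", socket.inet_aton(addr))
    # Mask with 255.0.0.0 and compare against 127.0.0.0.
    return (value & 0xFF000000) == 0x7F000000
```

Both 127.0.0.1 and the 127.0.0.2 entry mentioned above pass this check, while an address such as 129.74.162.69 does not.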
Thanks

-Nick
