On Tue September 11 2007, Douglas L Thain wrote:
Howdy all -
Following up on this, as I've been tasked with solving it.
As I'm sure everyone knows by now, Condor gets confused when
Linux has an /etc/hosts with the following configuration:
127.0.0.1 hedwig localhost localhost.localdomain
Why does Condor get confused? It attempts to determine
the local IP address and full DNS name by the following procedure:
1 - Use uname to determine the short hostname. (hedwig)
2 - Use gethostbyname(shortname) to find the IP address. (129.74.162.69)
3 - Use gethostbyaddr(address) to find the full name. (hedwig.cse.nd.edu)
The problem is, step 2 fails if /etc/hosts has the above configuration,
because the IP address is found to be 127.0.0.1, and the full name
is found to be localhost.localdomain. Hilarity ensues.
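
For anyone who wants to see the failure on their own box, the three steps above can be sketched roughly as follows. This is only an illustration in Python, not Condor's actual code: socket.gethostname() stands in for the uname call, and the function names and the injectable resolver arguments are made up here so the logic can be exercised without touching the real resolver.

```python
import socket

def discover_identity(get_short=socket.gethostname,
                      name_to_ip=socket.gethostbyname,
                      ip_to_name=socket.gethostbyaddr):
    """Return (short_name, ip, full_name) following the three steps above."""
    short = get_short()        # step 1: short hostname, e.g. "hedwig"
    ip = name_to_ip(short)     # step 2: forward lookup (often answered from /etc/hosts)
    full = ip_to_name(ip)[0]   # step 3: reverse lookup; element 0 is the primary name
    return short, ip, full

def looks_confused(ip):
    """True if the forward lookup landed on a loopback (127.*) address."""
    return ip.startswith("127.")

if __name__ == "__main__":
    short, ip, full = discover_identity()
    print(short, ip, full, "(loopback!)" if looks_confused(ip) else "")
```

With the /etc/hosts line shown at the top, step 2 is answered with 127.0.0.1, so looks_confused() would return True for the result.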
The solution (which is suggested in the masterlog) is to
change /etc/hosts to be the following:
127.0.0.1 localhost localhost.localdomain
Now, if this were a rare problem, then that's probably enough said.
But, it's not rare any more. EVERY Linux installation of Condor
that I have seen in the last three years has suffered from this
problem, causing long debugging sessions, frustrated users,
and people giving up because Condor "doesn't work".
May I suggest that it would be a good PR move to find a better
solution to this problem?
Some ideas for discussion:
1 - Instead of logging the problem in an obscure file, modify the Condor
tools to report an obnoxious error message that states your Linux
installation is BROKEN, and here is how to fix it.
Comments: *Very* easy to implement.
2 - Extract more local information. If the local uname is not a full
dns name, then infer the containing domain from /etc/resolv.conf,
and try to resolve that combination of names.
I don't think that this is viable... I don't think that we can rely
on /etc/resolv.conf -- the host may not have a DNS record that matches its
view of its own host name. Also, I don't think that there's an API for
walking through the file. (I could be wrong on this point, however).
3 - Take the naming system out of the loop. Make a TCP connection to
the collector as if doing a condor_status, then a getsockname to
get the local IP address with respect to the collector.
Not sure what I think about this one. Like Greg's #4 below, this probably
doesn't correctly handle the case in which the CM is also running a startd
and/or schedd.
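
To make idea 3 concrete, here is a minimal sketch of the connect-then-getsockname trick, assuming the collector's host and port are already known from the configuration. The function name and its parameters are hypothetical and this is not how Condor does it today.

```python
import socket

def local_address_toward(host, port, timeout=5.0):
    """Connect to (host, port) and return the local IP the kernel picked for that route."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # getsockname() returns (address, port, ...) for our end of the connection.
        return sock.getsockname()[0]
```

Note that when the peer is on the same machine and is reached over the loopback, this returns a 127.* address, which is exactly the central manager case described above.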
Thoughts? Ideas?
To these, we could add:
4 - (Greg T) Change BIND_ALL_INTERFACES to default as true. This does solve
the problem in many cases.
5 - Modify the code to walk through all available interfaces, and rank them
based on the "publicness" of each interface's IP address. 127.* interfaces
would get the lowest score, 192.168.* (and related private addresses) would
get a middle score, and truly public addresses would get the highest score.
Condor would then bind to the interface with the highest score. For Linux,
we can gather this info via the ioctls described in netdevice(7) -- I don't
know yet how we could gather this info on other platforms (other than, I
suppose, running /sbin/ifconfig and parsing its output). This behavior could
be turned off via a configuration knob (yeah, another one), perhaps something
like BIND_BEST_INTERFACE = true/false.
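
The scoring part of idea 5 could look something like the sketch below. It assumes the interface list has already been gathered somehow (that platform-specific part is not shown), and the function names and the 0/1/2 scores are just placeholders for discussion.

```python
import ipaddress

def publicness(addr):
    """Score an IPv4 address: 0 = loopback, 1 = private, 2 = public."""
    ip = ipaddress.ip_address(addr)
    if ip.is_loopback:    # all of 127.0.0.0/8
        return 0
    if ip.is_private:     # 10/8, 172.16/12, 192.168/16 and other reserved ranges
        return 1
    return 2

def best_interface(addresses):
    """addresses maps interface name -> IPv4 string; return the best (name, addr) pair."""
    return max(addresses.items(), key=lambda item: publicness(item[1]))
```

Ties go to whichever interface comes first in the mapping, so a real implementation would probably want a tie-breaking rule as well.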
#4 doesn't handle the case in which the machine in question is the central
manager and is also running a startd and/or schedd: the ads from these would
carry the address of the loopback (over which they would contact the collector).
Thoughts?
Also, I've now modified daemon core and Sock and a couple of other places to
handle loopbacks other than 127.0.0.1 (the code now masks the address with
255.0.0.0 to check whether it is a loopback). This is to handle systems
like openSUSE which now add another address of 127.0.0.2 -- in /etc/hosts,
the .1 address is "localhost", and the .2 represents the host name
(i.e. "127.0.0.2 myhost"). I haven't committed this code yet, but it does
appear to work.
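
For reference, the mask check amounts to something like this. It is a Python illustration of the idea only, not the uncommitted daemon core change itself, and the constant and function names are invented for the example.

```python
import socket
import struct

LOOPBACK_NET = 0x7F000000   # 127.0.0.0
CLASS_A_MASK = 0xFF000000   # 255.0.0.0

def is_loopback(addr):
    """True if the dotted-quad IPv4 address falls anywhere in 127.0.0.0/8."""
    (value,) = struct.unpack("!I", socket.inet_aton(addr))  # network order -> integer
    return (value & CLASS_A_MASK) == LOOPBACK_NET
```

So both 127.0.0.1 and the 127.0.0.2 entry are treated as loopbacks, while anything outside 127.* is not.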
Thanks
-Nick