HTCondor Project List Archives




[Condor-devel] localhost on linux




Howdy all -

As I'm sure everyone knows by now, Condor gets confused when
Linux has an /etc/hosts with the following configuration:

	127.0.0.1 hedwig localhost localhost.localdomain

Why does Condor get confused?  It attempts to determine
the local IP address and full DNS name by the following procedure:

1 - Use uname to determine the short hostname. (hedwig)
2 - Use gethostbyname(shortname) to find the IP address. (129.74.162.69)
3 - Use gethostbyaddr(address) to find the full name. (hedwig.cse.nd.edu)

The problem is that step 2 goes wrong when /etc/hosts has the above
configuration: the IP address is found to be 127.0.0.1, and the full
name is then found to be localhost.localdomain.  Hilarity ensues.
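
To make the procedure concrete, here is a rough Python sketch of the three steps using only the standard socket module. It is an illustration, not Condor's actual code: the function names and the injectable resolver arguments are invented for the example, so the logic can be tried without touching the real resolver.

```python
import ipaddress
import os
import socket


def discover_identity(uname=lambda: os.uname().nodename,
                      by_name=socket.gethostbyname,
                      by_addr=socket.gethostbyaddr):
    """Return (short_name, ip, full_name) following the three steps.

    The keyword arguments are hypothetical hooks; by default they call
    the real uname and resolver functions.
    """
    short_name = uname()            # step 1: e.g. "hedwig"
    ip = by_name(short_name)        # step 2: e.g. "129.74.162.69"
    full_name = by_addr(ip)[0]      # step 3: e.g. "hedwig.cse.nd.edu"
    return short_name, ip, full_name


def is_loopback(ip):
    """True if the address from step 2 is a loopback address."""
    return ipaddress.ip_address(ip).is_loopback
```

A caller could run discover_identity() and treat is_loopback(ip) being true as the sign that the short hostname has been mapped onto 127.0.0.1.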

The solution (which is suggested in the MasterLog) is to
change /etc/hosts to be the following:

	127.0.0.1 localhost localhost.localdomain
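
A check for the bad configuration could look something like the Python sketch below. It is only a guess at how such a check might be written, not anything Condor does today; the function name and the set of acceptable loopback names are assumptions made for the example.

```python
import ipaddress

# Names that are expected on a loopback line (an assumption for this sketch).
LOCAL_NAMES = {"localhost", "localhost.localdomain"}


def loopback_aliases(hosts_text):
    """Return names other than localhost that map to an IPv4 loopback address."""
    offenders = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()    # drop comments
        if not line:
            continue
        fields = line.split()
        addr, names = fields[0], fields[1:]
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            continue                             # not a plain IP address; skip
        if ip.version == 4 and ip.is_loopback:
            offenders += [n for n in names if n.lower() not in LOCAL_NAMES]
    return offenders
```

Feeding it the contents of /etc/hosts and getting back a non-empty list (for example ["hedwig"]) would indicate that the machine's own name is sitting on the loopback line.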

Now, if this were a rare problem, then that's probably enough said.

But, it's not rare any more.  EVERY Linux installation of Condor
that I have seen in the last three years has suffered from this
problem, causing long debugging sessions, frustrated users,
and people giving up because Condor "doesn't work".

May I suggest that it would be a good PR move to find a better
solution to this problem?

Some ideas for discussion:

1 - Instead of logging the problem in an obscure file, modify the
Condor tools to report an obnoxious error message that states your
Linux installation is BROKEN, and here is how to fix it.

2 - Extract more local information.  If the name returned by uname is
not a full DNS name, then infer the containing domain from
/etc/resolv.conf, and try to resolve that combination of names.

3 - Take the naming system out of the loop.  Make a TCP connection to
the collector as if doing a condor_status, then a getsockname to
get the local IP address with respect to the collector.
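
Ideas 2 and 3 can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a proposed patch: both function names are made up, idea 2 is shown working on the text of resolv.conf rather than reading the file itself, and the host and port handed to the second function stand in for wherever the collector actually lives.

```python
import socket


def candidate_fqdns(short_name, resolv_conf_text):
    """Idea 2: pair a short hostname with domains listed in resolv.conf."""
    if "." in short_name:
        return [short_name]                  # already looks fully qualified
    domains = []
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        # "domain" and "search" lines carry the local domain name(s).
        if len(fields) >= 2 and fields[0] in ("domain", "search"):
            for d in fields[1:]:
                if d not in domains:
                    domains.append(d)
    return [f"{short_name}.{d}" for d in domains]


def local_ip_toward(host, port, timeout=5.0):
    """Idea 3: return the local address the kernel uses to reach host:port."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        return s.getsockname()[0]
```

Each name from candidate_fqdns() would still have to be resolved and checked against the loopback range. local_ip_toward() skips the naming system entirely, at the cost of needing the collector to be reachable at startup.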

Thoughts?  Ideas?

Doug