Many thanks for the quick reply on this. Unfortunately the Uni is still in lockdown at present, so it's difficult to actually go in and do any hands-on testing. Having said that, it *may* be possible to log in remotely to some machines and have a peek at what's going on. I know we do have some machines with IPv6 addresses (but they have IPv4 as well), so that may be a cause.
The comment about the service startup order is interesting. If this isn't explicitly set then I could imagine a race condition between HTCondor and the network service, which would explain why some machines get the correct interface address and some get the loopback. I'll get back to you when I have some more information.
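
For when remote login is possible, here is a minimal sketch of the check suggested in the quoted reply: run condor_config_val IP_ADDRESS on an execute host and test whether the answer is a loopback address. The function names are made up for illustration, and it assumes Python is available on the host and condor_config_val is on the PATH; it has not been run against our pool.

```python
import ipaddress
import subprocess


def is_loopback(address: str) -> bool:
    """Return True if the string parses as a loopback IP address (IPv4 or IPv6)."""
    try:
        return ipaddress.ip_address(address.strip()).is_loopback
    except ValueError:
        # Not an IP literal at all (e.g. empty output or an error message).
        return False


def configured_ip_address() -> str:
    """Ask HTCondor which address it picked (assumes condor_config_val is on PATH)."""
    result = subprocess.run(
        ["condor_config_val", "IP_ADDRESS"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

On a host, is_loopback(configured_ip_address()) returning True would match the symptom in the collector log. Running it once straight after a wake from hibernation and again a minute later might show whether the address changes once the network is up.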
thanks again,
-ian.
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: 21 May 2020 15:43:12
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] execute hosts advertise loopback address

If you log in to one of these machines and run

    condor_config_val IP_ADDRESS

is the result 127.0.0.1? This would indicate that HTCondor is unable to determine which interface is external, OR that it has been explicitly configured to bind only to the loopback.

Try

    condor_config_val -dump NETWORK

Is NETWORK_INTERFACE set to something?

Do the public interfaces of these machines perhaps have IPv4 disabled, so they are IPv6 only?
A newer HTCondor like 8.8.9 will have better support for IPv6, including the ability to prefer it or to prefer IPv4.

If you restart condor on the machine, does it continue to advertise the loopback? If so, the problem may be that the network is not initialized, and so only the loopback can be found at the time that condor starts up. You might also want to check in the services control panel to make sure that HTCondor is not started until after the network service. This should be set up automatically by the MSI installer package, but it's worth checking.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Smith, Ian

Hello All,

I have come across a very strange problem with our HTCondor pool whereby *some* execute hosts advertise the loopback address as the address of the startd, as evidenced by this from CollectorLog:

    05/21/20 14:09:35 StartdAd : Inserting ** "< slot1@xxxxxxxxxxxxxxxxxxxxxxxx , 127.0.0.1 >"

Some execute hosts work fine and advertise their correct address, whereas a substantial number advertise the loopback, and I believe there are even examples of both on the same subnet. The execute hosts all run Windows 10 and HTCondor version 8.4.6 and employ power saving so that idle machines (viz. no local user use or HTCondor use) go into hibernation after approx. 10 minutes.
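
To get a rough count of how many hosts are affected, something like the following could be run over the CollectorLog. It is only a sketch: the pattern is modelled on the single log line quoted above (the real spacing may differ, so whitespace is matched loosely), and the name loopback_ads is made up for illustration.

```python
import ipaddress
import re

# Pattern modelled on the one CollectorLog line quoted in this message;
# an assumption, not a documented log format.
AD_RE = re.compile(r'StartdAd\s*:\s*Inserting\s*\*\*\s*"<\s*(\S+)\s*,\s*(\S+)\s*>"')


def loopback_ads(lines):
    """Yield (slot_name, address) for StartdAd insertions carrying a loopback address."""
    for line in lines:
        match = AD_RE.search(line)
        if not match:
            continue  # not a StartdAd insertion line
        name, address = match.groups()
        try:
            if ipaddress.ip_address(address).is_loopback:
                yield name, address
        except ValueError:
            continue  # address field was not a plain IP literal
```

Passing an open file handle for the CollectorLog (wherever it lives on the central manager) as the lines argument and collecting the slot names into a set would give the list of hosts to compare against those advertising a correct address.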
A typical scenario is that I wake a machine to run a job, and the machine advertises its loopback address to the collector. The negotiator either finds a match or ignores the loopback (not quite sure which), but in any case the job never starts on the execute host and so the host returns to hibernation.

I turned up this submission to htcondor-users in the archives, but it seems pretty old (Windows XP) and doesn't seem to come up with a satisfactory solution:

Any suggestions would be extremely useful as I'm totally baffled by this.

regards,
-ian.

Dr Ian C. Smith,
Condor Manager,
Advanced Research Computing,
University of Liverpool
UK.