
[Condor-users] Selected NETWORK_INTERFACE not always honored?



Hi,

We seem to have cases where the configured NETWORK_INTERFACE is not always honored, and I'd be grateful if anyone can shed some light on why this might happen. Our grid uses many flocked pools, all of which operate on RFC 1918 addresses in 172.24.*.* networks, so even if a machine has a global IP address (on, say, eth0), it is given an additional virtual interface (eth0:1) on the same physical NIC. Condor is then forced to bind to the private address by setting NETWORK_INTERFACE in the condor_config.local file. However, there seem to be times when a submit machine uses its global IP address instead to communicate with an execute node, and so falls foul of any firewall in between that is configured to let through only traffic originating from RFC 1918 addresses. Here's a snippet from a ShadowLog when this happens:

2/6 01:40:02 (65.172) (10961):connect returns -1, errno = 110
2/6 01:40:02 (65.172) (10961):failed to connect to scheduler on <172.24.116.237:10265>

2/6 01:40:04 (65.159) (10962):connect returns -1, errno = 110
2/6 01:40:04 (65.159) (10962):failed to connect to scheduler on <172.24.116.251:10439>

2/6 01:40:08 (65.153) (10964):connect returns -1, errno = 110
2/6 01:40:08 (65.153) (10964):failed to connect to scheduler on <172.24.116.224:9640>

2/6 01:40:10 (65.152) (10965):connect returns -1, errno = 110
2/6 01:40:10 (65.152) (10965):failed to connect to scheduler on <172.24.116.199:9675>

2/6 01:40:12 (65.166) (10966):connect returns -1, errno = 110
2/6 01:40:12 (65.166) (10966):failed to connect to scheduler on <172.24.116.233:9825>
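For anyone reading the log above: errno 110 on Linux is ETIMEDOUT, which fits a firewall silently dropping the packets rather than actively refusing them. As a rough aid for cross-checking ShadowLog entries against firewall logs, here is a minimal Python sketch that pulls the target addresses out of lines like these and tests whether an address falls in the RFC 1918 ranges. The function names and the regular expression are illustrative assumptions based only on the log format shown here; they are not part of Condor.

```python
import ipaddress
import re

# The three private ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

# Assumed pattern, modelled on the ShadowLog lines quoted in this post.
FAILED_CONNECT = re.compile(
    r"failed to connect to scheduler on <(\d+\.\d+\.\d+\.\d+):(\d+)>"
)


def is_rfc1918(address):
    """Return True if the IPv4 address lies in one of the RFC 1918 ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def failed_targets(log_text):
    """Return (ip, port) pairs for every failed-connect line in log_text."""
    return [(ip, int(port)) for ip, port in FAILED_CONNECT.findall(log_text)]


if __name__ == "__main__":
    sample = (
        "2/6 01:40:02 (65.172) (10961):failed to connect to scheduler "
        "on <172.24.116.237:10265>\n"
    )
    for ip, port in failed_targets(sample):
        print(ip, port, "RFC 1918" if is_rfc1918(ip) else "global")
```

The same is_rfc1918 check can be run over the source addresses in the firewall log to pick out connections that left from the global address.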

The log for the firewall protecting the execute nodes confirms that these connections were attempted by the submit host from its global IP address, and not from the one set by NETWORK_INTERFACE. This occurs after the jobs have started, so the initial communication must have used the correct source address to get through, which is what I find confusing. Why should the submit host suddenly start using its other, non-nominated, IP address? I'm also confused as to why the submit node is trying to communicate with a *scheduler*, but one thing at a time. These are all Linux boxes running Condor 6.8.3.
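To make the expected behaviour concrete: on a multi-homed host, the source address of an outgoing TCP connection is chosen by the kernel's routing table unless the socket is explicitly bound to a local address before connecting. The sketch below shows that general technique in Python. It is only an illustration of what binding to a nominated address means at the socket level, not a description of how Condor implements NETWORK_INTERFACE; connect_from is a hypothetical helper name.

```python
import socket


def connect_from(source_ip, dest, timeout=5.0):
    """Open a TCP connection to dest=(host, port) using source_ip as the
    local address, instead of letting the routing table choose one."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.settimeout(timeout)
        # Port 0 lets the kernel pick an ephemeral port on the chosen address.
        sock.bind((source_ip, 0))
        sock.connect(dest)
    except OSError:
        sock.close()
        raise
    return sock


# Hypothetical usage on a submit host (addresses are placeholders):
#   sock = connect_from("172.24.x.x", ("172.24.116.237", 10265))
#   print(sock.getsockname())  # local address actually used
```

If any code path opens a socket without the bind step, the connection leaves from whichever address the default route selects, which on these machines would be the global one.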

Thanks for any help,
Mark