[Condor-users] Selected NETWORK_INTERFACE not always honored?
- Date: Tue, 06 Feb 2007 15:14:22 +0000
- From: Mark Calleja <M.Calleja@xxxxxxxxxxxxxxx>
- Subject: [Condor-users] Selected NETWORK_INTERFACE not always honored?
Hi,
We seem to have cases where the NETWORK_INTERFACE defined is not always
honored, and I'd be grateful if anyone can shed some light as to why
this might happen. Our grid uses many flocked pools, all of which
operate on a number of RFC 1918 addresses in 172.24.*.* networks, so even
if machines have global IP addresses (on, say, eth0), they are given
additional virtual interfaces on eth0:1 but using the same physical NIC.
Condor is then forced to bind to these addresses by setting
NETWORK_INTERFACE in the condor_config.local files. However, there seem
to be times when a submit machine may use its global IP address instead
to communicate with an execute node, causing it to fall foul of any
firewall in between which is configured to only allow through the
traffic originating from RFC 1918 addresses. Here's a snippet from a
ShadowLog when this happens:
2/6 01:40:02 (65.172) (10961):connect returns -1, errno = 110
2/6 01:40:02 (65.172) (10961):failed to connect to scheduler on <172.24.116.237:10265>
2/6 01:40:04 (65.159) (10962):connect returns -1, errno = 110
2/6 01:40:04 (65.159) (10962):failed to connect to scheduler on <172.24.116.251:10439>
2/6 01:40:08 (65.153) (10964):connect returns -1, errno = 110
2/6 01:40:08 (65.153) (10964):failed to connect to scheduler on <172.24.116.224:9640>
2/6 01:40:10 (65.152) (10965):connect returns -1, errno = 110
2/6 01:40:10 (65.152) (10965):failed to connect to scheduler on <172.24.116.199:9675>
2/6 01:40:12 (65.166) (10966):connect returns -1, errno = 110
2/6 01:40:12 (65.166) (10966):failed to connect to scheduler on <172.24.116.233:9825>
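For reference, a minimal Python sketch for reading the excerpt above. It pairs each "connect returns" errno with the address on the matching "failed to connect" line and looks up the symbolic errno name. The function name `summarize` and the regular expressions are illustrative only and are not part of Condor. Errno numbers are platform-specific; on Linux, 110 is ETIMEDOUT, which would be consistent with a firewall silently dropping the packets, though that reading is an inference rather than something the log states.

```python
import errno
import re

# Two lines taken from the ShadowLog excerpt above.
LOG = """\
2/6 01:40:02 (65.172) (10961):connect returns -1, errno = 110
2/6 01:40:02 (65.172) (10961):failed to connect to scheduler on <172.24.116.237:10265>
"""

# The number in the second pair of parentheses is used as the key that
# ties the two log lines together (assumed here to be the shadow's PID).
ERRNO_RE = re.compile(r"\((\d+)\):connect returns -1, errno = (\d+)")
# \s+ before '<' also tolerates the line wrapping introduced by mail clients.
TARGET_RE = re.compile(
    r"\((\d+)\):failed to connect to scheduler on\s+<([\d.]+):(\d+)>"
)


def summarize(log_text):
    """Return a list of (host, port, errno_number, errno_name) tuples."""
    errnos = {key: int(code) for key, code in ERRNO_RE.findall(log_text)}
    results = []
    for key, host, port in TARGET_RE.findall(log_text):
        code = errnos.get(key)
        # errno.errorcode maps numbers to names for the local platform only.
        name = errno.errorcode.get(code, "unknown")
        results.append((host, int(port), code, name))
    return results


if __name__ == "__main__":
    for row in summarize(LOG):
        print(row)
```

Run on the same platform that produced the log, since the number-to-name mapping differs between operating systems.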
The log for the firewall protecting the execute nodes confirms that
these connections were attempted by the submit host on its global IP
address, and not the one set by NETWORK_INTERFACE. This occurs after the
jobs have started, so initial comms must have used the correct source
address to get through, which is what I find confusing. Why should the
submit host suddenly start using its other, non-nominated, IP address?
I'm also confused as to why the submit node is trying to communicate with
a *scheduler*, but one thing at a time. These are all Linux boxes
running Condor 6.8.3.
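As background to the question, here is a small Python sketch of the general socket behaviour involved. This is not Condor code, and it does not show what any Condor 6.8.3 daemon actually does. An outgoing TCP socket that is bound to a local address before connecting uses that address as its source. A socket that connects without binding has its source address chosen by the kernel from the routing table, which on a host with both eth0 and eth0:1 would typically be the primary address. The function names `connect_from` and `demo` are hypothetical, and the demo uses loopback so that it runs on any machine.

```python
import socket


def connect_from(local_ip, remote_addr, timeout=5.0):
    """Open a TCP connection whose source address is pinned to local_ip."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    # Port 0 lets the kernel pick an ephemeral port; only the address is pinned.
    sock.bind((local_ip, 0))
    sock.connect(remote_addr)
    return sock


def demo():
    """Connect over loopback and return the source IP the listener sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free port
    listener.listen(1)
    try:
        client = connect_from("127.0.0.1", listener.getsockname())
        try:
            conn, peer = listener.accept()
            conn.close()
            return peer[0]
        finally:
            client.close()
    finally:
        listener.close()


if __name__ == "__main__":
    print("listener saw source address:", demo())
```

On a submit host, replacing the loopback address with the eth0:1 address and the listener with a port on an execute node would show which source address the firewall sees for an explicitly bound socket.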
Thanks for any help,
Mark