
Re: [Condor-users] Multiple Network Interface cards and central manager not communicating with execute machine.




Hi Charles,
 
You can try adding the following line to the Condor configuration file to see if it works:
NETWORK_INTERFACE = <IP address of the network interface that is connected>
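It may also help to check, outside of Condor, whether a plain TCP connection from the execute machine to the collector port succeeds, since ping and ssh do not exercise that port. Below is a minimal sketch in Python; the function name probe_tcp and the 5-second timeout are only illustrative choices, not part of Condor.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt a single TCP connection to host:port.

    Returns (True, "connected") on success, or (False, <error text>)
    if the connection fails or times out. Hypothetical helper for
    illustration; it does not talk the Condor protocol.
    """
    try:
        # create_connection resolves the host and opens a TCP socket;
        # the "with" block closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False, str(exc)
```

For example, calling probe_tcp("144.167.99.210", 9618) on the execute machine tries the same address and port that appear in the logs below. If this fails in the same way, the problem is probably at the network level rather than in the Condor configuration.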
 
Good luck!
 
-Hailong
 
2009-11-20

***********************************************
* Hailong Yang, PhD. Candidate
* Sino-German Joint Software Institute,
* School of Computer Science&Engineering, Beihang University
* Phone: (86-010)82315908
* Email: hailong.yang1115@xxxxxxxxx
* Address: G413, New Main Building in Beihang University,
*              No.37 XueYuan Road,HaiDian District,
*              Beijing,P.R.China,100191
***********************************************

From: Charles Embry
Sent: 2009-11-20  05:29:53
To: condor-users
Cc:
Subject: [Condor-users] Multiple Network Interface cards and central manager not communicating with execute machine.
The Condor pool that I am trying to set up is on the same server rack/router, and the machines can ping and ssh each other. But in Condor they don't seem to be communicating: condor_status never shows the execute machine that I am trying to add to the central manager (which is also a submit and execute machine). The machines are all Sun Microsystems Sun Fire servers. They all have 4 NICs (network interface cards). We are only using one (we have no need to use all of them at this time), and the other three on each machine are not hooked up to anything.

On the execute machine I get these errors in the log files:

Master log__________

11/16 17:07:18 DaemonCore: Command Socket at <144.167.99.201:49652>
11/16 17:07:18 Started DaemonCore process "/root/Desktop/condor-7.2.4/sbin/condor_startd", pid and pgroup = 27436
11/16 17:07:23 attempt to connect to <144.167.99.210:9618> failed: No route to host (connect errno = 113).  Will keep trying for 20 total seconds (20 to go).

11/16 17:07:44 attempt to connect to <144.167.99.210:9618> failed: No route to host (connect errno = 113).

StartLog__________
11/19 15:48:58 slot1: State change: IS_OWNER is false
11/19 15:48:58 slot1: Changing state: Owner -> Unclaimed
11/19 15:49:23 attempt to connect to <144.167.99.210:9618> failed: No route to host (connect errno = 113).
11/19 15:49:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <144.167.99.210:9618>, but it failed.
11/19 15:49:23 Failed to start non-blocking update to <144.167.99.210:9618>.
11/19 15:49:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <144.167.99.210:9618>, but it failed.
11/19 15:49:23 Failed to start non-blocking update to <144.167.99.210:9618>.
11/19 15:49:23 ERROR: SECMAN:2004:Was waiting for TCP auth session to <144.167.99.210:9618>, but it failed.
11/19 15:49:23 Failed to start non-blocking update to <144.167.99.210:9618>.
11/19 15:49:23 ERROR: SECMAN:2004:Failed to create security session to <144.167.99.210:9618> with TCP.|SECMAN:2003:TCP connection to <144.167.99.210:9618> failed.

The condor_collector daemon is using port 9618 on the central manager, and that is the port on the central manager that the execute machine is trying to connect to. Why do the machines not connect in Condor (No route to host??) when they can ping and ssh each other? Do I need to set something to make Condor use the only network interface that is connected? Or is it the port that is being used by the collector on the central manager?


Thanks for the much needed help.