Mailing List Archives
Re: [Condor-users] Setting up condor on ec2 machines
- Date: Thu, 25 Aug 2011 08:56:08 -0400
- From: Matthew Farrellee <matt@xxxxxxxxxx>
- Subject: Re: [Condor-users] Setting up condor on ec2 machines
(inline)
On 08/25/2011 08:26 AM, tog wrote:
Hi
I am trying to set up a full installation on EC2. By full I mean that
all processes will be internal to the cloud (while I would like to
keep the possibility of having the manager internal to my
organization).
Therefore I decided to set the hostnames of all machines to their
public DNS names (which is not the default on EC2).
CM+Sched -(firewall)-> {dragons} -> {EC2: Startd} ?
That's a pretty popular configuration, it seems. The Cloud Foundations
architecture documents have a walk-through of that, but since you are
using Ubuntu you may not have access.
I have 2 machines running Ubuntu Lucid with condor coming from the
distribution itself (7.2.4)
Master  ec2-75-101-194-52.compute-1.amazonaws.com  75.101.194.52  10.116.39.128
Worker  ec2-50-16-126-254.compute-1.amazonaws.com  50.16.126.254  10.112.45.248
First problem is possibly that 7.2.4 is painfully old with many known
bugs at this point. Newer versions include the shared_port daemon, which
helps significantly with firewall configuration.
http://spinningmatt.wordpress.com/2011/06/21/getting-started-multiple-node-condor-pool-with-firewalls/
I have the following error messages that I cannot explain:
MasterLog
8/25 11:17:36 DaemonCore: Command Socket at <75.101.194.52:40014>
8/25 11:17:36 Failed to listen(9618) on TCP command socket.
8/25 11:17:36 ERROR: Create_Process failed trying to start
/usr/sbin/condor_collector
8/25 11:17:36 restarting /usr/sbin/condor_collector in 10 seconds
Failed to listen on 9618, what's on that port (netstat -tl)?
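If netstat is not handy, a rough way to confirm that something already holds the port is to try binding it yourself. The following is only a sketch: the helper name `port_is_free` is made up for illustration, and it only tells you whether the port is taken, not which process has it (netstat -tl is still the tool for that).

```python
import socket

def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port).

    Hypothetical helper, not part of Condor. A failed bind usually
    means another listener already owns the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True

if __name__ == "__main__":
    # 9618 is the port the MasterLog says the collector failed to listen on.
    print("9618 free:", port_is_free(9618))
```

Run it on the master node while condor is stopped; if it still reports the port as taken, some other process is listening there.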
8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_startd",
pid and pgroup = 3258
8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_schedd",
pid and pgroup = 3259
8/25 11:17:36 Started DaemonCore process
"/usr/sbin/condor_negotiator", pid and pgroup = 3274
8/25 11:17:36 condor_write(): Socket closed when trying to write 973
bytes to <10.116.39.128:9618>, fd is 8
8/25 11:17:36 Buf::write(): condor_write() failed
8/25 11:17:36 Failed to send non-blocking update to <10.116.39.128:9618>.
8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
reason: DAEMON authorization policy contains no matching ALLOW entry
for this request; identifiers used for this host:
10.116.39.128,ip-10-116-39-128.ec2.internal
8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
reason: cached result for
DAEMON; see first case for the full reason
8/25 11:17:41 PERMISSION DENIED to unauthenticated user from host
10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
reason: cached result for
DAEMON; see first case for the full reason
8/25 11:17:46 Failed to listen(9618) on TCP command socket.
8/25 11:17:46 ERROR: Create_Process failed trying to start
/usr/sbin/condor_collector
8/25 11:17:46 restarting /usr/sbin/condor_collector in 120 seconds
8/25 11:17:46 condor_write(): Socket closed when trying to write 1057
bytes to unknown source, fd is 8, errno=104
8/25 11:17:46 Buf::write(): condor_write() failed
CHILDALIVE is failing because the node is not configured to talk to
itself at the DAEMON access level. You'll need something like
ALLOW_DAEMON = ..., $(IP_ADDRESS), $(FULL_HOSTNAME)
This isn't fatal, but it means that the master cannot watch for hung
daemons.
http://spinningmatt.wordpress.com/2009/10/21/condor_master-for-managing-processes/
SchedLog
8/25 11:18:06 (pid:1505) condor_write(): Socket closed when trying to
write 218 bytes to unknown source, fd is 12, errno=104
8/25 11:18:06 (pid:1505) Buf::write(): condor_write() failed
8/25 11:18:06 (pid:1505) All shadows are gone, exiting.
8/25 11:18:06 (pid:1505) **** condor_schedd (condor_SCHEDD) pid 1505
EXITING WITH STATUS 0
This could be a number of things. Best to leave it until you have the
other issues resolved.
My configuration files are the following:
Master node
PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
TCP_FORWARDING_HOST=75.101.194.52
PRIVATE_NETWORK_INTERFACE=10.116.39.128
UPDATE_COLLECTOR_WITH_TCP=True
HOSTALLOW_WRITE=$(ALLOW_WRITE), '*.internal'
HOSTALLOW_READ=$(ALLOW_READ),'*.internal'
LOWPORT=40000
HIGHPORT=40050
COLLECTOR_SOCKET_CACHE_SIZE=1000
'*.internal' is matching my internal hostnames
Slave node
ubuntu@ec2-50-16-126-254:~$ sudo more /etc/condor_conf*
COLLECTOR_HOST = ec2-75-101-194-52.compute-1.amazonaws.com
PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
TCP_FORWARDING_HOST = 50.16.126.154
PRIVATE_NETWORK_INTERFACE = 10.112.45.248
COUNT_HYPERTHREAD_CPUS = False
DAEMON_LIST = MASTER, STARTD
UPDATE_COLLECTOR_WITH_TCP = True
#COUNT_HYPERTHREAD_CPUS = False
#UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
#MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
LOWPORT=40000
HIGHPORT=40050
DAEMON_LIST = MASTER, STARTD
Thanks
Guillaume
Along with getting away from condor 7.2, you should get away from
HOSTALLOW_* and always avoid mixing HOSTALLOW_* and ALLOW_*.
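A quick way to spot mixing is to scan the config text for both name families. The sketch below is illustrative only: `find_allow_names` is a made-up helper, it does not use any Condor tooling, and counting a `$(ALLOW_...)` reference on the right-hand side as a use of the ALLOW_* family is an assumption of this sketch.

```python
import re

# Matches HOSTALLOW_* or ALLOW_* names, whether defined or referenced.
NAME = re.compile(r"\b(HOSTALLOW_\w+|ALLOW_\w+)", re.IGNORECASE)

def find_allow_names(config_text):
    """Return (hostallow_names, allow_names) found on uncommented lines.

    Hypothetical helper for illustration; names are upper-cased.
    """
    old_style, new_style = set(), set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip commented-out settings
        for name in NAME.findall(line):
            name = name.upper()
            if name.startswith("HOSTALLOW_"):
                old_style.add(name)
            else:
                new_style.add(name)
    return old_style, new_style

# The two access-control lines from the master node config above.
master = """\
HOSTALLOW_WRITE=$(ALLOW_WRITE), '*.internal'
HOSTALLOW_READ=$(ALLOW_READ),'*.internal'
"""

old_style, new_style = find_allow_names(master)
if old_style and new_style:
    print("mixed:", sorted(old_style), "and", sorted(new_style))
```

If both sets come back non-empty, the file uses both families and is worth rewriting to ALLOW_* only.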
7.2 has bugs in its TCP_FORWARDING_HOST handling, some discovered while
creating the Cloud Foundations reference architecture.
Best,
matt