Mailing List Archives, UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Setting up condor on ec2 machines

Date: Thu, 25 Aug 2011 08:56:08 -0400
From: Matthew Farrellee <matt@xxxxxxxxxx>
Subject: Re: [Condor-users] Setting up condor on ec2 machines


On 08/25/2011 08:26 AM, tog wrote:

Hi

I am trying to set up a full installation on EC2. By full I mean that
all processes will be internal to the cloud (while I would like to
keep the possibility of having the manager inside my organization).
Therefore I decided to set the hostnames of all machines to their
public DNS names (which is not the default on EC2).


CM+Sched -(firewall)-> {dragons} -> {EC2: Startd} ?

That's a pretty popular configuration it seems. The Cloud Foundations architecture documents have a walk-through of that, but using Ubuntu you may not have access.

I have 2 machines running Ubuntu Lucid with condor coming from the
distribution itself (7.2.4)

Master  ec2-75-101-194-52.compute-1.amazonaws.com      75.101.194.52
10.116.39.128
Worker  ec2-50-16-126-254.compute-1.amazonaws.com      50.16.126.254
10.112.45.248

First problem is possibly that 7.2.4 is painfully old, with many known bugs at this point. Newer versions include the shared_port daemon, which helps significantly with firewall configuration.


http://spinningmatt.wordpress.com/2011/06/21/getting-started-multiple-node-condor-pool-with-firewalls/

I have the following error messages that I cannot explain:

MasterLog

8/25 11:17:36 DaemonCore: Command Socket at<75.101.194.52:40014>
8/25 11:17:36 Failed to listen(9618) on TCP command socket.
8/25 11:17:36 ERROR: Create_Process failed trying to start
/usr/sbin/condor_collector
8/25 11:17:36 restarting /usr/sbin/condor_collector in 10 seconds


Failed to listen on 9618, what's on that port (netstat -tl)?
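If netstat is not handy, the same question can be asked from a script. The following is only a minimal sketch, assuming Python is available on the node; the helper name can_listen is made up for illustration and is not part of Condor. It simply tries to bind and listen on the port and reports whether that worked, which is roughly what the collector fails to do in the log above.

```python
import socket


def can_listen(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind and listen on (host, port).

    Hypothetical helper for illustration. A False result means the bind or
    listen call raised OSError, most commonly because another process
    already holds the port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return True
    except OSError:
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    port = 9618  # the collector port from the MasterLog above
    if can_listen(port):
        print(f"port {port} looks free")
    else:
        print(f"port {port} is unavailable; check what is holding it")
```

A False result only says the port could not be bound; netstat (or a similar tool) is still needed to see which process owns it.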

8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_startd",
pid and pgroup = 3258
8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_schedd",
pid and pgroup = 3259
8/25 11:17:36 Started DaemonCore process
"/usr/sbin/condor_negotiator", pid and pgroup = 3274
8/25 11:17:36 condor_write(): Socket closed when trying to write 973
bytes to<10.116.39.128:9618>, fd is 8
8/25 11:17:36 Buf::write(): condor_write() failed
8/25 11:17:36 Failed to send non-blocking update to<10.116.39.128:9618>.
8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
reason: DAEMON authorization policy contains no matching ALLOW entry
for this request; identifiers used for this host:
10.116.39.128,ip-10-116-39-128.ec2.internal
8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
reason: cached result for
  DAEMON; see first case for the full reason
8/25 11:17:41 PERMISSION DENIED to unauthenticated user from host
10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
reason: cached result for
  DAEMON; see first case for the full reason
8/25 11:17:46 Failed to listen(9618) on TCP command socket.
8/25 11:17:46 ERROR: Create_Process failed trying to start
/usr/sbin/condor_collector
8/25 11:17:46 restarting /usr/sbin/condor_collector in 120 seconds
8/25 11:17:46 condor_write(): Socket closed when trying to write 1057
bytes to unknown source, fd is 8, errno=104
8/25 11:17:46 Buf::write(): condor_write() failed

CHILDALIVE is failing because the node is not configured to talk to itself at the DAEMON access level. You'll need something like ALLOW_DAEMON = ..., $(IP_ADDRESS), $(FULL_HOSTNAME)

This isn't fatal, but it means that the master cannot watch for hung daemons.


http://spinningmatt.wordpress.com/2009/10/21/condor_master-for-managing-processes/
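To see which identifiers an allow list would have to cover, the PERMISSION DENIED lines can be picked apart mechanically. This is a rough sketch, assuming the line format shown in the MasterLog above (with the wrapped line joined back together); the function name parse_denial and the regular expressions are illustrative and not anything Condor ships.

```python
import re

# Pattern based only on the PERMISSION DENIED lines quoted in this thread.
DENIED_RE = re.compile(
    r"PERMISSION DENIED to (?P<user>.+?) from host (?P<host>\S+) "
    r"for command (?P<command_id>\d+) \((?P<command>\w+)\), "
    r"access level (?P<level>\w+): reason: (?P<reason>.*)"
)
IDENTIFIERS_RE = re.compile(r"identifiers used for this host:\s*(\S+)")

# The first denial from the MasterLog, joined onto one line.
SAMPLE = (
    "8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host "
    "10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON: "
    "reason: DAEMON authorization policy contains no matching ALLOW entry "
    "for this request; identifiers used for this host: "
    "10.116.39.128,ip-10-116-39-128.ec2.internal"
)


def parse_denial(line):
    """Return a dict describing a denial line, or None if it does not match."""
    match = DENIED_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    ids = IDENTIFIERS_RE.search(info["reason"])
    # "cached result" lines carry no identifier list, so this may be empty.
    info["identifiers"] = ids.group(1).split(",") if ids else []
    return info


if __name__ == "__main__":
    denial = parse_denial(SAMPLE)
    print(denial["level"], denial["command"], denial["identifiers"])
```

For the sample line, the identifiers are the private IP and the internal hostname, which are the names the DAEMON-level allow list needs to match.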

SchedLog

8/25 11:18:06 (pid:1505) condor_write(): Socket closed when trying to
write 218 bytes to unknown source, fd is 12, errno=104
8/25 11:18:06 (pid:1505) Buf::write(): condor_write() failed
8/25 11:18:06 (pid:1505) All shadows are gone, exiting.
8/25 11:18:06 (pid:1505) **** condor_schedd (condor_SCHEDD) pid 1505
EXITING WITH STATUS 0

This could be a number of things. Best to leave it until you have the other issues resolved.

My configuration files are the following:

Master node

PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
TCP_FORWARDING_HOST=75.101.194.52
PRIVATE_NETWORK_INTERFACE=10.116.39.128
UPDATE_COLLECTOR_WITH_TCP=True
HOSTALLOW_WRITE=$(ALLOW_WRITE), '*.internal'
HOSTALLOW_READ=$(ALLOW_READ),'*.internal'
LOWPORT=40000
HIGHPORT=40050
COLLECTOR_SOCKET_CACHE_SIZE=1000

'*.internal' is matching my internal hostnames

Slave node

ubuntu@ec2-50-16-126-254:~$ sudo more /etc/condor_conf*
COLLECTOR_HOST = ec2-75-101-194-52.compute-1.amazonaws.com
PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
TCP_FORWARDING_HOST = 50.16.126.154
PRIVATE_NETWORK_INTERFACE = 10.112.45.248
COUNT_HYPERTHREAD_CPUS = False
DAEMON_LIST = MASTER, STARTD
UPDATE_COLLECTOR_WITH_TCP = True
#COUNT_HYPERTHREAD_CPUS = False
#UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
#MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
LOWPORT=40000
HIGHPORT=40050
DAEMON_LIST = MASTER, STARTD

Thanks
Guillaume

Along with getting away from condor 7.2, you should get away from HOSTALLOW_* and always avoid mixing HOSTALLOW_* and ALLOW_*.
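A quick way to spot that kind of mixing is to scan a config file for both families of names. This is a small sketch, assuming plain NAME=value lines like the ones posted above; the function name find_allow_families is hypothetical, and the scan looks at names only, without evaluating any macros.

```python
import re

# \b keeps ALLOW_RE from matching the tail of a HOSTALLOW_* name.
HOSTALLOW_RE = re.compile(r"\bHOSTALLOW_[A-Z_]+\b")
ALLOW_RE = re.compile(r"\bALLOW_[A-Z_]+\b")

# Lines taken from the master node config posted in this thread.
MASTER_CONFIG = """\
HOSTALLOW_WRITE=$(ALLOW_WRITE), '*.internal'
HOSTALLOW_READ=$(ALLOW_READ),'*.internal'
LOWPORT=40000
HIGHPORT=40050
"""


def find_allow_families(config_text):
    """Return (hostallow_names, allow_names) mentioned in non-comment lines."""
    hostallow, allow = set(), set()
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        hostallow.update(HOSTALLOW_RE.findall(line))
        allow.update(ALLOW_RE.findall(line))
    return sorted(hostallow), sorted(allow)


if __name__ == "__main__":
    hostallow, allow = find_allow_families(MASTER_CONFIG)
    if hostallow and allow:
        print("mixed settings:", hostallow, "and", allow)
```

On the master config above, both lists come back non-empty, because the HOSTALLOW_* lines pull in $(ALLOW_WRITE) and $(ALLOW_READ).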

7.2 has bugs in TCP_FORWARDING_HOST, some discovered while creating the Cloud Foundations reference architecture.


Best,


matt
