
Re: [Condor-users] Setting up condor on ec2 machines



(inline)

On 08/25/2011 08:26 AM, tog wrote:
Hi

I am trying to set up a full installation on EC2; by full I mean that
all processes will be internal to the cloud (while keeping the
possibility of having the manager internal to my organization).
Therefore I decided to set the hostnames of all machines to their
public DNS names (which is not the default on EC2).

CM+Sched -(firewall)-> {dragons} -> {EC2: Startd} ?

That's a pretty popular configuration, it seems. The Cloud Foundations architecture documents include a walk-through of it, though as an Ubuntu user you may not have access to them.


I have 2 machines running Ubuntu Lucid with condor coming from the
distribution itself (7.2.4)

Master  ec2-75-101-194-52.compute-1.amazonaws.com   75.101.194.52   10.116.39.128
Worker  ec2-50-16-126-254.compute-1.amazonaws.com   50.16.126.254   10.112.45.248

The first problem is likely that 7.2.4 is painfully old, with many known bugs at this point. Newer versions include the shared_port daemon, which helps significantly with firewall configuration.

http://spinningmatt.wordpress.com/2011/06/21/getting-started-multiple-node-condor-pool-with-firewalls/
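For reference, enabling shared_port on a newer pool (7.5.3 or later) takes only a couple of knobs; a minimal sketch, not applicable to 7.2.4:

```
# Sketch: shared_port setup for 7.5.3+ (not available in 7.2.4)
USE_SHARED_PORT = TRUE
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
# All inbound daemon traffic multiplexes over this single TCP port,
# so the firewall only needs to admit 9618
SHARED_PORT_ARGS = -p 9618
```

With that, only one inbound port has to be opened per node instead of a whole LOWPORT/HIGHPORT range.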


I have the following error messages that I cannot explain:

MasterLog

8/25 11:17:36 DaemonCore: Command Socket at <75.101.194.52:40014>
8/25 11:17:36 Failed to listen(9618) on TCP command socket.
8/25 11:17:36 ERROR: Create_Process failed trying to start
/usr/sbin/condor_collector
8/25 11:17:36 restarting /usr/sbin/condor_collector in 10 seconds

Failed to listen on 9618; what is already listening on that port (check with netstat -tl)?


8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_startd",
pid and pgroup = 3258
8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_schedd",
pid and pgroup = 3259
8/25 11:17:36 Started DaemonCore process
"/usr/sbin/condor_negotiator", pid and pgroup = 3274
8/25 11:17:36 condor_write(): Socket closed when trying to write 973
bytes to <10.116.39.128:9618>, fd is 8
8/25 11:17:36 Buf::write(): condor_write() failed
8/25 11:17:36 Failed to send non-blocking update to <10.116.39.128:9618>.
8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
reason: DAEMON authorization policy contains no matching ALLOW entry for this request;
identifiers used for this host:
10.116.39.128,ip-10-116-39-128.ec2.internal
8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
reason: cached result for
  DAEMON; see first case for the full reason
8/25 11:17:41 PERMISSION DENIED to unauthenticated user from host
10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
reason: cached result for
  DAEMON; see first case for the full reason
8/25 11:17:46 Failed to listen(9618) on TCP command socket.
8/25 11:17:46 ERROR: Create_Process failed trying to start
/usr/sbin/condor_collector
8/25 11:17:46 restarting /usr/sbin/condor_collector in 120 seconds
8/25 11:17:46 condor_write(): Socket closed when trying to write 1057
bytes to unknown source, fd is 8, errno=104
8/25 11:17:46 Buf::write(): condor_write() failed

CHILDALIVE is failing because the node is not configured to talk to itself at the DAEMON access level. You'll need something like ALLOW_DAEMON = ..., $(IP_ADDRESS), $(FULL_HOSTNAME)

This isn't fatal, but it means that the master cannot watch for hung daemons.

http://spinningmatt.wordpress.com/2009/10/21/condor_master-for-managing-processes/
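A minimal sketch for each node's local config (adjust the list to your own security policy):

```
# Sketch: let this node's own daemons pass DAEMON-level checks,
# so condor_master can receive DC_CHILDALIVE from its children
ALLOW_DAEMON = $(ALLOW_DAEMON), $(IP_ADDRESS), $(FULL_HOSTNAME)
```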


SchedLog

8/25 11:18:06 (pid:1505) condor_write(): Socket closed when trying to
write 218 bytes to unknown source, fd is 12, errno=104
8/25 11:18:06 (pid:1505) Buf::write(): condor_write() failed
8/25 11:18:06 (pid:1505) All shadows are gone, exiting.
8/25 11:18:06 (pid:1505) **** condor_schedd (condor_SCHEDD) pid 1505
EXITING WITH STATUS 0

This could be a number of things. Best to leave it until you have the other issues resolved.


My configuration files are the following:

Master node

PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
TCP_FORWARDING_HOST=75.101.194.52
PRIVATE_NETWORK_INTERFACE=10.116.39.128
UPDATE_COLLECTOR_WITH_TCP=True
HOSTALLOW_WRITE=$(ALLOW_WRITE), '*.internal'
HOSTALLOW_READ=$(ALLOW_READ),'*.internal'
LOWPORT=40000
HIGHPORT=40050
COLLECTOR_SOCKET_CACHE_SIZE=1000

'*.internal' matches my internal hostnames.

Slave node

ubuntu@ec2-50-16-126-254:~$ sudo more /etc/condor_conf*
COLLECTOR_HOST = ec2-75-101-194-52.compute-1.amazonaws.com
PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
TCP_FORWARDING_HOST = 50.16.126.154
PRIVATE_NETWORK_INTERFACE = 10.112.45.248
COUNT_HYPERTHREAD_CPUS = False
DAEMON_LIST = MASTER, STARTD
UPDATE_COLLECTOR_WITH_TCP = True
#COUNT_HYPERTHREAD_CPUS = False
#UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
#MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
LOWPORT=40000
HIGHPORT=40050
DAEMON_LIST = MASTER, STARTD

Thanks
Guillaume

Along with getting away from Condor 7.2, you should get away from HOSTALLOW_* entirely, and in any case never mix HOSTALLOW_* and ALLOW_* settings.
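Translated to the newer knobs, the lines in your master config would look something like this (a sketch; keep your own patterns):

```
# Sketch: ALLOW_* replacements for the HOSTALLOW_* lines above;
# note the host patterns are not quoted
ALLOW_WRITE = $(ALLOW_WRITE), *.internal
ALLOW_READ = $(ALLOW_READ), *.internal
```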

7.2 also has bugs in its TCP_FORWARDING_HOST handling, some of which were discovered while creating the Cloud Foundations reference architecture.

Best,


matt