Re: [Condor-users] Setting up condor on ec2 machines
- Date: Thu, 25 Aug 2011 08:50:54 -0500
- From: "Timothy St. Clair" <tstclair@xxxxxxxxxx>
- Subject: Re: [Condor-users] Setting up condor on ec2 machines
Just as a side note, I'm in the process of creating a full outline of
how to configure and set up Fedora to allow spill-over to the cloud, using
the JobRouter.
I'll ping this thread when I'm done and have something functioning.
I'll also make the AMIs publicly available.
Cheers,
Tim
On Thu, 2011-08-25 at 19:14 +0530, tog wrote:
> Hi Matt
>
> Thanks for the quick answer ...
> From what I read, I should move to the latest 7.6.3 ;-)
>
> Anyway, I believe that as long as I don't move one of the roles outside of
> EC2, I should not face firewall problems. I think all ports are open
> inside a pool of machines belonging to the same EC2 reservation.
>
> Here is the port info:
>
> ubuntu@ec2-75-101-194-52:~$ sudo netstat -tlnp
> Active Internet connections (only servers)
> Proto Recv-Q Send-Q Local Address    Foreign Address  State   PID/Program name
> tcp   0      0      0.0.0.0:22       0.0.0.0:*        LISTEN  472/sshd
> tcp   0      0      0.0.0.0:40002    0.0.0.0:*        LISTEN  3274/condor_negotia
> tcp   0      0      0.0.0.0:40034    0.0.0.0:*        LISTEN  3258/condor_startd
> tcp   0      0      0.0.0.0:40009    0.0.0.0:*        LISTEN  3259/condor_schedd
> tcp   0      0      0.0.0.0:40014    0.0.0.0:*        LISTEN  3257/condor_master
> tcp   0      0      0.0.0.0:9618     0.0.0.0:*        LISTEN  3292/condor_collect
> tcp   0      0      0.0.0.0:40019    0.0.0.0:*        LISTEN  3259/condor_schedd
> tcp6  0      0      :::22            :::*             LISTEN  472/sshd
> tcp6  0      0      :::1527          :::*             LISTEN  3024/java
>
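For anyone who wants to pull the port owners out of a listing like the one above with a script, here is a minimal Python sketch. It assumes output in the `netstat -tlnp` layout shown in this thread, possibly wrapped and still carrying mail quote markers. The names `parse_listeners` and `programs_on` are made up for illustration, and note that netstat truncates program names (for example `condor_collect`).

```python
import re

# One LISTEN row: protocol, two queue counters, local addr:port,
# foreign address, state, then "pid/program".
_ROW = re.compile(
    r"\btcp6?\s+\d+\s+\d+\s+\S*:(\d+)\s+\S+\s+LISTEN\s+(\d+)/(\S+)"
)

# A few rows copied from the listing in this thread, wrapped as archived.
SAMPLE = """\
> tcp 0 0 0.0.0.0:22 0.0.0.0:*
> LISTEN 472/sshd
> tcp 0 0 0.0.0.0:40014 0.0.0.0:*
> LISTEN 3257/condor_master
> tcp 0 0 0.0.0.0:9618 0.0.0.0:*
> LISTEN 3292/condor_collect
> tcp6 0 0 :::1527 :::*
> LISTEN 3024/java
"""


def parse_listeners(netstat_output):
    """Return a list of (port, pid, program) tuples for every LISTEN row."""
    # Drop mail quote markers, then join wrapped rows into one token stream.
    lines = (re.sub(r"^(?:>\s?)+", "", line)
             for line in netstat_output.splitlines())
    text = " ".join(" ".join(lines).split())
    return [(int(port), int(pid), prog)
            for port, pid, prog in _ROW.findall(text)]


def programs_on(port, rows):
    """Return the program names listening on the given port."""
    return [prog for p, _pid, prog in rows if p == port]


if __name__ == "__main__":
    rows = parse_listeners(SAMPLE)
    print(programs_on(9618, rows))
```

The same rows can be filtered against the LOWPORT/HIGHPORT range (40000 to 40050 in this thread) to confirm that the daemons stay inside it.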
> I will keep you updated once I have 7.6.3 installed ... I might be
> able to use condor_configure.
>
> Best Regards
> Guillaume
>
>
>
> On Thu, Aug 25, 2011 at 6:26 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
> > (inline)
> >
> > On 08/25/2011 08:26 AM, tog wrote:
> >>
> >> Hi
> >>
> >> I am trying to set up a full installation on EC2. By full I mean that
> >> all processes will be internal to the cloud (while I would like to
> >> keep the possibility of having the manager internal to my
> >> organization).
> >> Therefore I decided to set the hostnames of all machines to their
> >> public DNS names (which is not the default on EC2).
> >
> > CM+Sched -(firewall)-> {dragons} -> {EC2: Startd} ?
> >
> > That's a pretty popular configuration it seems. The Cloud Foundations
> > architecture documents have a walk-through of that, but using Ubuntu you may
> > not have access.
> >
> >
> >> I have 2 machines running Ubuntu Lucid with condor coming from the
> >> distribution itself (7.2.4)
> >>
> >> Master ec2-75-101-194-52.compute-1.amazonaws.com 75.101.194.52
> >> 10.116.39.128
> >> Worker ec2-50-16-126-254.compute-1.amazonaws.com 50.16.126.254
> >> 10.112.45.248
> >
> > First problem is possibly that 7.2.4 is painfully old with many known bugs
> > at this point. Newer versions include the shared_port daemon, which helps
> > significantly with firewall configuration.
> >
> > http://spinningmatt.wordpress.com/2011/06/21/getting-started-multiple-node-condor-pool-with-firewalls/
> >
> >
> >> I have the following error messages that I cannot explain:
> >>
> >> MasterLog
> >>
> >> 8/25 11:17:36 DaemonCore: Command Socket at<75.101.194.52:40014>
> >> 8/25 11:17:36 Failed to listen(9618) on TCP command socket.
> >> 8/25 11:17:36 ERROR: Create_Process failed trying to start
> >> /usr/sbin/condor_collector
> >> 8/25 11:17:36 restarting /usr/sbin/condor_collector in 10 seconds
> >
> > Failed to listen on 9618, what's on that port (netstat -tl)?
> >
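If netstat is not handy, a rough equivalent of the "what's on that port" check is to try binding the port yourself. The following is only a sketch, not anything Condor ships: `port_is_free` is a hypothetical helper, and it should be run while Condor is stopped, since a running collector will legitimately hold 9618.

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                # Something else already holds the port.
                return False
            # Any other failure (e.g. permissions) is a different problem.
            raise
        return True


if __name__ == "__main__":
    # 9618 is the collector port from the MasterLog in this thread.
    print("9618 free:", port_is_free(9618))
```

A False result only says that something holds the port; netstat with -p is still needed to see which process it is.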
> >
> >> 8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_startd",
> >> pid and pgroup = 3258
> >> 8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_schedd",
> >> pid and pgroup = 3259
> >> 8/25 11:17:36 Started DaemonCore process
> >> "/usr/sbin/condor_negotiator", pid and pgroup = 3274
> >> 8/25 11:17:36 condor_write(): Socket closed when trying to write 973
> >> bytes to<10.116.39.128:9618>, fd is 8
> >> 8/25 11:17:36 Buf::write(): condor_write() failed
> >> 8/25 11:17:36 Failed to send non-blocking update to<10.116.39.128:9618>.
> >> 8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
> >> 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
> >> reason: DAEMON authorization policy contains no matching ALLOW entry
> >> for this request; identifiers used for this host:
> >> 10.116.39.128,ip-10-116-39-128.ec2.internal
> >> 8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
> >> 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
> >> reason: cached result for
> >> DAEMON; see first case for the full reason
> >> 8/25 11:17:41 PERMISSION DENIED to unauthenticated user from host
> >> 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
> >> reason: cached result for
> >> DAEMON; see first case for the full reason
> >> 8/25 11:17:46 Failed to listen(9618) on TCP command socket.
> >> 8/25 11:17:46 ERROR: Create_Process failed trying to start
> >> /usr/sbin/condor_collector
> >> 8/25 11:17:46 restarting /usr/sbin/condor_collector in 120 seconds
> >> 8/25 11:17:46 condor_write(): Socket closed when trying to write 1057
> >> bytes to unknown source, fd is 8, errno=104
> >> 8/25 11:17:46 Buf::write(): condor_write() failed
> >
> > CHILDALIVE is failing because the node is not configured to talk to itself
> > at the DAEMON access level. You'll need something like ALLOW_DAEMON = ...,
> > $(IP_ADDRESS), $(FULL_HOSTNAME)
> >
> > This isn't fatal, but it means that the master cannot watch for hung
> > daemons.
> >
> > http://spinningmatt.wordpress.com/2009/10/21/condor_master-for-managing-processes/
> >
> >
> >> SchedLog
> >>
> >> 8/25 11:18:06 (pid:1505) condor_write(): Socket closed when trying to
> >> write 218 bytes to unknown source, fd is 12, errno=104
> >> 8/25 11:18:06 (pid:1505) Buf::write(): condor_write() failed
> >> 8/25 11:18:06 (pid:1505) All shadows are gone, exiting.
> >> 8/25 11:18:06 (pid:1505) **** condor_schedd (condor_SCHEDD) pid 1505
> >> EXITING WITH STATUS 0
> >
> > This could be a number of things. Best to leave it until you have the other
> > issues resolved.
> >
> >
> >> My configuration files are as follows:
> >>
> >> Master node
> >>
> >> PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
> >> TCP_FORWARDING_HOST=75.101.194.52
> >> PRIVATE_NETWORK_INTERFACE=10.116.39.128
> >> UPDATE_COLLECTOR_WITH_TCP=True
> >> HOSTALLOW_WRITE=$(ALLOW_WRITE), '*.internal'
> >> HOSTALLOW_READ=$(ALLOW_READ),'*.internal'
> >> LOWPORT=40000
> >> HIGHPORT=40050
> >> COLLECTOR_SOCKET_CACHE_SIZE=1000
> >>
> >> '*.internal' matches my internal hostnames
> >>
> >> Slave node
> >>
> >> ubuntu@ec2-50-16-126-254:~$ sudo more /etc/condor_conf*
> >> COLLECTOR_HOST = ec2-75-101-194-52.compute-1.amazonaws.com
> >> PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
> >> TCP_FORWARDING_HOST = 50.16.126.154
> >> PRIVATE_NETWORK_INTERFACE = 10.112.45.248
> >> COUNT_HYPERTHREAD_CPUS = False
> >> DAEMON_LIST = MASTER, STARTD
> >> UPDATE_COLLECTOR_WITH_TCP = True
> >> #COUNT_HYPERTHREAD_CPUS = False
> >> #UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
> >> #MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
> >> LOWPORT=40000
> >> HIGHPORT=40050
> >> DAEMON_LIST = MASTER, STARTD
> >>
> >> Thanks
> >> Guillaume
> >
> > Along with getting away from condor 7.2, you should get away from
> > HOSTALLOW_* and always avoid mixing HOSTALLOW_* and ALLOW_*.
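To spot mixed styles in an existing config fragment, something like the Python sketch below may help. It is an illustration only: `access_knob_styles` and `mixes_styles` are hypothetical names, the check looks only at knobs defined in the text it is given (not at `$(ALLOW_WRITE)` style references or at other config files), and it does not reproduce how Condor itself reads its configuration.

```python
import re

# A definition of an authorization knob at the start of a line,
# e.g. "HOSTALLOW_WRITE=..." or "ALLOW_DAEMON = ...".
_KNOB = re.compile(r"^\s*((?:HOST)?ALLOW_\w+)\s*=", re.IGNORECASE)

# Authorization lines from the master node config quoted in this thread.
MASTER_CONFIG = """\
HOSTALLOW_WRITE=$(ALLOW_WRITE), '*.internal'
HOSTALLOW_READ=$(ALLOW_READ),'*.internal'
LOWPORT=40000
HIGHPORT=40050
"""


def access_knob_styles(config_text):
    """Return (old_style, new_style) sets of knob names defined in the text."""
    old, new = set(), set()
    for line in config_text.splitlines():
        match = _KNOB.match(line)
        if not match:
            continue
        name = match.group(1).upper()
        (old if name.startswith("HOST") else new).add(name)
    return old, new


def mixes_styles(config_text):
    """True if both HOSTALLOW_* and ALLOW_* knobs are defined in the text."""
    old, new = access_knob_styles(config_text)
    return bool(old) and bool(new)


if __name__ == "__main__":
    print(access_knob_styles(MASTER_CONFIG))
    print(mixes_styles(MASTER_CONFIG + "ALLOW_DAEMON = $(FULL_HOSTNAME)\n"))
```

On the master fragment alone this reports only old-style knobs. Adding an ALLOW_DAEMON line without converting the HOSTALLOW_* entries is exactly the mixed case to avoid.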
> >
> > 7.2 has bugs in its TCP_FORWARDING_HOST handling, some discovered while
> > creating the Cloud Foundations reference architecture.
> >
> > Best,
> >
> >
> > matt
> >