
Re: [Condor-users] Setting up condor on ec2 machines



Hi Matt

Thanks for the quick answer ...
From what I read, I should move to the latest 7.6.3 ;-)

Anyway, I believe that as long as I don't move one of the roles outside
of EC2, I should not face firewall problems. I think all ports are open
inside a pool of machines belonging to the same EC2 reservation.
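
The "all ports are open" assumption can be checked directly instead of taken on trust. Below is a minimal Python sketch, not part of Condor; the helper name `port_is_open` and the 3-second timeout are choices made for this illustration.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run on the worker, `port_is_open("10.116.39.128", 9618)` would test the master's private address and the collector port quoted in this thread. A False result only says the connection failed; it does not distinguish a firewall from a daemon that is not listening.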

Here is the port info:

ubuntu@ec2-75-101-194-52:~$ sudo netstat -tlnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      472/sshd
tcp        0      0 0.0.0.0:40002           0.0.0.0:*               LISTEN      3274/condor_negotia
tcp        0      0 0.0.0.0:40034           0.0.0.0:*               LISTEN      3258/condor_startd
tcp        0      0 0.0.0.0:40009           0.0.0.0:*               LISTEN      3259/condor_schedd
tcp        0      0 0.0.0.0:40014           0.0.0.0:*               LISTEN      3257/condor_master
tcp        0      0 0.0.0.0:9618            0.0.0.0:*               LISTEN      3292/condor_collect
tcp        0      0 0.0.0.0:40019           0.0.0.0:*               LISTEN      3259/condor_schedd
tcp6       0      0 :::22                   :::*                    LISTEN      472/sshd
tcp6       0      0 :::1527                 :::*                    LISTEN      3024/java
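
To answer "what's on that port" without reading the listing by eye, the output can be filtered with a short script. This is a minimal Python sketch, assuming the text comes from `netstat -tlnp` with one socket per line; the function name `listeners_on_port` and the `SAMPLE` excerpt (rows taken from the listing above) are for illustration only.

```python
def listeners_on_port(netstat_output, port):
    """Return (local_address, program) pairs for TCP listeners on `port`.

    Expects the text printed by `netstat -tlnp`, one socket per line.
    """
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        local_address, state, program = fields[3], fields[5], fields[6]
        # The port is whatever follows the last colon (works for tcp and tcp6).
        if state == "LISTEN" and local_address.rsplit(":", 1)[-1] == str(port):
            matches.append((local_address, program))
    return matches


SAMPLE = """\
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      472/sshd
tcp        0      0 0.0.0.0:40014           0.0.0.0:*               LISTEN      3257/condor_master
tcp        0      0 0.0.0.0:9618            0.0.0.0:*               LISTEN      3292/condor_collect
tcp6       0      0 :::22                   :::*                    LISTEN      472/sshd
"""

print(listeners_on_port(SAMPLE, 9618))
```

For the rows above, the only listener on 9618 is the collector itself (pid 3292).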

I will keep you updated once I have 7.6.3 installed ... I might be
able to use condor_configure.

Best Regards
Guillaume



On Thu, Aug 25, 2011 at 6:26 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
> (inline)
>
> On 08/25/2011 08:26 AM, tog wrote:
>>
>> Hi
>>
>> I am trying to set up a full installation on EC2; by full I mean that
>> all processes will be internal to the cloud (while I would like to
>> keep the possibility of having the manager internal to my
>> organization).
>> Therefore I decided to set the hostnames of all machines to their
>> public DNS names (which is not the default on EC2)
>
> CM+Sched -(firewall)-> {dragons} -> {EC2: Startd} ?
>
> That's a pretty popular configuration it seems. The Cloud Foundations
> architecture documents have a walk-through of that, but using Ubuntu you may
> not have access.
>
>
>> I have 2 machines running Ubuntu Lucid with condor coming from the
>> distribution itself (7.2.4)
>>
>> Master  ec2-75-101-194-52.compute-1.amazonaws.com      75.101.194.52
>> 10.116.39.128
>> Worker  ec2-50-16-126-254.compute-1.amazonaws.com      50.16.126.254
>> 10.112.45.248
>
> First problem is possibly that 7.2.4 is painfully old with many known bugs
> at this point. Newer versions include the shared_port daemon, which helps
> significantly with firewall configuration.
>
> http://spinningmatt.wordpress.com/2011/06/21/getting-started-multiple-node-condor-pool-with-firewalls/
>
>
>> I have the following error messages that I cannot explain:
>>
>> MasterLog
>>
>> 8/25 11:17:36 DaemonCore: Command Socket at<75.101.194.52:40014>
>> 8/25 11:17:36 Failed to listen(9618) on TCP command socket.
>> 8/25 11:17:36 ERROR: Create_Process failed trying to start
>> /usr/sbin/condor_collector
>> 8/25 11:17:36 restarting /usr/sbin/condor_collector in 10 seconds
>
> Failed to listen on 9618, what's on that port (netstat -tl)?
>
>
>> 8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_startd",
>> pid and pgroup = 3258
>> 8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_schedd",
>> pid and pgroup = 3259
>> 8/25 11:17:36 Started DaemonCore process
>> "/usr/sbin/condor_negotiator", pid and pgroup = 3274
>> 8/25 11:17:36 condor_write(): Socket closed when trying to write 973
>> bytes to<10.116.39.128:9618>, fd is 8
>> 8/25 11:17:36 Buf::write(): condor_write() failed
>> 8/25 11:17:36 Failed to send non-blocking update to<10.116.39.128:9618>.
>> 8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
>> 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
>> reason: DAEMON authorization policy contains no matching ALLOW entry
>> for this request; identifiers used for this host:
>> 10.116.39.128,ip-10-116-39-128.ec2.internal
>> 8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
>> 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
>> reason: cached result for DAEMON; see first case for the full reason
>> 8/25 11:17:41 PERMISSION DENIED to unauthenticated user from host
>> 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
>> reason: cached result for DAEMON; see first case for the full reason
>> 8/25 11:17:46 Failed to listen(9618) on TCP command socket.
>> 8/25 11:17:46 ERROR: Create_Process failed trying to start
>> /usr/sbin/condor_collector
>> 8/25 11:17:46 restarting /usr/sbin/condor_collector in 120 seconds
>> 8/25 11:17:46 condor_write(): Socket closed when trying to write 1057
>> bytes to unknown source, fd is 8, errno=104
>> 8/25 11:17:46 Buf::write(): condor_write() failed
>
> CHILDALIVE is failing because the node is not configured to talk to itself
> at the DAEMON access level; you'll need something like ALLOW_DAEMON = ...,
> $(IP_ADDRESS), $(FULL_HOSTNAME)
>
> This isn't fatal, but it means that the master cannot watch for hung
> daemons.
>
> http://spinningmatt.wordpress.com/2009/10/21/condor_master-for-managing-processes/
>
>
>> SchedLog
>>
>> 8/25 11:18:06 (pid:1505) condor_write(): Socket closed when trying to
>> write 218 bytes to unknown source, fd is 12, errno=104
>> 8/25 11:18:06 (pid:1505) Buf::write(): condor_write() failed
>> 8/25 11:18:06 (pid:1505) All shadows are gone, exiting.
>> 8/25 11:18:06 (pid:1505) **** condor_schedd (condor_SCHEDD) pid 1505
>> EXITING WITH STATUS 0
>
> This could be a number of things. Best to leave it until you have the other
> issues resolved.
>
>
>> My configuration files are as follows:
>>
>> Master node
>>
>> PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
>> TCP_FORWARDING_HOST=75.101.194.52
>> PRIVATE_NETWORK_INTERFACE=10.116.39.128
>> UPDATE_COLLECTOR_WITH_TCP=True
>> HOSTALLOW_WRITE=$(ALLOW_WRITE), '*.internal'
>> HOSTALLOW_READ=$(ALLOW_READ),'*.internal'
>> LOWPORT=40000
>> HIGHPORT=40050
>> COLLECTOR_SOCKET_CACHE_SIZE=1000
>>
>> '*.internal' is matching my internal hostnames
>>
>> Slave node
>>
>> ubuntu@ec2-50-16-126-254:~$ sudo more /etc/condor_conf*
>> COLLECTOR_HOST = ec2-75-101-194-52.compute-1.amazonaws.com
>> PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
>> TCP_FORWARDING_HOST = 50.16.126.154
>> PRIVATE_NETWORK_INTERFACE = 10.112.45.248
>> COUNT_HYPERTHREAD_CPUS = False
>> DAEMON_LIST = MASTER, STARTD
>> UPDATE_COLLECTOR_WITH_TCP = True
>> #COUNT_HYPERTHREAD_CPUS = False
>> #UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
>> #MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
>> LOWPORT=40000
>> HIGHPORT=40050
>> DAEMON_LIST = MASTER, STARTD
>>
>> Thanks
>> Guillaume
>
> Along with getting away from condor 7.2, you should get away from
> HOSTALLOW_* and always avoid mixing HOSTALLOW_* and ALLOW_*.
>
> 7.2 has bugs in the TCP_FORWARDING_HOST, some discovered while creating the
> Cloud Foundations reference architecture.
>
> Best,
>
>
> matt
>



-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net