
Re: [Condor-users] Remote access to a pool [was setting up condor on ec2 machines]



Hi

After some trouble my cluster is set-up in the cloud.
What is the best way to access it from a remote machine that does not
have Condor installed? It looks like there is a SOAP interface, but is
there a Java API available on top of it?
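
In case it is useful context, this is roughly what I was planning to put in
the collector/schedd configuration to expose SOAP (an untested sketch; I
believe ENABLE_SOAP and ALLOW_SOAP are the right knob names for 7.6.x, but
please correct me if not):

  ENABLE_SOAP = TRUE
  # restrict who may talk SOAP to the daemons; the host list is just an example
  ALLOW_SOAP = *.compute-1.amazonaws.com

On top of that I was hoping to generate a Java client from the daemons' WSDL
files, but I have not found a ready-made library for it.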

Best Regards
Guillaume

On Fri, Aug 26, 2011 at 2:13 PM, tog <guillaume.alleon@xxxxxxxxx> wrote:
> Hi,
>
> Matt, thanks to your post
> (http://spinningmatt.wordpress.com/2011/06/12/getting-started-creating-a-multiple-node-condor-pool/),
> it worked just fine with 7.6.3
>
> My main problem now is to try to get rid of:
> ALLOW_WRITE = $(ALLOW_WRITE), 10.112.45.248, ip-10-112-45-248.ec2.internal
> for all nodes in my pool
>
> I guess that
> ALLOW_WRITE = $(ALLOW_WRITE), '*.ec2.internal'
>
> should work, but then I am opening the pool up to any node on EC2.
> Is there a way to accept only authenticated connections? What is the
> impact on performance?
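>
> If authentication is the way to go, I was thinking of something like the
> pool-password setup (an untested sketch based on my reading of the security
> section of the manual):
>
>   SEC_DAEMON_AUTHENTICATION = REQUIRED
>   SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
>   SEC_PASSWORD_FILE = /etc/condor/pool_password
>   ALLOW_DAEMON = condor_pool@*
>
> with the same password stored on every node via condor_store_cred (if I
> read the manual right).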
>
> Then I might need to move my manager inside my organization; I guess
> this can be taken care of by using the condor_shared_port daemon.
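>
> Reading the docs, I imagine the shared_port setup would look roughly like
> this on each node (a sketch only; the port number is just an assumption):
>
>   USE_SHARED_PORT = TRUE
>   SHARED_PORT_ARGS = -p 9618
>   DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
>
> so that everything funnels through a single inbound port that I can open
> in the firewall.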
>
> The other problem I do foresee is that Condor is using internal names
> ... I had to add ip-10-112-45-248.ec2.internal to ALLOW_WRITE, which
> will no longer resolve from the manager once it sits outside EC2. Can I
> force Condor to use the public interface and FQDN?
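>
> What I have in mind (guessing from the TCP_FORWARDING_HOST setting I
> already use, so treat this as a sketch with placeholder values) is
> something like, on each EC2 node:
>
>   TCP_FORWARDING_HOST = <that node's public IP or public DNS name>
>   PRIVATE_NETWORK_NAME = amazon-ec2-us-east-1d
>   PRIVATE_NETWORK_INTERFACE = <that node's internal 10.x address>
>   ALLOW_WRITE = $(ALLOW_WRITE), <the manager's public FQDN>
>
> so the daemons advertise the public address while still talking to each
> other over the internal network.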
>
> Regards
> Guillaume
>
>
>
> On Thu, Aug 25, 2011 at 7:20 PM, Timothy St. Clair <tstclair@xxxxxxxxxx> wrote:
>> Just as a side note, I'm in the process of creating a full outline of
>> how to configure and set up Fedora to allow spill-over to the cloud, using
>> the JobRouter.
>>
>> I'll ping back on this thread when I'm done and have something functioning.
>> I'll also make the AMI's publicly available.
>>
>> Cheers,
>> Tim
>>
>> On Thu, 2011-08-25 at 19:14 +0530, tog wrote:
>>> Hi Matt
>>>
>>> Thanks for the quick answer ...
>>> From what I read I should move to the latest 7.6.3 ;-)
>>>
>>> Anyway, I believe that until I move one of the roles outside of
>>> EC2, I should not face firewall problems. I think all ports are open
>>> inside a pool of machines belonging to the same EC2 reservation.
>>>
>>> Here is the port info:
>>>
>>> ubuntu@ec2-75-101-194-52:~$ sudo netstat -tlnp
>>> Active Internet connections (only servers)
>>> Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
>>> tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      472/sshd
>>> tcp        0      0 0.0.0.0:40002           0.0.0.0:*               LISTEN      3274/condor_negotia
>>> tcp        0      0 0.0.0.0:40034           0.0.0.0:*               LISTEN      3258/condor_startd
>>> tcp        0      0 0.0.0.0:40009           0.0.0.0:*               LISTEN      3259/condor_schedd
>>> tcp        0      0 0.0.0.0:40014           0.0.0.0:*               LISTEN      3257/condor_master
>>> tcp        0      0 0.0.0.0:9618            0.0.0.0:*               LISTEN      3292/condor_collect
>>> tcp        0      0 0.0.0.0:40019           0.0.0.0:*               LISTEN      3259/condor_schedd
>>> tcp6       0      0 :::22                   :::*                    LISTEN      472/sshd
>>> tcp6       0      0 :::1527                 :::*                    LISTEN      3024/java
>>>
>>> Will keep you updated once I have 7.6.3 installed ... I might be
>>> able to use condor_configure.
>>>
>>> Best Regards
>>> Guillaume
>>>
>>>
>>>
>>> On Thu, Aug 25, 2011 at 6:26 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
>>> > (inline)
>>> >
>>> > On 08/25/2011 08:26 AM, tog wrote:
>>> >>
>>> >> Hi
>>> >>
>>> >> I am trying to set up a full installation on EC2. By full I mean that
>>> >> all processes will be internal to the cloud (while keeping the option
>>> >> of having the manager inside my organization later).
>>> >> Therefore I decided to set the hostnames of all machines to their
>>> >> public DNS names (which is not the default on EC2).
>>> >
>>> > CM+Sched -(firewall)-> {dragons} -> {EC2: Startd} ?
>>> >
>>> > That's a pretty popular configuration, it seems. The Cloud Foundations
>>> > architecture documents have a walk-through of it, but since you are using
>>> > Ubuntu you may not have access to them.
>>> >
>>> >
>>> >> I have 2 machines running Ubuntu Lucid with condor coming from the
>>> >> distribution itself (7.2.4)
>>> >>
>>> >> Master  ec2-75-101-194-52.compute-1.amazonaws.com   75.101.194.52   10.116.39.128
>>> >> Worker  ec2-50-16-126-254.compute-1.amazonaws.com   50.16.126.254   10.112.45.248
>>> >
>>> > First problem is possibly that 7.2.4 is painfully old with many known bugs
>>> > at this point. Newer versions include the shared_port daemon, which helps
>>> > significantly with firewall configuration.
>>> >
>>> > http://spinningmatt.wordpress.com/2011/06/21/getting-started-multiple-node-condor-pool-with-firewalls/
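>>> >
>>> > With shared_port, all of a machine's daemons share a single inbound TCP
>>> > port, so the firewall side shrinks to one rule per host, roughly (a
>>> > sketch; 9618 assumed as the shared port, adjust to whatever you configure):
>>> >
>>> >   iptables -A INPUT -p tcp --dport 9618 -j ACCEPT
>>> >
>>> > or the equivalent single-port opening in the EC2 security group.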
>>> >
>>> >
>>> >> I have the following error messages that I cannot explain:
>>> >>
>>> >> MasterLog
>>> >>
>>> >> 8/25 11:17:36 DaemonCore: Command Socket at <75.101.194.52:40014>
>>> >> 8/25 11:17:36 Failed to listen(9618) on TCP command socket.
>>> >> 8/25 11:17:36 ERROR: Create_Process failed trying to start
>>> >> /usr/sbin/condor_collector
>>> >> 8/25 11:17:36 restarting /usr/sbin/condor_collector in 10 seconds
>>> >
>>> > Failed to listen on 9618, what's on that port (netstat -tl)?
>>> >
>>> >
>>> >> 8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_startd",
>>> >> pid and pgroup = 3258
>>> >> 8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_schedd",
>>> >> pid and pgroup = 3259
>>> >> 8/25 11:17:36 Started DaemonCore process
>>> >> "/usr/sbin/condor_negotiator", pid and pgroup = 3274
>>> >> 8/25 11:17:36 condor_write(): Socket closed when trying to write 973
>>> >> bytes to <10.116.39.128:9618>, fd is 8
>>> >> 8/25 11:17:36 Buf::write(): condor_write() failed
>>> >> 8/25 11:17:36 Failed to send non-blocking update to <10.116.39.128:9618>.
>>> >> 8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
>>> >> 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
>>> >> reason: DAEMON authorization policy contains no matching ALLOW entry
>>> >> for this request; identifiers used for this host:
>>> >> 10.116.39.128,ip-10-116-39-128.ec2.internal
>>> >> 8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
>>> >> 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
>>> >> reason: cached result for DAEMON; see first case for the full reason
>>> >> 8/25 11:17:41 PERMISSION DENIED to unauthenticated user from host
>>> >> 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
>>> >> reason: cached result for DAEMON; see first case for the full reason
>>> >> 8/25 11:17:46 Failed to listen(9618) on TCP command socket.
>>> >> 8/25 11:17:46 ERROR: Create_Process failed trying to start
>>> >> /usr/sbin/condor_collector
>>> >> 8/25 11:17:46 restarting /usr/sbin/condor_collector in 120 seconds
>>> >> 8/25 11:17:46 condor_write(): Socket closed when trying to write 1057
>>> >> bytes to unknown source, fd is 8, errno=104
>>> >> 8/25 11:17:46 Buf::write(): condor_write() failed
>>> >
>>> > CHILDALIVE is failing because the node is not configured to talk to itself
>>> > at the DAEMON access level; you'll need something like ALLOW_DAEMON = ...,
>>> > $(IP_ADDRESS), $(FULL_HOSTNAME)
>>> >
>>> > This isn't fatal, but it means that the master cannot watch for hung
>>> > daemons.
>>> >
>>> > http://spinningmatt.wordpress.com/2009/10/21/condor_master-for-managing-processes/
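>>> >
>>> > Concretely, something along these lines in the node's local config (a
>>> > sketch; merge it with whatever other ALLOW_DAEMON entries you need):
>>> >
>>> >   ALLOW_DAEMON = $(ALLOW_DAEMON), $(IP_ADDRESS), $(FULL_HOSTNAME)
>>> >
>>> > then condor_reconfig and check the effective value with
>>> > condor_config_val ALLOW_DAEMON.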
>>> >
>>> >
>>> >> SchedLog
>>> >>
>>> >> 8/25 11:18:06 (pid:1505) condor_write(): Socket closed when trying to
>>> >> write 218 bytes to unknown source, fd is 12, errno=104
>>> >> 8/25 11:18:06 (pid:1505) Buf::write(): condor_write() failed
>>> >> 8/25 11:18:06 (pid:1505) All shadows are gone, exiting.
>>> >> 8/25 11:18:06 (pid:1505) **** condor_schedd (condor_SCHEDD) pid 1505
>>> >> EXITING WITH STATUS 0
>>> >
>>> > This could be a number of things. Best to leave it until you have the other
>>> > issues resolved.
>>> >
>>> >
>>> >> My configuration files are the following:
>>> >>
>>> >> Master node
>>> >>
>>> >> PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
>>> >> TCP_FORWARDING_HOST=75.101.194.52
>>> >> PRIVATE_NETWORK_INTERFACE=10.116.39.128
>>> >> UPDATE_COLLECTOR_WITH_TCP=True
>>> >> HOSTALLOW_WRITE=$(ALLOW_WRITE), '*.internal'
>>> >> HOSTALLOW_READ=$(ALLOW_READ),'*.internal'
>>> >> LOWPORT=40000
>>> >> HIGHPORT=40050
>>> >> COLLECTOR_SOCKET_CACHE_SIZE=1000
>>> >>
>>> >> '*.internal' matches my internal hostnames
>>> >>
>>> >> Slave node
>>> >>
>>> >> ubuntu@ec2-50-16-126-254:~$ sudo more /etc/condor_conf*
>>> >> COLLECTOR_HOST = ec2-75-101-194-52.compute-1.amazonaws.com
>>> >> PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
>>> >> TCP_FORWARDING_HOST = 50.16.126.154
>>> >> PRIVATE_NETWORK_INTERFACE = 10.112.45.248
>>> >> COUNT_HYPERTHREAD_CPUS = False
>>> >> DAEMON_LIST = MASTER, STARTD
>>> >> UPDATE_COLLECTOR_WITH_TCP = True
>>> >> #COUNT_HYPERTHREAD_CPUS = False
>>> >> #UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
>>> >> #MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
>>> >> LOWPORT=40000
>>> >> HIGHPORT=40050
>>> >> DAEMON_LIST = MASTER, STARTD
>>> >>
>>> >> Thanks
>>> >> Guillaume
>>> >
>>> > Along with getting away from condor 7.2, you should get away from
>>> > HOSTALLOW_* and always avoid mixing HOSTALLOW_* and ALLOW_*.
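>>> >
>>> > For instance, the two HOSTALLOW_* lines in your master config could become
>>> > something like (a sketch; make sure the patterns only match your own hosts):
>>> >
>>> >   ALLOW_WRITE = $(ALLOW_WRITE), *.ec2.internal
>>> >   ALLOW_READ = $(ALLOW_READ), *.ec2.internal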
>>> >
>>> > 7.2 has bugs in its TCP_FORWARDING_HOST handling, some of which were
>>> > discovered while creating the Cloud Foundations reference architecture.
>>> >
>>> > Best,
>>> >
>>> >
>>> > matt
>>> >
>>>
>>>
>>>
>>
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/condor-users/
>>
>
>
>
> --
> PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
>



-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net