Cor Cornelisse wrote:
> > > Condor glide-in looks a bit like overkill to me, since we'll then be running condor within condor.
> >
> > The overhead of running an extra startd for each job is typically not significant. However, glidein still requires bi-directional connectivity between submit and execute machines, so you would need to use GCB within the glidein pool itself. Within the underlying pools, you would not necessarily have to use GCB, as long as you have one public schedd per pool. The glideins could be submitted on demand from some central location to each of these publicly accessible schedds. Of course, it would take some effort to set all that up and maintain it.
>
> So let's say I have two condor pools and one additional submit machine, and this submit machine has bi-directional communication with both pool schedulers. Then glidein creates a sort of "virtual" pool and will need GCB to enable execute machines from one pool to contact execute machines in the other pool?! Sounds nice; it would be interesting to see how much load this puts on the network.
Yes. If there was one publicly accessible schedd per pool, you could submit the glideins to these via Condor-C. These schedds would then run the glideins on their respective pools, and the glideins would "phone home" and become part of a pool that spans across the different parts of your network. The glidein pool would need GCB to provide bidirectional connectivity in the following cases:
    central manager <--> execute machines
    submit machines <--> execute machines

As Greg Thain pointed out, communication between the execute machines (e.g. for MPI) is an additional problem that would need to be worked out.
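To make the "submit glideins via Condor-C" idea from above concrete, a submit file might look roughly like the sketch below. This is only an illustration: the hostnames and the glidein startup script are placeholders, not anything from this thread, and the details would need to be adapted to the actual pools.

```
# Hypothetical Condor-C submit file for sending a glidein to one
# publicly accessible schedd (all hostnames are placeholders).
universe      = grid
grid_resource = condor schedd1.poolA.example.org cm.poolA.example.org

# glidein_startup.sh would unpack and launch a condor_startd that
# "phones home" to the glidein pool's central manager.
executable    = glidein_startup.sh
arguments     = -collector glidein-cm.example.org

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```

With one such submit file per underlying pool, a central machine could submit glideins on demand to each publicly accessible schedd.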
> > > I've spent quite some time reading documentation, and the only thing I could come up with is using GCB to create one big pool. However, this would severely affect the scalability.
> >
> > From what I have seen, pools on the order of 2000 CPUs are practical, with some attention to configuration details. Beyond that, I lack the experience to comment.
>
> We are talking about under a hundred boxes right now; maybe in the future it will scale up, but certainly never to more than a few hundred.
Then I would say a single pool is no problem from a scalability standpoint, assuming all the network problems could be worked out.
> We might like to add an existing cluster in the future, and if we were using GCB, the existing cluster's configuration would have to be adapted to use GCB and join our pool.

There is an active effort to make GCB less invasive, so that, for example, communication within a pool could take place without any dependence on GCB, while communication with external submitters would use GCB. As it exists today, you are correct that GCB is all or nothing.
I should clarify my statement that "GCB is all or nothing". What I mean is that in order to allow condor daemons on a node to accept incoming connections from outside of a NAT or firewall, you need to have this node use GCB, and this currently implies that all connections to this node, even from within the NAT or firewall will also involve GCB. It does _not_ mean that all network traffic will pass through the GCB server. With a suitably configured GCB routing table, it is possible for a direct connection to be formed in many cases. However, the current implementation cannot form this direct connection without some communication with the GCB server, which adds latency and creates an additional point of failure.
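For reference, enabling GCB on such a node is a matter of a few condor_config knobs. The fragment below is a sketch based on the GCB-related settings in the Condor manual of that era; the broker address and routing-table path are placeholders, and you would need to check the exact names against the version of Condor you are running.

```
# Hypothetical condor_config fragment for a node behind a NAT or
# firewall that must accept connections brokered by GCB.
NET_REMAP_ENABLE  = TRUE
NET_REMAP_SERVICE = GCB

# Public address of the GCB broker this node registers with
# (placeholder):
NET_REMAP_INROUTE = 123.45.67.89

# Optional routing table telling GCB which peers are directly
# reachable, so traffic need not relay through the broker:
#NET_REMAP_ROUTE = /path/to/gcb_routing_table
```

This is what makes GCB "all or nothing" per node today: once NET_REMAP_ENABLE is set, all connections to that node involve GCB, even those from inside the NAT.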
> Let's say I have one MPI job requiring 30 CPUs and submit it to the cluster, which is, say, made up of 2 pools with 20 worker nodes. One condor pool needs to be able to communicate with the other one, right? Even worse, every worker node needs to be able to contact any other worker node? Which would in my case imply adding GCB to the basic needs, which in turn might be easiest to realize through glidein.
Whether glidein is the "easiest" approach in this case depends on how difficult it would be for you to apply GCB to the underlying pools versus how difficult it would be for you to submit condor glideins to the different pools.
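For what it's worth, the 30-CPU MPI job itself would be submitted to the (glidein or GCB-spanned) pool with Condor's parallel universe, along the lines of the sketch below. The wrapper script name is a placeholder; sites typically adapt one of the sample MPI wrapper scripts shipped with Condor, and the schedd must be configured as a dedicated scheduler for parallel jobs.

```
# Hypothetical submit file for the 30-CPU MPI job discussed above.
universe      = parallel
executable    = mpi_wrapper.sh        # site-provided MPI launch wrapper
arguments     = my_mpi_program        # placeholder program name
machine_count = 30

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```

Note that all 30 slots must be able to open connections to one another, which is exactly the worker-to-worker connectivity requirement (and hence the GCB requirement) described above.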
--Dan