Re: [condor-users] Scaling to hundreds, then thousands of nodes
- Date: Wed, 03 Mar 2004 10:57:31 -0600
- From: Alain Roy <roy@xxxxxxxxxxx>
- Subject: Re: [condor-users] Scaling to hundreds, then thousands of nodes
> 1) We're using a single machine as the central manager and the only submit
> machine. Is this inappropriate?
In general, I like to see multiple submit points and a distinct central
manager if you have thousands of jobs. It's more likely to scale easily for
you.
We recently helped set up a Condor pool with about 1,000 CPUs and 10,000
queued jobs that used only a single (Linux) submit point, and it works
well. But if you experience problems scaling to 4,000 running jobs, you may
need multiple submit machines.
Realize that each running job (as opposed to queued, idle jobs) will have a
unique process associated with it on the submit machine.
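If you do split things up, it is mostly a matter of which daemons run where.
Here is a rough sketch of what the condor_config daemon lists might look like
with a dedicated central manager and a separate submit machine; the hostname
cm.example.edu is just a placeholder:

  # On the dedicated central manager (collector + negotiator, no schedd):
  DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

  # On each submit machine (runs the schedd that manages the job queue):
  DAEMON_LIST = MASTER, SCHEDD
  CONDOR_HOST = cm.example.edu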
> 2) Can I use the CondorView module to create HTML grid-statistics pages
> under Windows?
I don't know what you mean by that. Can you see this web page okay on Windows?
http://pumori.cs.wisc.edu/condor-view-applet/
Or are you asking if it can be hosted on a Windows web server? I don't know
about that. It's a set of simple scripts, but currently they require the
Bourne shell to run. They may be easy to get running under Cygwin or to port,
but I don't know.
> 4) We're doing molecular simulations on grid nodes, and the required
> bandwidth is pretty intensive. To start a simulation on a grid node
> requires downloading just over 3MB of data.
How similar is the data for each job? Does every job start with the same
set of data, but different parameters? If so, there may be good ways to
prestage the data, either manually or with some clever Condor technology.
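For example, if every job needs the same 3MB data set plus a small per-job
parameter file, you could prestage the big file on the execute machines (or a
nearby file server) and only transfer the small part with each job. A rough
vanilla-universe submit file along those lines might look like this (sim.exe,
common_data.dat, and params.$(Process) are made-up names):

  universe   = vanilla
  executable = sim.exe
  # Only the small, per-job parameter file travels with each job; the 3MB
  # common_data.dat is assumed to already live on the execute machines.
  transfer_input_files = params.$(Process)
  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT
  output = sim.$(Process).out
  error  = sim.$(Process).err
  log    = sim.log
  queue 2000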
> Getting simulation results requires uploading about 20MB of
> data. Restarting simulations requires uploading, then downloading
> anywhere from 3-20 MB of data. We want to run thousands of sims
> simultaneously, all of which could be preempted during the course of a
> typical school day. How can we best mitigate the exploding bandwidth
> requirements? Our central manager has a direct connection to a
> fiberoptic backbone connecting many schools, with T1s or T3s into the
> rest. However, I worry that my central manager may get swamped with
> returning files. After all, 2000 machines returning 20MB of data is
> 40GB, which could be problematic to say the least. Suggestions?
Standard universe has the ability to choose an "appropriate" checkpoint
server, which is a checkpoint server that you decide is "close" to the
machine the job was running on.
I realize that this doesn't translate to the vanilla universe, but I wonder
if we could do something similar. Hmmm...
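For what it's worth, in the standard universe the knobs look roughly like
this; you'd set them in the condor_config of the execute machines, with
ckpt-server-1.example.edu as a placeholder for whichever server you consider
"close":

  # Checkpoint to a nearby server instead of sending checkpoints all the
  # way back across the network to the submit machine.
  USE_CKPT_SERVER  = True
  CKPT_SERVER_HOST = ckpt-server-1.example.edu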
> 5) Does Condor have an intrinsic limitation that would prevent running
> thousands of jobs simultaneously?
No intrinsic limitation.
When a job is running, it is monitored by a process back on the submit
machine, but not by any centralized process. So as long as your submit host isn't
overwhelmed, you can have thousands of jobs running. This is why I say that
having multiple submit points may be useful.
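If you do stay with a single submit machine, one knob worth knowing about is
the schedd's cap on simultaneously running jobs, which limits how many of
those per-job processes it will start at once. A minimal sketch, with the
number picked arbitrarily:

  # In the submit machine's condor_config: cap the number of jobs (and
  # therefore per-job processes) this schedd will run at the same time.
  MAX_JOBS_RUNNING = 2000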
Let me ask you: why do you want a single submit point? Do you have a single
user submitting jobs? Are you building a web-based portal that lets users
submit jobs from their web browser? Or is it just for convenience?
The first two are good reasons for having a single submit point, and I
recognize that it's a lot of work to use multiple submit points for them. :)
-alain