Mailing List Archives
Re: [HTCondor-users] Max Jobs
- Date: Tue, 03 Jun 2014 11:09:09 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Max Jobs
I'd like to add a few points to this discussion of how many jobs can be
queued:
1. We have users who regularly have 150k+ jobs in the queue on one schedd.
2. Note that the number of jobs that can be queued grows horizontally
with HTCondor, i.e. you can always add more schedds to your pool to
manage more queued jobs if you want to submit more than any single
schedd or submit server can handle.
3. It is much faster to submit many jobs as a single cluster of jobs,
and it uses much less RAM as well. What I mean is you are much better
off running condor_submit once with something like
executable = foo
output = output.$(Cluster).$(Process)
queue 50000
in your submit file than running condor_submit 50,000 times with
something like
executable = foo
output = output.$(Cluster).$(Process)
queue
4. If you want to submit more jobs than a single schedd (or server)
can handle, you can use DAGMan today to describe a workflow containing
all your jobs (hundreds of thousands or whatever), and then tell DAGMan
to limit the number of job clusters it submits into the schedd. E.g. if
you want to submit a million jobs, make a submit file like the above
that submits 5000 jobs at a time as a DAG node, then create a DAG that
submits 200 instances of that node and have DAGMan limit the number of
simultaneously submitted job clusters to just a handful (depending on
the number of machines in your pool); a rough sketch appears after this
list. See http://goo.gl/tvz5rn. We have users who regularly submit DAGs
that consist of over 700k jobs.
5. We are exploring ways in the v8.3 development series to enable users
to enqueue millions of jobs without requiring the use of DAGMan.
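To make point 4 a bit more concrete, here is a rough sketch (file names
like batch.sub and jobs.dag are just placeholders I invented). Suppose
batch.sub is a submit file like the one in point 3 that ends with
"queue 5000". The DAG file then simply lists 200 independent nodes that
all reuse it:
JOB block000 batch.sub
JOB block001 batch.sub
...
JOB block199 batch.sub
Submitting it with something like
condor_submit_dag -maxjobs 4 jobs.dag
should keep no more than a handful of those 5000-job clusters in the
schedd's queue at any one time, since -maxjobs throttles how many node
clusters DAGMan will have submitted at once; adjust the number to suit
the size of your pool.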
Hope the above pointers help,
Todd
On 6/3/2014 8:54 AM, Ben Cotton wrote:
Oops, forgot to include the link to the wiki page mentioned:
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageLargeCondorPools
On Tue, Jun 3, 2014 at 9:40 AM, Ben Cotton
<ben.cotton@xxxxxxxxxxxxxxxxxx> wrote:
Suchandra,
The default value of MAX_JOBS_SUBMITTED is the largest integer
supported on your platform. However, there are some constraints that
may prevent you from reaching that limit. I have seen ~50k jobs in a
queue before, but condor_q calls can get pretty sluggish at that
point.
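(If you want to double-check what value your schedd is actually picking
up, the generic config query tool will print it, e.g.
condor_config_val MAX_JOBS_SUBMITTED
run on the submit host.)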
The HTCondor wiki[1] says "Schedd requires a minimum of ~10k RAM per
job in the job queue. For jobs with huge environment values or other
big ClassAd attributes, the requirements are larger." Technically,
you'll need more disk space with a larger job queue, but it's such a
small percentage of even the smallest disks these days that it's not
worth worrying about.
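To put that ~10k figure in rough perspective (my arithmetic, not the
wiki's): 100,000 queued jobs x ~10 KB/job is on the order of 1 GB of
RAM for the schedd, and proportionally more if your ClassAds are big.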
For our customers who use CycleServer to send jobs to schedulers, we
suggest setting the maximum queue size to about 3 times the value of
MAX_JOBS_RUNNING. If you have something similar that buffers jobs
before submission, a similar ratio seems reasonable. If you're only
submitting directly to the scheduler, then you will need to try
different values to see what works best for your use case.
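For example (the numbers here are made up purely to illustrate the
ratio), if your submit host is configured with
MAX_JOBS_RUNNING = 10000
then we would cap the queue our tools maintain at roughly 30,000 jobs;
if you wanted to enforce a similar cap on the schedd itself, the
corresponding knob would be something like
MAX_JOBS_SUBMITTED = 30000
in the schedd's configuration.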
Thanks,
BC
--
Ben Cotton
main: 888.292.5320
Cycle Computing
Leader in Utility HPC Software
http://www.cyclecomputing.com
twitter: @cyclecomputing
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx>
HTCondor Technical Lead
Center for High Throughput Computing
University of Wisconsin-Madison, Department of Computer Sciences
1210 W. Dayton St. Rm #4257, Madison, WI 53706-1685
Phone: (608) 263-7132