Re: [Condor-users] restricting the number of jobs
- Date: Mon, 12 Oct 2009 15:44:23 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] restricting the number of jobs
On Mon, 12 Oct 2009, Dillman, Kimberley A wrote:
> I think this is because of "maxidle" versus "maxjobs".
> I saw a similar thing when I used a small number for "maxidle", like 3.
> I had a DAG of 10 jobs and tried to limit "maxidle" to 3. Since it
> appears that DAGMan submits 5 jobs at a time, you still get 5 rather
> than 10, because a job isn't "idle" until it's actually submitted.
> Since my DAGMan submitted 5 jobs at a time, you can't get 3, so I got
> 5; but at least I didn't get all 10.
You can control the maximum number of submits per cycle with the
DAGMAN_MAX_SUBMITS_PER_INTERVAL configuration variable. As you surmise,
the default for this is five. If you're trying for really fine-grained
control of the number of jobs in the queue, you may want to set a lower
value.
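For example, a sketch of the relevant configuration (the value shown is illustrative, not a recommendation):

```
# Local Condor configuration on the submit host:
# submit at most 1 job per DAGMan submit cycle instead of the
# default 5, so a small maxidle is tracked more closely.
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 1
```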
> However, if I use "maxjobs" set to 3, I get just 3 submitted at a
> time, since DAGMan can enforce maxjobs prior to submitting them.
> At least I think that's how it works, because that is what I
> interpreted from the documentation and what I saw happen in my 10-job
> test DAG.
Yes, that's correct -- DAGMan doesn't consider jobs idle until they're
submitted. One thing to keep in mind is that maxidle is kind of a rough
setting. Jobs that are running can become idle, so you're not guaranteed
to never exceed the maxidle setting. DAGMan never *removes* jobs from the
queue to try to maintain the maxidle setting -- it just throttles how fast
it submits them.
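As a toy model (not HTCondor source code) of why a small maxidle can be overshot: consistent with the behavior described in this thread, the idle count is only checked between submit cycles, and each cycle submits up to DAGMAN_MAX_SUBMITS_PER_INTERVAL jobs.

```python
def peak_idle(total_jobs, maxidle, submits_per_interval):
    """Worst-case peak idle count for a DAGMan-style submit loop,
    assuming no job starts running during the submit phase."""
    idle = submitted = peak = 0
    while submitted < total_jobs:
        if idle >= maxidle:
            break  # real DAGMan would wait for jobs to start running
        # The throttle is checked per cycle, so a whole batch goes in
        # even if it pushes the idle count past maxidle.
        batch = min(submits_per_interval, total_jobs - submitted)
        submitted += batch
        idle += batch
        peak = max(peak, idle)
    return peak

# With maxidle=3 and the default of 5 submits per interval, the peak
# idle count is 5, not 3 -- matching the 10-job example above.
print(peak_idle(10, 3, 5))  # -> 5
print(peak_idle(10, 3, 1))  # -> 3
```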
> However, the documentation appears to indicate that "maxjobs" will
> only count each "job" in the DAG file and doesn't count each
> individual "job" within a single submit file (i.e., "queue 500", for
> instance, would count as 1 "job"). It does indicate that the "maxidle"
> option (since it looks at each job in the queue separately for
> counting purposes) will throttle a "single" job in the DAG with many
> individual "jobs" (i.e., "queue xxx"), since it counts them AFTER they
> are submitted, so it looks at them as individual jobs. Kind of
> confusing, but that is why "maxidle" doesn't appear to hold to the
> exact number: DAGMan submits the jobs in "groups" of x (5 in my case,
> but I'm not sure exactly where that comes from yet), and "maxidle"
> doesn't take effect to suppress additional submissions until after
> they are submitted.
Yes, that's right. Just keep in mind that maxjobs is a "harder" limit
than maxidle is.
> Kind of confusing, but it seems to work okay, especially if you keep
> "maxidle" greater than whatever the single group submission number is
> for a DAG.
> If anyone can explain how this works in more detail, I would love to
> hear about it, to save some time on experimentation to figure it
> out. :-)
Okay, here's an explanation that hopefully makes some sense...
One thing to keep in mind with both maxidle and maxjobs is that they were
really designed to work in the situation where each submit file queues a
single job. DAGMan's ability to even handle submit files that queue more
than one job is a pretty recent addition. Also, DAGMan can only throttle
things at the level of a submit file -- so if you have submit files that
queue 10 jobs, the smallest increment that DAGMan can work with is 10
jobs...
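As an illustration (file and job names are hypothetical), a DAG node whose submit file queues 10 jobs is still a single throttling unit to DAGMan:

```
# mydag.dag -- one DAGMan node, counted as 1 "job" by maxjobs
JOB A node.sub

# node.sub -- queues 10 procs under that single node; maxidle
# counts each of the 10 individually once they are in the queue
executable = worker
arguments  = $(Process)
queue 10
```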
The difference in how maxjobs and maxidle count things is more an artifact
of how the implementation works for multiple-job submit files than a
design decision.
So the background on maxidle is this: we had implemented maxjobs, but a
user who has a big pool, shared among a number of users, wanted something
more flexible than maxjobs. The problem was, they couldn't really predict
ahead of time what maxjobs should be set to, because it depended on the
load that other users were putting on their pool. So eventually someone
said, "how about if DAGMan keeps submitting jobs until the jobs aren't
getting run?" and that was how the idea of maxidle was born. The idea is
that you set maxidle to some fraction of the number of machines in your
pool, and DAGMan will keep feeding jobs in to keep the pool fully
utilized, without flooding the queue with lots of jobs that won't run for
a long time. (We have users running 500k-node DAGs, so they really
don't want to submit, say, 100k jobs in one shot.)
So if you want a strict limit on the number of jobs running, you should
use maxjobs, but if you want to maximize the utilization of your pool, you
should use maxidle.
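On the command line, either throttle can be set when the DAG is submitted (the numbers below are illustrative):

```
# Hard cap: never more than 20 node jobs in the queue at once
condor_submit_dag -maxjobs 20 mydag.dag

# Soft throttle: stop submitting while 50 or more jobs are idle
condor_submit_dag -maxidle 50 mydag.dag
```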
Kent Wenger
Condor Team