Re: [Condor-users] DAG questions
R. Kent Wenger wrote:
> Unless your DAG is really "wide" (most of the 3500 nodes in the queue
> at one time) upgrading to 7.4 should fix your file
Yes, ours is really wide: 100k nodes, with no dependencies. We use
MAX_JOBS to limit how many jobs DAGMan releases at any one time. The main
reasons we use DAGMan are for pre/post scripts, the retry mechanism, and
to slowly release jobs to Condor. The jobs themselves are independent
parts of a parameter sweep. We then collect results from completed jobs.
As promised, 7.4 is much better: 26 minutes to submit the DAG with 7.2
was reduced to 7 seconds.
I'm looking for more opportunities to speed things up with DAGMan. My
new slowdown is with the rate at which DAGMan attempts to submit jobs.
I submitted my 100k node DAG around noon, and now 3 hours later I only
have 250 jobs running, 700 queued, and tens of thousands left. Each job
runs for around 5 minutes, and we seem to be stuck at a steady state of
about 250 running jobs.
I'd at least like to have as many jobs queued as my MAX_JOBS limit
allows (currently set to 2000). I have DAGMAN_MAX_SUBMITS_PER_INTERVAL=250,
which seemed reasonable, but is perhaps too low.
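
For a rough sense of scale, here is a back-of-the-envelope sketch. It
assumes every job takes exactly 5 minutes and that a freed slot is
refilled immediately; the function names are only for illustration, and
the 2000-running case is hypothetical (2000 is my queue limit, not a
measured number of running jobs).

```python
def steady_state_rate(running_jobs, minutes_per_job):
    """Jobs completed per minute when `running_jobs` slots stay busy."""
    return running_jobs / minutes_per_job


def hours_to_drain(total_jobs, running_jobs, minutes_per_job):
    """Hours needed to finish `total_jobs` at the steady-state rate."""
    rate = steady_state_rate(running_jobs, minutes_per_job)
    return total_jobs / rate / 60.0


if __name__ == "__main__":
    total = 100_000          # nodes in the DAG
    minutes_per_job = 5      # assumed constant run time per job
    for running in (250, 2000):
        rate = steady_state_rate(running, minutes_per_job)
        hours = hours_to_drain(total, running, minutes_per_job)
        print(f"{running} running: {rate:.0f} jobs/min, {hours:.1f} hours")
```

Under these assumptions, keeping even 250 slots busy means DAGMan has
to submit about 50 jobs per minute on average, so the submit rate is
what I need to get above that figure.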
If it would help, I could also investigate setting up a single classad
for all the jobs and using the VARS command in the DAG file to customize
each instance.
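
What I have in mind is roughly the following sketch, which generates
the DAG file so that every node points at one shared submit file and
gets its own VARS line. The submit file name (sweep.submit), the node
naming scheme, the variable name "param", and the retry count are all
hypothetical placeholders; the shared submit file would refer to the
value as $(param).

```python
def dag_lines(params, submit_file="sweep.submit", retries=3):
    """Return DAG file lines: one JOB/VARS/RETRY triple per parameter.

    All nodes share `submit_file`; the per-node value is passed through
    VARS as a macro named `param` (a hypothetical name).
    """
    lines = []
    for i, value in enumerate(params):
        node = f"sweep_{i:06d}"                       # hypothetical node name
        lines.append(f"JOB {node} {submit_file}")
        lines.append(f'VARS {node} param="{value}"')
        lines.append(f"RETRY {node} {retries}")
    return lines


def write_dag(path, params, **kwargs):
    """Write the generated lines to `path`, one per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(dag_lines(params, **kwargs)) + "\n")


if __name__ == "__main__":
    # Print a two-node example instead of writing a file.
    print("\n".join(dag_lines([0.1, 0.2])))
```

The PRE/POST script lines would be emitted per node in the same loop;
I have left them out because they depend on our local scripts.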
> If you're running a 7.4 DAGMan, a new feature is that you don't have
> to specify a log file at all in your submit file -- if you don't,
> DAGMan will assign a default log file for you. In fact, this may be
> the preferred way to do things, especially if you want to re-use your
> submit files in more than one DAG. The default log files are per-DAG,
> so if you use the same submit file in two different DAGs you won't
> have to worry about log file collisions if you use the default log
> file feature.
This sounds interesting. Is there any way to force DAGMan to do this,
even if a log file is specified in the individual classad files? The
reason I ask is because I'd like to keep the layered model I have right
now where the node classads are self-contained and can be individually
submitted if required. These will need the "Log = ... " attribute.
Finally, we are working on figuring out how to monitor and visualize the
progress of our DAG. Is there some way to do DOT file generation "on
demand"? Or does someone with more experience think it is safe in our
environment (100k nodes, 6000 active, 5-10 minutes per node to complete,
500-2000 running at any given time) to have UPDATE enabled for automatic
DOT file generation? On the command/file side, it seems the dagman.out
log file and condor_q -dag are the only sources of monitoring
information pertaining to the DAG's state and progress, or are there
other places/commands I'm not aware of?
Thanks,
Ian
--
Ian Stokes-Rees W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx T: +1 617 432-5608 x75
SBGrid, Harvard Medical School F: +1 617 432-5600