Re: [HTCondor-users] Avoiding combinatorial explosion in dependencies between spliced DAGS
- Date: Thu, 30 Jul 2015 13:58:34 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Avoiding combinatorial explosion in dependencies between spliced DAGS
On Thu, 30 Jul 2015, John N Calley wrote:
> I make a lot of use of SPLICE-ing to compose dags into complex
> workflows and these often have dependencies on each other. DAGMAN deals
> with this by adding dependencies between every final node for the PARENT
> dag and every initial node of the CHILD dag. When there are thousands of
> initial and final nodes (as is common with my workflows) this can result
> in extremely large numbers of dependencies and I've had cases where
> parsing a rescue dag took quite a few hours. I've been living with this
> for a while, but I recently came up with a work-around and I wondered if
> others might have any thoughts on it or perhaps better ways of dealing
> with the issue.
We're glad that you're finding splices useful. Hopefully we can make some
improvements to make them more useful...
> What I have now started to do is to add a final NOOP job to each of my
> sub-dags, so at least I just have all the dependencies from initial jobs
> in the CHILD dag with this single final place-holder node. I assume that
> I could do the same thing to make every one of my dags start with a NOOP
> initial node that all the real initial nodes depend on, though I haven't
> actually tried this. This is clearly not the intended use of the NOOP
> keyword and it's a bit of a hack, so I wondered if others had better
> ideas?
Hmm, I wouldn't consider this a hack. There's not really a specific
"intended" use for NOOP nodes -- they're for whatever someone finds
useful, as in this case.
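To put rough numbers on the workaround discussed above, here is a minimal Python sketch. It assumes the behaviour described in the quoted message (one dependency per final-node/initial-node pair between a PARENT and a CHILD splice) and generates a sub-DAG that ends in a single NOOP node. The function names, the node name `JOIN`, and the file name `noop.sub` are placeholders invented for illustration, and the generated sub-DAG is simplified to a set of independent nodes that are all final nodes.

```python
def cross_edges(n_final: int, n_initial: int) -> int:
    """Dependencies between two spliced DAGs, assuming one edge per
    (final node of PARENT, initial node of CHILD) pair."""
    return n_final * n_initial


def edges_with_join(n_final: int, n_initial: int) -> int:
    """Edge count when the PARENT sub-DAG ends in one NOOP join node.

    Every original final node gets one edge to the join node, and the
    join node is then the only final node seen by the CHILD splice.
    """
    return n_final + cross_edges(1, n_initial)


def subdag_with_join(node_names, submit_file, join_name="JOIN"):
    """Return DAG-file text for independent nodes followed by a NOOP node.

    'noop.sub' is a placeholder submit file name (an assumption); the
    NOOP keyword tells DAGMan not to submit a job for that node.
    """
    lines = [f"JOB {name} {submit_file}" for name in node_names]
    lines.append(f"JOB {join_name} noop.sub NOOP")
    lines.append(f"PARENT {' '.join(node_names)} CHILD {join_name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(cross_edges(2000, 2000))      # without a join node
    print(edges_with_join(2000, 2000))  # with a NOOP join node
    print(subdag_with_join(["A1", "A2", "A3"], "analysis_a.sub"))
```

Adding a matching NOOP node at the start of the CHILD sub-DAG, as the quoted message suggests, would leave only a single dependency between the two splices, at the cost of one more edge per initial node inside the CHILD.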
> Also, it would seem that it would be easy for DAGMAN to do this for me
> as part of the SPLICE-ing process and the result would be a good deal
> cleaner. I don't see any reason for DAGMAN not to do this. Am I missing
> something? If not, please consider it a feature request.
That's actually something we thought of pretty much when splices were
first implemented. Anyhow, there is already a corresponding feature
request:
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3587,4
I guess it has languished until now because nobody has run into a use
case where it was really necessary (or, if they did, we didn't find out
about it).
Maybe it's time to move that up in priority... At any rate, though,
there's no reason to not do it, other than its relative priority among the
several hundred outstanding DAGMan bugs/feature requests.
> What I'd really like to do is to reach 'into' each sub-dag and insert
> dependencies between specific final nodes and specific initial nodes.
> I've considered hacking this solution together, but the ways of doing it
> that I can think of seem inelegant. I wonder if anyone has thoughts on
> how to do this kind of thing cleanly? To expand a bit, this comes up
> when I want to do Analysis A on samples 1-2000 and then I want to do
> Analysis B on the same samples. Analysis B for sample 1 depends on
> Analysis A for the same sample, but not on Analysis A for any other
> samples. It's a shame to require that Analysis A finish for all samples
> before I start Analysis B for any samples, but that is what I feel stuck
> with at the moment.
So you're saying that right now you have all of the A nodes in one splice,
and all of the B nodes in another splice, right? I guess one thing I
would want to understand in this case is what is driving your
decomposition of the workflow. Because if you have a single splice that
has all of your As and all of your Bs, you could do this easily. Or, if
your decomposition is governed by size, you could have a splice that has
A1-A100 and B1-B100, another splice that has A101-A200, B101-B200, etc.
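The size-based decomposition just described can be sketched with a small Python generator. This is only an illustration under stated assumptions: the file names (`chunkN.dag`, `top.dag`), the splice names, the submit file names, and the `sample` macro are all hypothetical. Each chunk file pairs A and B for the same sample, and the top-level DAG declares no dependencies between splices, so B for a sample waits only on A for that sample.

```python
def chunked_splices(n_samples, chunk_size,
                    a_submit="analysis_a.sub", b_submit="analysis_b.sub"):
    """Return {file name: DAG-file text} for per-chunk splices.

    All file, splice, and macro names here are illustrative placeholders.
    """
    files = {}
    top_lines = []
    for idx, start in enumerate(range(1, n_samples + 1, chunk_size)):
        stop = min(start + chunk_size - 1, n_samples)
        lines = []
        for s in range(start, stop + 1):
            lines.append(f"JOB A{s} {a_submit}")
            lines.append(f'VARS A{s} sample="{s}"')
            lines.append(f"JOB B{s} {b_submit}")
            lines.append(f'VARS B{s} sample="{s}"')
            # B for this sample depends only on A for the same sample.
            lines.append(f"PARENT A{s} CHILD B{s}")
        chunk_file = f"chunk{idx}.dag"
        files[chunk_file] = "\n".join(lines) + "\n"
        # No PARENT/CHILD lines between splices: chunks are independent.
        top_lines.append(f"SPLICE CHUNK{idx} {chunk_file}")
    files["top.dag"] = "\n".join(top_lines) + "\n"
    return files


if __name__ == "__main__":
    generated = chunked_splices(2000, 100)
    print(generated["top.dag"])
```

Because the A-to-B dependency lives inside each chunk, no cross-splice dependencies are generated at all, which also sidesteps the dependency blow-up between splices.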
If you really do need to have all of the As in one splice and all of the
Bs in another I guess it might be possible to implement some kind of
"weaker" dependency between splices, wherein a given node in the
second splice only depends on some of the nodes in the first splice. That
would definitely take some thinking, though, about how the dependencies
should be specified, and this is something that hasn't come up previously,
as far as I know, so I don't have any pre-existing ideas on it.
So, to summarize:
1) There's no problem with using NOOP nodes as you describe.
2) There's no reason to not have DAGMan automatically introduce such
nodes. (This would also allow splices to have pre and post scripts, which
would make them more consistent with sub-DAGs.)
3) Before any kind of implementation of the more flexible inter-splice
dependencies, there would have to be some serious thinking involved,
probably starting with a better understanding of your use case.
Kent Wenger
CHTC Team