On Fri, 26 Jul 2013, Antonio Chay wrote:

First of all, glad to hear that you're using DAGMan!
- For DAGMan: How can I have one big submit file and get all the "rescue" benefits? (i.e.: re-running failed jobs only).
Unfortunately, there is no way to do this. DAGMan can only re-run jobs at the granularity of a submit file. (Basically, DAGMan is just running a bunch of condor_submit commands, so it can't really do anything that you can't do by running condor_submit on the command line.)
I don't see functionality like this coming any time soon, either -- when I think about how to do it, even manually, it gets pretty difficult.
So I guess I'd say that this is a reason to *not* have DAGMan nodes consist of a large number of procs in a single cluster.
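One way to get close to "one big submit file" is to keep a single shared submit file and have each DAG node reference it with its own VARS line, so every job is its own node and a rescue DAG can skip the nodes that already succeeded. Here's a rough sketch of a generator for such a DAG file. The names worker.sub, job0, job1, ... and the macro "index" are just placeholders I made up; it assumes the submit file queues exactly one job and picks up its per-job settings through $(index).

```python
# Sketch: emit a DAG with one node per job, all sharing one submit file.
# "worker.sub", the node names, and the "index" macro are hypothetical.

def build_dag(submit_file, num_jobs):
    """Return DAG file text with one JOB/VARS pair per job."""
    lines = []
    for i in range(num_jobs):
        node = f"job{i}"
        # Every node points at the same submit file...
        lines.append(f"JOB {node} {submit_file}")
        # ...and passes its own value, readable as $(index) in that file.
        lines.append(f'VARS {node} index="{i}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_dag("worker.sub", 3), end="")
```

With this layout a failed job fails only its own node, so re-running the DAG after a failure only resubmits the nodes that did not finish.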
If there's some kind of hierarchical relationship between jobs that you want to preserve in your DAGs, you might consider using splices or sub-DAGs -- that would allow your top-level DAG to still be quite simple, but you'd get the full rescue capability of DAGMan.
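To make the splice idea a bit more concrete, here's a minimal sketch that writes a top-level DAG made only of SPLICE lines plus the dependencies between them. The splice names and file names (stage1.dag and so on) are invented for illustration; each of those lower-level DAG files would hold the per-job nodes. Sub-DAGs would look similar but use SUBDAG EXTERNAL lines instead of SPLICE.

```python
# Sketch: build a small top-level DAG out of splices.
# Splice names and DAG file names below are hypothetical.

def build_top_level_dag(splices, edges=()):
    """splices: (name, dag_file) pairs; edges: (parent, child) name pairs."""
    lines = [f"SPLICE {name} {dag_file}" for name, dag_file in splices]
    for parent, child in edges:
        # Dependencies are declared between splices just as between nodes.
        lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = build_top_level_dag(
        [("stage1", "stage1.dag"), ("stage2", "stage2.dag")],
        edges=[("stage1", "stage2")],
    )
    print(text, end="")
```

The top-level file stays a few lines long no matter how many jobs each stage contains.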
Kent Wenge
CHTC Team