Re: [HTCondor-users] Non trivial way of using DAG
- Date: Wed, 25 Oct 2023 16:54:43 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Non trivial way of using DAG
On 10/25/23 15:53, Matthew T West via HTCondor-users wrote:
> Here is the workflow tool MetOffice built to run just this sort of
> thing. I imagine there is a way to do it in DAGMan (maybe) but I
> couldn't figure out how.
DAGMan supports certain kinds of looping structures, and we have
many users with this pattern of "run a DAG until some computed function
converges". The trick is to have two levels, an outer DAG and an inner
DAG. The inner DAG is the one that does all the work, and the outer one
manages the control flow. Let's treat the inner DAG as an opaque box
for now: assume that it is defined in a file "work.dag", and that we
don't care about its shape. Given this "work.dag", we can
wrap it in an outer DAG, "repeater.dag", that loops and looks like
this:
repeater.dag:
------------------
SUBDAG EXTERNAL WORK work.dag
SCRIPT POST WORK workIsDone.sh
RETRY WORK 1000000
--------------------
If we run "condor_submit_dag repeater.dag", HTCondor will run the subdag
work.dag to completion. When the last job in work.dag has finished,
DAGMan will run the post script "workIsDone.sh" from repeater.dag.
The script runs on the access point, and thus has access to all the outputs
of all the jobs in work.dag. If "workIsDone.sh" returns 0,
DAGMan treats node "WORK" as successful (convergence has happened), all is
good, and it exits. If "workIsDone.sh" returns non-zero, DAGMan assumes
that something is wrong with node "WORK" and, because of the RETRY line,
runs the whole of work.dag again. (We quietly assume that convergence
will happen before 1000000 retries.)
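As a rough illustration, here is a minimal sketch of what "workIsDone.sh" could look like. It assumes (my invention, not something HTCondor provides) that the last job in work.dag writes a single number, the current error, to a file called "residual.txt", and that anything below 0.001 counts as converged. Both the file name and the threshold are hypothetical; only the exit-code convention comes from the description above.

```shell
#!/bin/sh
# workIsDone.sh: hypothetical POST script for node WORK.
# Exit 0        => converged, DAGMan treats WORK as successful.
# Exit non-zero => not converged, DAGMan retries WORK.

# is_converged FILE THRESHOLD
# Assumption: FILE holds one number (the current error) on its first line.
is_converged() {
    residual_file=$1
    threshold=$2
    # A missing or unreadable file counts as "not converged".
    [ -r "$residual_file" ] || return 1
    # awk does the floating-point comparison; an empty file leaves ok unset.
    awk -v t="$threshold" \
        'NR == 1 { ok = ($1 + 0 < t + 0) } END { exit (ok ? 0 : 1) }' \
        "$residual_file"
}

# Run the check only when this file is invoked under its script name.
case "$0" in
    *workIsDone.sh) is_converged residual.txt 0.001; exit $? ;;
esac
```

Whatever the real convergence test is, the only contract with DAGMan is the exit status of the script.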
Now, that assumes that "work.dag" is static and is known when we run
"condor_submit_dag repeater.dag". Maybe this isn't the case. If it
isn't, we can just change repeater.dag to look like this:
repeater.dag:
------------------
SCRIPT PRE WORK makeWorkDag.sh
SUBDAG EXTERNAL WORK work.dag
SCRIPT POST WORK workIsDone.sh
RETRY WORK 1000000
--------------------
In this case, before DAGMan tries to run work.dag, it first runs
the pre script "makeWorkDag.sh", which can generate the "work.dag" file
and anything else it depends on.
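To make that concrete, here is a minimal sketch of a "makeWorkDag.sh" that writes a fan-in shaped work.dag: N independent nodes followed by one node that collects their results. The submit files "step.sub" and "collect.sub", the node names, and the "index" variable are all hypothetical placeholders; your real inner DAG can have any shape.

```shell
#!/bin/sh
# makeWorkDag.sh: hypothetical PRE script that writes work.dag.

# write_work_dag N OUTFILE
# Emits N parallel STEP nodes, all parents of a single COLLECT node.
# Assumption: step.sub and collect.sub are submit files you already have.
write_work_dag() {
    n=$1
    out=$2
    {
        i=0
        parents=""
        while [ "$i" -lt "$n" ]; do
            echo "JOB STEP$i step.sub"
            # Pass the node's index to step.sub as $(index).
            echo "VARS STEP$i index=\"$i\""
            parents="$parents STEP$i"
            i=$((i + 1))
        done
        echo "JOB COLLECT collect.sub"
        # Only add the dependency line when there is at least one STEP node.
        if [ -n "$parents" ]; then
            echo "PARENT$parents CHILD COLLECT"
        fi
    } > "$out"
}

# Run the generator only when this file is invoked under its script name.
case "$0" in
    *makeWorkDag.sh) write_work_dag "${1:-4}" work.dag; exit $? ;;
esac
```

With N = 2 this produces a six-line work.dag whose last line is "PARENT STEP0 STEP1 CHILD COLLECT". A real generator would presumably look at the outputs of the previous pass to decide what to emit.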
I think this should do what you need,
-greg