
Re: [HTCondor-users] Non trivial way of using DAG



On 10/25/23 15:53, Matthew T West via HTCondor-users wrote:

Here is the workflow tool MetOffice built to run just this sort of thing. I imagine there is a way to do it in DAGMan (maybe) but I couldn't figure out how.


We can have certain kinds of looping structures with DAGMan, and we have many users with this pattern of "run a dag until some computed function converges". The trick is to have two levels, an outer dag and an inner dag. The inner dag does all the work, and the outer dag manages the control flow. Let's treat the inner dag as an opaque box for now: assume that it is defined in a file "work.dag", and that we don't care about its shape. Given this "work.dag", we can wrap it in an outer dag, "repeater.dag", that loops and looks like this:


repeater.dag:
------------------

SUBDAG EXTERNAL WORK work.dag
SCRIPT POST WORK workIsDone.sh
RETRY WORK 1000000

--------------------

If we condor_submit_dag repeater.dag, HTCondor will run the subdag work.dag to completion. When the last job in work.dag has finished, DAGMan runs the post script "workIsDone.sh" from repeater.dag. The post script runs on the access point, and thus has access to all the outputs of all the jobs in work.dag. If "workIsDone.sh" returns 0, DAGMan assumes convergence has happened, all is good, and exits. If "workIsDone.sh" returns non-zero, DAGMan assumes that something is wrong with node "WORK" and, because of the RETRY line, resubmits the whole work.dag. (We quietly assume that convergence will happen before 1000000 retries.)
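
To make the post script concrete, here is a minimal sketch of what "workIsDone.sh" could look like. It assumes that the jobs in work.dag leave a single number (say, a residual) on the first line of a file; the file name "residual.txt", the threshold, and the has_converged helper are all made up for illustration, so adapt them to whatever your jobs actually produce.

```shell
#!/bin/sh
# workIsDone.sh: hypothetical POST script for node WORK in repeater.dag.
# Assumption: the jobs in work.dag write one number (e.g. a residual) to
# the first line of RESIDUAL_FILE. File name and threshold are invented.

RESIDUAL_FILE="${RESIDUAL_FILE:-residual.txt}"
THRESHOLD="${THRESHOLD:-0.001}"

# Return 0 when the number on the first line of file $1 is below
# threshold $2. A missing or empty file counts as "not converged".
has_converged() {
    [ -r "$1" ] || return 1
    awk -v t="$2" 'NR == 1 { ok = ($1 + 0 < t + 0) } END { exit !ok }' "$1"
}

# Run the check only when invoked under the script's own name, so the
# function above can also be sourced and tried out by hand. The exit
# status of the check becomes the exit status of the script:
#   0        -> node WORK succeeds and repeater.dag finishes
#   non-zero -> node WORK fails and RETRY runs work.dag again
case "${0##*/}" in
    workIsDone.sh) has_converged "$RESIDUAL_FILE" "$THRESHOLD" ;;
esac
```

DAGMan only looks at the exit status, so the test itself can be anything: comparing two output files, grepping a log, or calling a small program that reads the job outputs.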


Now, that assumes that "work.dag" is static, and is known when we run "condor_submit_dag repeater.dag". Maybe this isn't the case. And if it isn't, we can just change repeater.dag to look like


repeater.dag:
------------------
SCRIPT PRE WORK makeWorkDag.sh
SUBDAG EXTERNAL WORK work.dag
SCRIPT POST WORK workIsDone.sh
RETRY WORK 1000000
--------------------


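In this case, before repeater.dag tries to run work.dag, it runs the pre script "makeWorkDag.sh", which can generate the "work.dag" file and any dependencies it needs. A retry re-runs the whole node, pre script included, so work.dag can be regenerated on every pass through the loop.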
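
As a rough sketch of such a pre script, the version below writes a one-node work.dag and passes the current iteration number to the job. The submit file "step.sub", the counter file "iteration.txt", the node name STEP, and both helper functions are hypothetical; a real work.dag would of course contain whatever nodes your workflow needs.

```shell
#!/bin/sh
# makeWorkDag.sh: hypothetical PRE script for node WORK in repeater.dag.
# Assumptions: step.sub is an existing submit description file, and we
# keep our own iteration counter in COUNTER_FILE.

COUNTER_FILE="${COUNTER_FILE:-iteration.txt}"

# Increment the counter stored in file $1 (starting from 0 if the file
# does not exist yet), save it back, and print the new value.
next_iteration() {
    n=0
    [ -r "$1" ] && n=$(cat "$1")
    n=$((n + 1))
    echo "$n" > "$1"
    echo "$n"
}

# Write a one-node DAG for iteration $1 to file $2. The VARS line makes
# $(iteration) available inside step.sub.
write_work_dag() {
    cat > "$2" <<EOF
JOB STEP step.sub
VARS STEP iteration="$1"
EOF
}

# Generate work.dag only when invoked under the script's own name, so
# the functions above can also be sourced and tried out by hand.
case "${0##*/}" in
    makeWorkDag.sh) write_work_dag "$(next_iteration "$COUNTER_FILE")" work.dag ;;
esac
```

Keeping the counter in a file is just one option; the point is only that the pre script is free to write a different work.dag each time around.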


I think this should do what you need,


-greg