I'd really like the whole thing to be self-contained in one DAG like:
###
Job QueryDB querydb.job
Job Workers workers.job
Job PostProcess postprocess.job
PARENT QueryDB CHILD Workers
PARENT Workers CHILD PostProcess
###
since that seems much simpler, but I don't think it's doable because
the results of the QueryDB job determine both the data and the number
of worker jobs I'll need. For example, one run of QueryDB could return
2 million results, which I would split into 2000 data files of 1000
entries each, to be consumed by 2000 worker jobs. Another run might
produce only 1 data file and 1 worker. I can't think of a way to get
this all working within one DAG file. Right now, I pass each worker
the name of the data file to process as an argument.
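
To make the sizing problem concrete, here is a rough sketch of the step that runs after QueryDB. It is not my actual script. The function names, the data_N.txt naming scheme, the Worker<N> node names and the "datafile" variable are all made up for illustration. It only shows why the node count isn't known up front: the number of data files, and so the number of worker nodes, falls out of how many results the query returned.

```python
from pathlib import Path

CHUNK_SIZE = 1000  # entries per data file, matching the example above


def write_chunks(results, out_dir, chunk_size=CHUNK_SIZE):
    """Split results into data files of at most chunk_size entries each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(results), chunk_size):
        # Hypothetical naming scheme: data_0.txt, data_1.txt, ...
        path = out_dir / f"data_{len(paths)}.txt"
        path.write_text("\n".join(results[start:start + chunk_size]) + "\n")
        paths.append(path)
    return paths


def worker_dag_text(data_files, submit_file="workers.job"):
    """Build DAG lines with one worker node per data file.

    Each node gets its data file through a VARS line; the node names
    and the "datafile" variable name are assumptions for this sketch.
    """
    lines = []
    for i, path in enumerate(data_files):
        lines.append(f"JOB Worker{i} {submit_file}")
        lines.append(f'VARS Worker{i} datafile="{path}"')
    return "\n".join(lines) + "\n"
```

With something like this, workers.job would presumably pick up the file with a line such as "arguments = $(datafile)", which is roughly how I pass the data file to each worker today. The catch is the same as above: this text can only be generated once QueryDB has finished.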