I need to account for failures myself anyway (and record them), so I'll
probably handle all failures and retries myself. Some of the files in a
job are expected to fail, so a single failed file can't be allowed to
fail the whole job, and retrying the whole job would retry all of its
files, which won't be terribly helpful. Because of the per-job overhead,
I can't (and don't want to) make each file a separate job, so grouping
them into bundles makes the most sense.
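To make that concrete, here's a rough sketch of the per-file bookkeeping I have in mind; the status file name and the helper functions are just placeholders, but the point is that only the files that actually failed get re-bundled for a retry instead of rerunning everything:
###
import json
from pathlib import Path

STATUS_FILE = Path("file_status.json")  # placeholder name for the per-file record

def load_status():
    return json.loads(STATUS_FILE.read_text()) if STATUS_FILE.exists() else {}

def record_result(filename, ok):
    # Record each file's outcome as it finishes.
    status = load_status()
    status[filename] = "done" if ok else "failed"
    STATUS_FILE.write_text(json.dumps(status, indent=2))

def files_to_retry():
    # Only the files that failed get bundled into the next attempt.
    return [f for f, state in load_status().items() if state == "failed"]
###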
I'd really like the whole thing to be self-contained in one DAG like:
###
Job QueryDB querydb.job
Job Workers workers.job
Job PostProcess postprocess.job
PARENT QueryDB CHILD Workers
PARENT Workers CHILD PostProcess
###
since that seems much simpler and self-contained, but I don't think
it's doable, because the results of the QueryDB job determine the data
and the number of worker jobs I'll need. For example, one run of
QueryDB could return 2 million results, from which I would create 2000
data files of 1000 entries each, to be consumed by 2000 worker jobs.
Another run might produce only 1 data file and 1 worker. I can't think
of a way to get this all working within one DAG file.
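Just to illustrate what I mean by the results determining the workers, this is roughly the generation step that would have to happen after QueryDB finishes; the chunk size, file names, and worker.sub submit file are only placeholders, and the sketch simply turns one run's results into N data files plus a DAG fragment with one worker node per file:
###
# Rough sketch: bundle query results into data files and emit one
# worker node per file in a generated DAG fragment.
def write_worker_dag(results, chunk_size=1000, dag_path="workers_generated.dag"):
    chunks = [results[i:i + chunk_size] for i in range(0, len(results), chunk_size)]
    with open(dag_path, "w") as dag:
        for n, chunk in enumerate(chunks):
            datafile = f"data_{n:04d}.txt"
            with open(datafile, "w") as f:
                f.write("\n".join(str(entry) for entry in chunk) + "\n")
            # worker.sub (placeholder) would use $(datafile) in its arguments line.
            dag.write(f"JOB Worker{n:04d} worker.sub\n")
            dag.write(f'VARS Worker{n:04d} datafile="{datafile}"\n')
    return len(chunks)
###
So one run might generate 2000 nodes and another run only 1, which is exactly what I don't know how to express in a single static DAG file.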
Right now, I pass each worker the datafile it should process as an
argument.
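The worker side is nothing fancy; a minimal sketch of it (process_entry() stands in for the real per-entry work):
###
import sys

def process_entry(entry):
    # Stand-in for whatever actually has to happen per entry.
    ...

def main():
    datafile = sys.argv[1]  # the data file passed in as the worker's argument
    with open(datafile) as f:
        for line in f:
            process_entry(line.strip())

if __name__ == "__main__":
    main()
###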