So Nick says: I want to migrate my home-grown distributed environment to
HTCondor. As a new user, he considers 3 options. Miron says use DAGMan. Miha
asks why. Miron says because it manages job dependencies. Gabriel says
DAGMan is the way to go, but he wonders "why, in case of failure, one
has to restart the workflow rather than retry the failed jobs."
Kent Wegner from the CHTC team clarifies: yes, we know it is a problem.
He gives the link and has a name for it: this is issue #2831.
Let me stop here. Nick seems to be an experienced sysadmin / engineer. But
HTCondor-list has 2,100 subscribers. How many of these subscribers know
about DAGMan? Maybe they search and read about why, in case of failure, they
have to resubmit all jobs from the beginning?