Peter F. Couvares wrote:
> Better yet, the easiest thing to do now is just to specify a POST
> script which, when it sees that the job failed, sleeps until it sees
> some special file appear, and then uses that file's (integer) contents
> as its own return code. Combined with a RETRY, this would allow a
> human to decide whether the node should succeed or retry (by writing a
> 0 or 1 to the special file, respectively), and then let DAGMan do the
> rest.
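For reference, here is a rough sketch of what I understand the suggested
POST script to be. It is only an illustration under my own assumptions:
the file name A.verdict, the script name wait_for_verdict.py and the
5 second polling interval are made up, and it assumes the DAG file
passes the job's exit code via $RETURN, e.g.
"SCRIPT POST A wait_for_verdict.py $RETURN A.verdict" plus "RETRY A 3".

```python
#!/usr/bin/env python3
"""Hypothetical DAGMan POST script: wait for a human verdict on a failed job."""
import os
import sys
import time


def wait_for_verdict(path, poll_seconds=5.0):
    """Block until `path` contains an integer, then return that integer."""
    while True:
        try:
            with open(path) as fh:
                return int(fh.read().strip())
        except (FileNotFoundError, ValueError):
            # File is missing, empty, or only partially written: keep waiting.
            time.sleep(poll_seconds)


def main(argv):
    job_status = int(argv[1])   # job's exit code, passed in via $RETURN
    verdict_file = argv[2]      # hypothetical path, e.g. A.verdict
    if job_status == 0:
        return 0                # job succeeded, no human input needed
    # 0 written by the human: node succeeds; 1: node fails and RETRY kicks in.
    verdict = wait_for_verdict(verdict_file)
    os.remove(verdict_file)     # make a later retry wait for a fresh verdict
    return verdict


# Run only when both arguments (exit code and verdict file) are supplied.
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv))
```

A human would then settle the node with something like
"echo 0 > A.verdict" (accept) or "echo 1 > A.verdict" (retry).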
But that would require adding a sleeping POST script to every job in
the DAG, because at submission time I don't know which one will fail.
So for a DAG of 3000 jobs I would get 3000 additional POST scripts that
wait for the user's input (which is only needed for, let's say, 1 job
out of 3000). As I said, in this case the failure cannot really be
analyzed by scripts.
> Obviously this too is a short-term hack until we can give you
> something better -- but it's MUCH simpler and more robust than your
> current approach.
OK, I think I did not make myself clear enough with my example. If I
had some automatic way to tell that a job failed (or probably failed)
without actually seeing the result "with my own eyes", I would have
used a simpler approach.
And because the execution of the child job depends on the data produced
by the parent, by the time the POST script finally runs I no longer
have any control over a previously completed DAG node. At least none
that I know of. So I can't say "OK, if Job B failed, please re-run
Job A, then reread your submit file, and then start again", because I
don't have that kind of control.
Actually, nothing can be controlled in DAGMan: you can't add a new job,
remove an existing one, create new dependencies, or mark a job as
completed. All user input is handled through tools that modify the
queue (hold, release, remove) and only indirectly affect DAGMan.