Re: [HTCondor-users] Application specific scheduler



On Sat, 28 Jun 2014, Gabriel Mateescu wrote:

> If there is something that may need improvement in DAGMan,
> it is that I do not understand why, in case of failure, one has
> to restart the workflow rather than retry the failed jobs, possibly
> on different execution nodes.

You're not the first person to ask for that capability:

  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2831,4
  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3403,4

I don't know exactly how #2831 will happen, but hopefully in the 8.3 series...

One question for #2831 is this: how does the user notify DAGMan that a particular failed node should be retried? (This assumes that the user has manually fixed whatever caused the node to fail. If you just want to retry nodes without any manual intervention, you can specify retries in the DAG, although getting the retry to land on a different machine than the previous try is tricky.)
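
For the no-intervention case, a minimal sketch of a DAG file with retries might look like the following. The node names and submit file names are made up for illustration, and the retry count of 3 is arbitrary. Note that this does nothing to steer a retry away from the machine where the previous try ran.

```
# Hypothetical two-node workflow; a.sub and b.sub are placeholder submit files.
JOB A a.sub
JOB B b.sub
PARENT A CHILD B

# If a node fails, DAGMan re-runs it up to 3 more times
# before marking the node as failed.
RETRY A 3
RETRY B 3
```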

Kent Wenger
CHTC Team