On Sat, 28 Jun 2014, Miha Ahronovitz wrote:
So Nick says, I want to migrate my home-grown distributed environment
to HTCondor. As a new user he considers 3 options. Miron says use
DAGMan. Miha asks why. Miron says because it manages job dependencies.
Gabriel says DAGMan is the way to go, but he wonders "why, in case of
failure, one has to restart the workflow rather than retry the failed
jobs." Kent Wegner from the CHTC team clarifies and says, yes, we know
it is a problem, gives the link and has a name for it: this is issue
#2831.
Let me stop here. Nick seems to be an experienced sysadmin / engineer.
But HTCondor-list has 2,100 subscribers. How many of these subscribers
know about DAGMan? Maybe they search and read why, in case of failure,
they have to resubmit all jobs from the beginning?
Just to clarify, I was assuming (perhaps incorrectly) that Gabriel
was referring to the case where the user has to take some kind of
manual action to fix the problem with a job that failed, before
retrying that job.
If a job fails, but it may succeed on being retried without any
action from the user, the retry option in DAGMan can handle that
case. The retry option for nodes in DAGMan has existed for a long
time (10+ years, I think), so hopefully many people are aware of
that...
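
For anyone who has not used it, here is a minimal sketch of what the
retry option looks like in a DAG input file. The node names (A, B),
the submit file names, and the retry count of 3 are made up for
illustration; only the JOB, PARENT/CHILD, and RETRY keywords are
DAGMan's own.

```
# pipeline.dag (hypothetical file name)
# Two nodes, each described by its own (hypothetical) submit file.
JOB A a.sub
JOB B b.sub

# B runs only after A has completed successfully.
PARENT A CHILD B

# If node B fails, DAGMan re-runs it up to 3 more times
# before marking the node as failed.
RETRY B 3
```

A file like this would be submitted with "condor_submit_dag
pipeline.dag". The retry applies only to the node it names, so the
rest of the workflow is not restarted when B is retried.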
Kent