Mailing List Archives
Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0
- Date: Thu, 17 Aug 2006 21:28:53 +0200
- From: Horvátth Szabolcs <szabolcs@xxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0
Peter F. Couvares wrote:
> Better yet, the easiest thing to do now is just to specify a POST
> script which, when it sees that the job failed, sleeps until it sees
> some special file appear, and then uses that file's (integer) contents
> as its own return code. Combined with a RETRY, this would allow a
> human to decide whether the node should succeed or retry (by writing a
> 0 or 1 to the special file, respectively), and then let DAGMan do the
> rest.
But that would require adding a sleeping POST script to every job in
the DAG, because at submission time I don't know which one will fail.
So for a DAGMan run of 3000 jobs I would get 3000 additional POST
scripts waiting for user input that is only needed for, let's say, 1
out of 3000. As I said, in this case the failure cannot really be
analyzed by scripts.
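
For reference, the POST script Peter describes could be sketched roughly as follows. This is only a minimal illustration, not code from this thread: the script name `wait_verdict.py`, the node name `A`, the verdict file name, and the polling interval are all hypothetical, and it assumes the DAG file passes the job's return value via DAGMan's `$RETURN` macro. Deleting the verdict file after reading it is a design choice of this sketch, so that a retried node asks for a fresh decision.

```python
#!/usr/bin/env python
# Hypothetical POST script: on job failure, wait for a human verdict.
#
# Assumed DAG file lines (names are illustrative only):
#   SCRIPT POST A wait_verdict.py $RETURN A.verdict
#   RETRY A 3
import os
import sys
import time


def wait_for_verdict(job_return, verdict_path, poll_seconds=30.0):
    """Return the exit code this POST script should report to DAGMan."""
    if job_return == 0:
        return 0  # the job itself succeeded, so nobody needs to be asked
    while True:
        try:
            with open(verdict_path) as f:
                verdict = int(f.read().strip())
        except (OSError, ValueError):
            # File not there yet, or it does not hold a full integer yet.
            time.sleep(poll_seconds)
            continue
        # Remove the file so that a retried node waits for a new verdict.
        os.remove(verdict_path)
        return verdict  # 0 = treat node as succeeded, nonzero = let RETRY run


# Only act when called with both arguments: <job return value> <verdict file>.
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(wait_for_verdict(int(sys.argv[1]), sys.argv[2]))
```

With this in place, a person inspecting a failed node would write `0` or `1` into the verdict file by hand. Note that the sketch does nothing to address the objection above: the `SCRIPT POST` line still has to be attached to every node in the DAG.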
> Obviously this too is a short-term hack until we can give you
> something better -- but it's MUCH simpler and more robust than your
> current approach.
OK, I think I did not make myself perfectly clear with my example. If
I had some automatic way to tell that a job failed (or probably failed)
without actually seeing the result `with me own eyes`, I would have
used a simpler approach.
And because the execution of the child job depends on the data produced
by the parent, by the time the POST script finally runs I no longer
have control over a previously completed DAG task. At least none that I
know of. So I can't say "OK, if Job B failed, please re-run Job A,
then reread your submit file, and then start again", because I don't
have that kind of control.
Actually, nothing can be controlled in DAGMan directly: you can't add a
new job, remove an existing one, create new dependencies, or mark a job
as completed. All user input goes through tools that modify the queue
(hold, release, remove) and only indirectly affect DAGMan.
Cheers,
Szabolcs