> I'm not sure I understand what you're doing -- but I'm not
> surprised it stopped working, as it's akin to brain surgery on a
> live, moving patient. :)
Yes, I did have some casualties while developing the process... ;)
Actually it does work. Just after I sent the mail I re-read the
config part of the 6.8.0 docs and found that DAGMAN_ALLOW_EVENTS=5
basically disables all of DAGMan's job accounting trickery.
With this setting the surgery works nicely and smoothly.
But since it is a real hack of a solution, let me describe my
problem in detail:
> If the issue is jobs which fail sometimes due to factors outside
> your control, but which succeed if re-submitted, then why not use
> DAGMan's RETRY feature?
The problem with the RETRY feature is that sometimes the output of
a job has to be checked by the user to decide whether the
calculation was successful or not.
Sometimes a small problem gets past the error-checking mechanisms
of the software, and it is the user who has to be able to re-submit
the job.
For example, because of a temporary network problem a machine can't
access a file or database. Or a DAGMan job that depends on
pre-calculated data is accidentally run just before the data is
completed by another job. (Real-world example right from this
day. :))
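(For the failures that *can* be detected automatically, a POST
script combined with RETRY does cover it; a sketch, with made-up
node and script names:

```
# DAGMan sketch -- node and script names are hypothetical.
JOB A a.sub
# The POST script inspects A's output; a non-zero exit marks A as failed
SCRIPT POST A check_output.sh
# On failure, re-run A up to 3 times
RETRY A 3
```

But when only a human can judge whether the output is good, this
doesn't help.)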
What I'd like to do is to be able to restart a completed job (that
was submitted by DAGMan) with all its parent (and optionally child)
dependencies restarted.
Let's say Job A generates some data that Job B uses and deletes
after it is completed. If I want to restart Job B, I need to run
Job A too (to generate the data), and only after Job A is completed
can Job B execute and run successfully.
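In DAG terms the dependency is simply (submit file names made up):

```
# Hypothetical submit files for the two nodes.
JOB A a.sub
JOB B b.sub
# B may only start once A has (re-)generated the data
PARENT A CHILD B
```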
Now the tricky part is this: if Job A calculates the data locally
and modifies the submit file of Job B to tell it where to look for
that data, then simply restarting Job B does not work, because the
job in the queue is no longer in sync with the submit file. So when
Job A is rerun, it should not only modify the submit file of Job B
(just for the record, since it's not resubmitted again) but also
modify the attributes of Job B in the queue.
This is what I'd like to achieve without hacking DAGMan's settings
with DAGMAN_ALLOW_EVENTS.
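(For the record, the condor_qedit part of the hack boils down to
something like this; the cluster id and attribute name are made up:

```
# Sync the queued Job B with its edited submit file -- hypothetical
# cluster id and attribute; ClassAd strings need the inner quotes.
condor_qedit 123.0 DataPath '"/scratch/jobA/output"'
```
)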
Cheers,
Szabolcs
P.S. I still don't understand why a condor_restart command for jobs
does not exist. To restart an already-completed job I have to use
condor_hold and condor_release every time, and sometimes this has a
side effect on Windows and the job goes into the removed state.
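(The workaround sequence, with a made-up job id:

```
# Put the already-completed job on hold, then release it so it is
# scheduled to run again -- job id is hypothetical.
condor_hold 123.0
condor_release 123.0
```
)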
Peter F. Couvares wrote:
> Horvátth,
> I'm not sure I understand what you're doing -- but I'm not
> surprised it stopped working, as it's akin to brain surgery on a
> live, moving patient. :)
> If the issue is jobs which fail sometimes due to factors outside
> your control, but which succeed if re-submitted, then why not use
> DAGMan's RETRY feature?
> If that's not sufficient, please describe the problem in a little
> more detail. I'm optimistic there's a better solution than using
> condor_qedit. DAGMan's underlying implementation is obviously
> subject to change, so relying on a script which circumvents the
> supported API & semantics is going to be fragile.
> -Peter