Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0
- Date: Thu, 17 Aug 2006 17:11:31 +0200
- From: Horvátth Szabolcs <szabolcs@xxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0
> I'm not sure I understand what you're doing -- but I'm not surprised
> it stopped working, as it's akin to brain surgery on a live, moving
> patient. :)
Yes, I did have some casualties while developing the process... ;)
Actually, it does work: just after I sent the mail I re-read the
configuration section of the 6.8.0 docs and found that setting
DAGMAN_ALLOW_EVENTS = 5 basically disables all of DAGMan's job-event
accounting trickery. With this setting the surgery works nicely and smoothly.
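For reference, this is just a knob in the Condor configuration file; per
the 6.8 manual the value is a bit mask that relaxes DAGMan's log-event
consistency checking (a sketch; the exact bit meanings are in the manual):

```shell
# condor_config.local -- a sketch; consult the 6.8 manual for the
# meaning of the individual bits of this mask before using it
DAGMAN_ALLOW_EVENTS = 5
```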
But since it is a real hack of a solution, let me describe my problem in detail:
> If the issue is jobs which fail sometimes due to factors outside your
> control, but which succeed if re-submitted, then why not use DAGMan's
> RETRY feature?
The problem with the RETRY feature is that sometimes the output of a job
has to be checked by the user to decide whether the calculation was
successful or not. Sometimes a small problem gets past the error-checking
mechanisms of the software, and it is the user who has to be able to
re-submit the job. For example, because of a temporary network problem a
machine can't access a file or database, or a DAGMan job that depends on
pre-calculated data is accidentally run just before the data is completed
by another job. (A real-world example right from this very day. :))
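(For anyone unfamiliar with the RETRY feature mentioned above: it is
declared per node in the DAG input file. A minimal sketch, with
hypothetical file names:)

```shell
# my.dag -- hypothetical DAG input file
JOB A a.submit
# Re-run node A up to 3 times if it exits with a non-zero status
RETRY A 3
```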
What I'd like is to be able to restart a completed job (that was
submitted by DAGMan) with all of its parent (and optionally child)
dependencies restarted. Let's say Job A generates some data that Job B
uses and then deletes once it has completed. If I want to restart Job B,
I need to run Job A too (to regenerate the data), and only after Job A
has completed can Job B execute and run successfully.
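The dependency above is just an ordinary DAG edge (a sketch, with
hypothetical submit-file names):

```shell
# ab.dag -- hypothetical DAG input file for the A -> B example
JOB A a.submit
JOB B b.submit
# B may only start (and consume A's data) after A has finished
PARENT A CHILD B
```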
Now the tricky part is this: if Job A calculates the data locally and
modifies the submit file of Job B to tell it where to look for that
data, then simply restarting Job B does not work, because the job in
the queue is no longer in sync with the submit file. So when Job A is
re-run, it should not only modify the submit file of Job B (just for
the record, since it is not resubmitted again) but also modify the
attributes of Job B in the queue.
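(This in-queue modification is what condor_qedit does; a hypothetical
sketch, where the job ID, attribute, and path are all placeholders:)

```shell
# Keep the queued Job B consistent with its rewritten submit file.
# 123.0, Args, and the path are placeholders for this example.
condor_qedit 123.0 Args '"--input /scratch/a-output.dat"'
```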
This is what I'd like to achieve without hacking DAGMan's settings with
DAGMAN_ALLOW_EVENTS.
Cheers,
Szabolcs
P.S. I still don't understand why a condor_restart command for jobs does
not exist. To restart an already-completed job I have to use
condor_hold and condor_release every time, and sometimes this has a side
effect on Windows and the job goes into the removed state.
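(The two-step hold-then-release dance, for reference; condor_release is
the command that lets a held job run again. A sketch with a placeholder
job ID:)

```shell
# Re-run a DAG node job by cycling it through the held state.
# 123.0 is a placeholder cluster.proc ID.
condor_hold 123.0
condor_release 123.0
```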
Peter F. Couvares wrote:
> Horvátth,
> I'm not sure I understand what you're doing -- but I'm not surprised
> it stopped working, as it's akin to brain surgery on a live, moving
> patient. :)
> If the issue is jobs which fail sometimes due to factors outside your
> control, but which succeed if re-submitted, then why not use DAGMan's
> RETRY feature?
> If that's not sufficient, please describe the problem in a little more
> detail. I'm optimistic there's a better solution than using
> condor_qedit. DAGMan's underlying implementation is obviously subject
> to change, so relying on a script which circumvents the supported API
> & semantics is going to be fragile.
> -Peter