Re: [Condor-users] automatically submit a dagman rescume dag after the original DAg is done?
- Date: Thu, 13 Jul 2006 13:53:52 -0500
- From: "Peter F. Couvares" <pfc@xxxxxxxxxxx>
- Subject: Re: [Condor-users] automatically submit a dagman rescume dag after the original DAg is done?
On Jul 12, 2006, at 11:21 PM, John Wheez wrote:

> Can anyone provide some pointers on how to get Dagman to auto
> submit the resulting rescue file??

Are you sure you don't mean "get DAGMan to re-start automatically
after a machine crash or shutdown"?
When DAGMan creates a rescue file, it's because it can make no
further progress due to a failed node, and human intervention is
necessary. However, until recently (6.7.19?) there was a DAGMan
submission bug which prevented DAGMan from being correctly re-started
by the Condor schedd after some types of machine crashes or
shutdowns. In short, if DAGMan itself was killed by a signal, Condor
happily recorded it as an abnormal termination and let DAGMan leave
the queue, as it would for any other job, instead of restarting it.
Now DAGMan will only leave the queue if it exits of its own accord.
This includes successful completion and "I can make no further
forward progress due to failed nodes", which is when a rescue file is
produced.
If a rescue file is being produced when a simple re-submission would
allow the DAG to finish, then it would be better to use the automatic
node RETRY feature inside the first DAG, and avoid the rescue file
generation in the first place.
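
As a rough sketch of what that might look like (the node names and
submit file names below are made up for illustration, and the retry
count of 3 is arbitrary), a DAG input file using RETRY could read:

```
# example.dag -- hypothetical two-node DAG
JOB A a.submit
JOB B b.submit
PARENT A CHILD B

# If a node fails, DAGMan re-runs it up to 3 more times before
# treating the node as failed.
RETRY A 3
RETRY B 3
```

With something like this, a node that fails for a transient reason
gets re-run within the same DAG, and a rescue file should only appear
if the node is still failing after its retries are used up.
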
I hope this helps...
-Peter
--
Peter Couvares              University of Wisconsin-Madison
Condor Project Research     Department of Computer Sciences
pfc@xxxxxxxxxxx             1210 W. Dayton St. Rm #4241
(608) 265-8936              Madison, WI 53706-1685