Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0
- Date: Thu, 17 Aug 2006 19:58:24 -0500
- From: "Peter F. Couvares" <pfc@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0
On Aug 17, 2006, at 2:28 PM, Horvátth Szabolcs wrote:
> Peter F. Couvares wrote:
>> Better yet, the easiest thing to do now is just to specify a POST
>> script which, when it sees that the job failed, sleeps until it
>> sees some special file appear, and then uses that file's (integer)
>> contents as its own return code. Combined with a RETRY, this
>> would allow a human to decide whether the node should succeed or
>> retry (by writing a 0 or 1 to the special file, respectively), and
>> then let DAGMan do the rest.
>
> But it would require adding a sleeping POST script to all of the
> jobs in the DAG, because at job submission time I don't know which
> one will fail. So for a DAGMan job of 3000 jobs I'll get 3000
> additional POST scripts that wait for the user's input (which is
> only required for, let's say, 1 out of 3000).

No, the POST script should "pause" only if the job fails, and
otherwise propagate the job's successful return code. So successful
jobs in fact require no human intervention -- but failed jobs get the
benefit of human "confirmation", as you wish. For example, the POST
script could be as follows:
    #!/bin/sh
    # magic_pausing_POST_script.sh
    # Arguments (as passed from the DAG file): $1 = job return code, $2 = node name
    job_retval=$1
    node_name=$2
    if [ "$job_retval" -ne 0 ]; then
        echo "$node_name failed; waiting for human intervention"
        special_filename="please_continue.$node_name"
        # Poll once a minute until a human creates the special file.
        while [ ! -f "$special_filename" ]; do
            sleep 60
        done
        # The file holds a single integer: 0 = treat as success, 1 = retry.
        new_retval=$(cat "$special_filename")
        echo "$special_filename found! Horvátth decided that $node_name should have returned $new_retval"
        rm "$special_filename"
        exit "$new_retval"
    fi
    exit 0
As long as you specify your DAG like so:
JOB foo foo.sub
SCRIPT POST foo foo.sh $RETURN $JOB
RETRY foo 10
...then the POST script will "know" the job's return code, so it can
continue if the job succeeds and pause only if it fails -- and if it
pauses, a human gets to decide if it *really* failed and needs to be
retried or whether it succeeded, and DAGMan will take care of it.
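
On the human side, recording the decision only requires creating the special file with a 0 or a 1 in it. The following is a minimal sketch of an operator-side helper, not part of Condor or DAGMan: the function name resolve_node and its optional DIR argument are hypothetical, and it assumes the POST script is polling for please_continue.<node name> in that directory. It writes to a temporary name and then renames, so the waiting script should never read a half-written file.

```shell
#!/bin/sh
# resolve_node: hypothetical helper for the human operator (an assumption,
# not a Condor tool).
# Usage: resolve_node NODE_NAME VERDICT [DIR]
#   VERDICT 0 = treat the node as successful, 1 = fail it so RETRY kicks in
#   DIR     = directory the POST script is polling (default: current dir)
resolve_node() {
    node_name=$1
    verdict=$2
    dir=${3:-.}

    # Accept only the two values described in the original message.
    case $verdict in
        0|1) ;;
        *)
            echo "verdict must be 0 (succeed) or 1 (retry)" >&2
            return 2
            ;;
    esac

    # Write under a temporary name, then rename into place. The POST
    # script only tests for the final name, so it cannot see partial content.
    tmp="$dir/.please_continue.$node_name.$$"
    echo "$verdict" > "$tmp" && mv "$tmp" "$dir/please_continue.$node_name"
}
```

For example, after inspecting the output of a failed node foo, running `resolve_node foo 0` from the DAG's directory would let the node succeed, while `resolve_node foo 1` would make the POST script fail and leave the rest to RETRY.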
Again, this is a hack -- but it's a pretty simple and robust one as
far as hacks go, and should solve your immediate problem until DAGMan
has a runtime API for "brain surgery".
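
For a large DAG, such as the 3000-job case raised earlier in the thread, the three lines per node do not have to be written by hand. The sketch below generates them; emit_dag_entries is a hypothetical helper, and it assumes each node's submit file is named <node>.sub, following the foo / foo.sub example above. All nodes can share one POST script, since $RETURN and $JOB tell it which job it is looking at.

```shell
#!/bin/sh
# emit_dag_entries: hypothetical generator for DAG file entries.
# Usage: emit_dag_entries POST_SCRIPT RETRIES NODE [NODE ...]
# Assumption: the submit file for node N is N.sub.
emit_dag_entries() {
    post_script=$1
    retries=$2
    shift 2
    for node in "$@"; do
        printf 'JOB %s %s.sub\n' "$node" "$node"
        # Single quotes keep $RETURN and $JOB literal so that DAGMan,
        # not the shell, substitutes them.
        printf 'SCRIPT POST %s %s $RETURN $JOB\n' "$node" "$post_script"
        printf 'RETRY %s %s\n' "$node" "$retries"
    done
}
```

Calling `emit_dag_entries foo.sh 10 foo` prints the same three lines as the DAG fragment above; passing more node names appends one JOB / SCRIPT POST / RETRY group per node. Dependency lines between nodes would still have to be added separately.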
-Peter
--
Peter Couvares University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
pfc@xxxxxxxxxxx 1210 W. Dayton St. Rm #4241
(608) 265-8936 Madison, WI 53706-1685