On 05/30/2013 05:20 PM, Felix Wolfheimer wrote:
There's a bug in condor_starter (I'm using version 7.8.7) which affects
the execution of a HOOK_JOB_EXIT. The bug causes the starter to
terminate the hook immediately. Happens in my configuration where the
startd is configured to run only one job at a time but will probably
happen always if there's just one job running and this job terminates.
In this case the starter executes the function ShutdownGraceful in
condor_starter.V6.1/baseStarter.cpp.
The code piece
    if (!jobRunning) {
        dprintf(D_FULLDEBUG,
                "Got ShutdownGraceful when no jobs running.\n");
        this->allJobsDone();
        return 1;
    }
is erroneous as it reports that job termination AND hook termination have
happened when it returns 1. Returning 1 leads to immediate termination
of the condor_starter and kills all running hooks. The correct version
reads:
    if (!jobRunning) {
        dprintf(D_FULLDEBUG,
                "Got ShutdownGraceful when no jobs running.\n");
        return (this->allJobsDone());
    }
allJobsDone will return 0 if some hooks or other tasks are still running.
I applied the fix to my version of condor and can confirm that it works.
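To make the difference between the two versions concrete, here is a minimal toy model. It is not HTCondor source: ToyStarter, pendingHooks and the two shutdownGraceful* method names are hypothetical, and the return value of 0 while a job is still running is an assumption made only for illustration. The sketch only shows that ignoring the result of allJobsDone() reports "done" even when a hook is still pending.

```cpp
// Toy model of the return-value issue described above.
// NOT the real condor_starter code: ToyStarter, pendingHooks and the
// shutdownGraceful* names are hypothetical stand-ins.
struct ToyStarter {
    bool jobRunning = false;  // no job is running any more
    int pendingHooks = 0;     // e.g. a HOOK_JOB_EXIT that has not finished

    // Returns 1 only when nothing is left to wait for, 0 otherwise.
    int allJobsDone() const {
        return pendingHooks == 0 ? 1 : 0;
    }

    // Original behaviour: the result of allJobsDone() is discarded,
    // so the caller is always told that everything has finished.
    int shutdownGracefulBuggy() const {
        if (!jobRunning) {
            allJobsDone();
            return 1;
        }
        return 0;  // assumption: a job is still running, so not done yet
    }

    // Patched behaviour: the result of allJobsDone() is propagated,
    // so a pending hook keeps the caller waiting.
    int shutdownGracefulFixed() const {
        if (!jobRunning) {
            return allJobsDone();
        }
        return 0;  // assumption: a job is still running, so not done yet
    }
};
```

With pendingHooks set to 1 and jobRunning false, shutdownGracefulBuggy() returns 1 while shutdownGracefulFixed() returns 0. With no pending hooks, both return 1.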
Please open a ticket and attach your patch.
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktnew
Best,
matt