Re: [HTCondor-devel] [HTCondor-users] Bug in starter HOOK_JOB_EXIT terminated immediately


Date: Fri, 31 May 2013 06:46:31 -0400
From: Matthew Farrellee <matt@xxxxxxxxxx>
Subject: Re: [HTCondor-devel] [HTCondor-users] Bug in starter HOOK_JOB_EXIT terminated immediately
On 05/30/2013 05:20 PM, Felix Wolfheimer wrote:
There's a bug in condor_starter (I'm using version 7.8.7) which affects
the execution of a HOOK_JOB_EXIT. The bug causes the starter to
terminate the hook immediately. This happens in my configuration, where
the startd is configured to run only one job at a time, but it will
probably always happen if there's just one job running and this job
terminates.
In this case the starter executes the function ShutdownGraceful in
condor_starter.V6.1/baseStarter.cpp
The code piece
    if (!jobRunning) {
        dprintf(D_FULLDEBUG,
                "Got ShutdownGraceful when no jobs running.\n");
        this->allJobsDone();
        return 1;
    }
is erroneous because, by returning 1, it reports that both job
termination AND hook termination have happened, regardless of what
allJobsDone() returned. Returning 1 leads to immediate termination
of the condor_starter and kills all running hooks. The correct version
reads:
    if (!jobRunning) {
        dprintf(D_FULLDEBUG,
                "Got ShutdownGraceful when no jobs running.\n");
        return (this->allJobsDone());
    }
allJobsDone will return 0 if some hooks or other tasks are still running.
I applied the fix to my version of condor and can confirm that it works.
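
For readers who want to see the difference in isolation, here is a minimal
standalone sketch of the two variants. It is not the HTCondor source:
MockStarter, hookRunning, shutdownGracefulOriginal, shutdownGracefulPatched
and checkSketch are hypothetical names made up for illustration, and
allJobsDone() is reduced to the behaviour described above (0 while a hook
is still running, 1 otherwise). The value returned when a job is still
running is a placeholder, since the post does not cover that path.

```cpp
#include <cassert>

// Hypothetical stand-in for the starter; NOT HTCondor's actual class.
struct MockStarter {
    bool jobRunning = false;  // the only job has already terminated
    bool hookRunning = true;  // assumed: a HOOK_JOB_EXIT is still in flight

    // Simplified per the post: 0 while hooks or other tasks are still
    // running, 1 once everything has finished.
    int allJobsDone() const { return hookRunning ? 0 : 1; }

    // Original logic: the result of allJobsDone() is discarded and the
    // caller is always told that everything is done.
    int shutdownGracefulOriginal() const {
        if (!jobRunning) {
            (void)allJobsDone();
            return 1;
        }
        return 0;  // placeholder for the job-still-running path
    }

    // Patched logic: the result of allJobsDone() is passed to the caller.
    int shutdownGracefulPatched() const {
        if (!jobRunning) {
            return allJobsDone();
        }
        return 0;  // placeholder for the job-still-running path
    }
};

// Small self-check of the sketch.
inline void checkSketch() {
    MockStarter s;
    assert(s.shutdownGracefulOriginal() == 1);  // claims done, hook pending
    assert(s.shutdownGracefulPatched() == 0);   // correctly reports "not yet"
    s.hookRunning = false;
    assert(s.shutdownGracefulPatched() == 1);   // done once the hook finished
}
```

With a hook still pending, only the patched variant returns 0, which is
what would keep the caller from shutting down while the hook is running.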

Please open a ticket and attach your patch.

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktnew

Best,


matt
