HTCondor Project List Archives




Re: [Condor-devel] starter wisdom: what happens when a job exits?




On Mar 25, 2008, at 11:51 AM, Derek Wright wrote:
If you're interested in this topic and have any questions

By request, here's a copy/paste of the text, in case it's hard for you to get to a fresh checkout from HEAD. ;)

Of course, this is just a snapshot, the copy in the source will be the canonical copy going forward, but the text is pasted here for your reading convenience.

Enjoy,
-Derek


excerpt from src/condor_starter.V6.1/WISDOM:
======================================================================
What happens when a job exits?  [last updated 2008-03-01]
======================================================================

The starter's logic for when a job exits is quite complex, involving a bunch of different inter-related pieces. It's probably difficult to get the picture just from reading individual function descriptions in the doc++ comments, since you really need to know how everything flows and works together.

The fundamental stages are: A) reaping and B) cleanup. The cleanup stage is further broken up into a number of steps (described below). First, we'll look at reaping.


--------------------
Reaping
--------------------

Since the starter is a "multi" starter, in the CStarter class there's a list of all the UserProc objects it ever spawned, called m_job_list. Any time the starter reaps a child process (via CStarter::Reaper()), it walks through the m_job_list of active UserProc objects and invokes the JobReaper() method on each one. This gives each UserProc a chance to take any actions once a given process exits. Normally, the only time a UserProc would take action is if the pid that just exited matched the pid of the UserProc, but in rare cases such as the ToolDaemonProtocol, a given UserProc might care that another one exited. If UserProc::JobReaper() returns true, it means that UserProc object is no longer active and CStarter::Reaper() moves the UserProc from the m_job_list to the m_reaped_job_list.

CStarter::Reaper() also has the logic to know if it should spawn additional processes after a given UserProc has exited (this is for pre/post script support).

Finally, if CStarter::Reaper() sees that there are no more active UserProc objects, it initiates the "cleanup" process...
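The reaping flow above can be sketched roughly as follows. This is only an illustration, not the actual starter code: ReapProc and ReapStarter are hypothetical stand-ins for UserProc and CStarter, the cleanup_started flag stands in for the call that kicks off cleanup, and the pre/post script spawning is left out entirely.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Hypothetical stand-in for UserProc (the real class does far more).
struct ReapProc {
    int pid;
    bool exited = false;
    explicit ReapProc(int p) : pid(p) {}

    // Returns true once this proc is no longer active. In this sketch a
    // proc only reacts to its own pid exiting.
    bool JobReaper(int exited_pid) {
        if (exited_pid == pid) {
            exited = true;
        }
        return exited;
    }
};

// Hypothetical stand-in for CStarter.
struct ReapStarter {
    std::list<ReapProc> m_job_list;         // still-active procs
    std::list<ReapProc> m_reaped_job_list;  // exited, awaiting cleanup
    bool cleanup_started = false;           // assumption: marks "cleanup began"

    void Reaper(int exited_pid) {
        for (auto it = m_job_list.begin(); it != m_job_list.end();) {
            auto next = std::next(it);
            if (it->JobReaper(exited_pid)) {
                // Move the proc to the reaped list without copying it.
                m_reaped_job_list.splice(m_reaped_job_list.end(),
                                         m_job_list, it);
            }
            it = next;
        }
        if (m_job_list.empty()) {
            cleanup_started = true;  // no active procs left
        }
    }
};
```

Every proc gets a look at every exited pid, which is what lets something like the ToolDaemonProtocol case react to a sibling exiting.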


--------------------
Cleanup
--------------------

Once all the UserProc objects have been reaped, the starter moves into the final stages of a job exiting. There are a few different steps to this part of the process:
1) [optional] Invoke HOOK_JOB_EXIT
2) [optional] Starter-driven output file transfer
3) Local cleanup/exit tasks:
--- write local userlog event
--- write final job classad to a local file
--- email notification for local universe jobs, etc.
4) Final notification to our controller that the job is gone:
--- RSC to the shadow
--- qmgmt to the schedd
5) Starter finally exits (phew)

Here's how the code paths work for these steps:

When CStarter::Reaper() sees the last UserProc gone, it invokes CStarter::allJobsDone() to begin the cleanup process [step 1].

CStarter::allJobsDone() invokes JIC::allJobsDone(). The JIC takes a few actions now that the last UserProc is gone (e.g. canceling the timer for periodic updates), and then decides to invoke HOOK_JOB_EXIT. If it invokes the hook, the JIC returns false to CStarter::allJobsDone() to halt progress on the cleanup until the hook exits. If there's no hook, the JIC returns true so that CStarter moves on.

The next step after allJobsDone() is transferOutput() [step 2]. So, whenever allJobsDone() is finally done (immediately if there's no HOOK_JOB_EXIT, or once that hook completes), CStarter::transferOutput() is invoked. Again, this just turns around and calls JIC::transferOutput(). Only JICShadow does anything at this stage, so in all other cases, JIC::transferOutput() immediately returns true. JICShadow::transferOutput() does the output file transfer (if needed given the job classad) in the foreground. Only if the transfer fails (e.g. transient network error and we're now disconnected) will JICShadow::transferOutput() return false. As with allJobsDone(), if JIC::transferOutput() returns true, the CStarter is ready to move on; otherwise, we stop the cleanup process and wait for external events (in this case, the startd giving up and killing us, or a shadow reconnect).
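Here's a minimal sketch of the "return true to move on, false to wait" chaining described so far. StageJIC and StageStarter are hypothetical stand-ins, the hook_configured/hook_done/transfer_ok flags are made-up knobs for illustration, and the trace vector exists only so the order of calls can be observed. It also assumes the whole chain is simply re-entered from allJobsDone() after a stall, which is a simplification.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical JIC stand-in: each stage returns true to continue
// and false to halt the cleanup until some external event arrives.
struct StageJIC {
    bool hook_configured = false;  // assumption: HOOK_JOB_EXIT is defined
    bool hook_done = false;        // assumption: set when the hook exits
    bool transfer_ok = true;       // assumption: result of file transfer

    bool allJobsDone() {
        // If a hook is pending, stall until it has exited.
        return !(hook_configured && !hook_done);
    }
    bool transferOutput() { return transfer_ok; }
};

// Hypothetical CStarter stand-in showing how the stages chain.
struct StageStarter {
    StageJIC jic;
    std::vector<std::string> trace;  // records which stages ran

    bool allJobsDone() {
        trace.push_back("allJobsDone");
        if (!jic.allJobsDone()) {
            return false;  // wait for the hook to finish
        }
        return transferOutput();
    }
    bool transferOutput() {
        trace.push_back("transferOutput");
        if (!jic.transferOutput()) {
            return false;  // wait for reconnect (or to be killed)
        }
        return cleanupJobs();
    }
    bool cleanupJobs() {
        trace.push_back("cleanupJobs");
        return true;
    }
};
```

The point of the shape is that a false from any JIC call stops the chain exactly where it failed, and nothing later runs until something calls back in.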

After transferOutput() comes CStarter::cleanupJobs(). This iterates over all the UserProc objects in the m_reaped_job_list and invokes UserProc::JobExit() on each one. Depending on what kind of UserProc it is, JobExit() will turn around and invoke JIC::notifyJobExit() passing in a pointer to the UserProc that exited. This is how all of the remaining steps are handled. For reference, here's a summary of what each kind of JIC does in its notifyJobExit() implementation:

JICShadow:
- write local userlog event
- send update classad to the startd (this should move higher in the JIC)
- invokes shadow RSC: REMOTE_CONDOR_job_exit()

JICLocal: (nothing special for JICLocalFile or JICLocalConfig)
- write local userlog event
- write output ad to local file

JICLocalSchedd:
- evaluates starter user job policy
- write local userlog event
- queue management update to the schedd to save final status
- email notification (if requested by the job -- normally handled by the shadow but we have to do it ourselves for local universe).
- write output ad to local file

Note that if there are any failures, JIC::notifyJobExit() will return false, which indicates to CStarter::cleanupJobs() that the UserProc wasn't safely cleaned up (due to schedd timeout, disconnected shadow, etc) and the CStarter will leave the UserProc in the m_reaped_job_list and wait for external events (a timer to retry the schedd update, a shadow reconnect, the startd giving up and hard-killing, etc). If JIC::notifyJobExit() returns true, UserProc::JobExit() returns true, which means the starter is truly done with that UserProc. At this point, the CStarter will delete the UserProc object and remove it from the m_reaped_job_list. Once m_reaped_job_list is empty, CStarter::cleanupJobs() calls JIC::allJobsGone(). In the case of JICLocal*, the starter finally exits at this stage. In the case of JICShadow, the starter waits around for the shadow to deactivate the claim (I guess on the theory that the shadow might decide to send it another job or something instead, but that never happens currently).
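The cleanupJobs() loop described above might look something like this. ExitProc and ExitStarter are hypothetical stand-ins, notify_ok fakes the result of JIC::notifyJobExit(), and all_jobs_gone stands in for the JIC::allJobsGone() call. Stopping at the first failure is an assumption of this sketch; the text only says a failed UserProc stays in the list.

```cpp
#include <cassert>
#include <list>

// Hypothetical stand-in for UserProc.
struct ExitProc {
    bool notify_ok;  // assumption: fakes JIC::notifyJobExit()'s result

    // True means the starter is truly done with this proc.
    bool JobExit() const { return notify_ok; }
};

// Hypothetical stand-in for CStarter.
struct ExitStarter {
    std::list<ExitProc> m_reaped_job_list;
    bool all_jobs_gone = false;  // stands in for JIC::allJobsGone()

    bool cleanupJobs() {
        auto it = m_reaped_job_list.begin();
        while (it != m_reaped_job_list.end()) {
            if (!it->JobExit()) {
                // Leave the proc in the list; an external event
                // (timer, reconnect) will bring us back here later.
                return false;
            }
            // Done with this proc: drop it from the list.
            it = m_reaped_job_list.erase(it);
        }
        all_jobs_gone = true;
        return true;
    }
};
```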


--------------------
Retrying steps
--------------------

In general, the starter attempts to retry any steps that fail, since it will never move on to another phase in the cleanup if one step fails. The basic approach here is that whenever the right external event comes in after an aborted step, the external event should always invoke CStarter::allJobsDone() to restart the process. Everything in the JIC that's invoked as part of this process remembers how far it got, and will immediately return success (to move on to the next step) if a certain cleanup task is already done. Since the return values are always propagated from each step, the CStarter will quickly hit the spot that needs to be retried, and continue on the process until completion or it hits another failure it can't recover from.

For example, if file transfer failed, the shadow reconnect handler calls CStarter::allJobsDone(). JobInfoCommunicator::allJobsDone() knows that stage was already completed, so it returns true. CStarter::allJobsDone() sees the true and moves on to invoke JIC::transferOutput() so we retry the output.

For another example, let's say it's a local universe job and the starter failed to do the qmgmt update, so it sets a timer to retry a few minutes later. The timer calls CStarter::allJobsDone(). JIC::allJobsDone() has nothing to do and returns true. CStarter then calls CStarter::transferOutput(), which calls JIC::transferOutput(). That immediately returns success for JICLocal*, so CStarter::transferOutput() turns around and calls CStarter::cleanupJobs(). This iterates over the UserProc objects left in the m_reaped_job_list, and finds the one that failed to talk to the schedd, invokes UserProc::JobExit() again, which calls JIC::notifyJobExit(). JIC::notifyJobExit() is smart about remembering what tasks it already completed (e.g. evaluating the starter user job policy, writing the local userlog event, etc) and sees it still hasn't successfully updated the schedd, so it tries the qmgmt operation again. Assuming that works, JIC::notifyJobExit() will finish its notification tasks by generating the job notification email (if needed), writing the final job classad to a local file (if configured), and finally returns true. Once JIC::notifyJobExit() returns true, UserProc::JobExit() propagates that, the CStarter removes the UserProc from m_reaped_job_list and deletes the UserProc object. Assuming that's the last UserProc in m_reaped_job_list, CStarter::cleanupJobs() will call JIC::allJobsGone() and the starter exits with the appropriate exit status to tell the schedd what to do with the job.
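The "remembers how far it got" behavior can be illustrated with a small sketch. RetryJIC is a hypothetical stand-in for a JICLocalSchedd-like object; the flag and counter names are made up for illustration, and schedd_reachable is just a knob to simulate the qmgmt update failing and then succeeding.

```cpp
#include <cassert>

// Hypothetical stand-in for a JIC whose notifyJobExit() can be retried.
struct RetryJIC {
    bool userlog_written = false;   // completed-task markers
    bool schedd_updated = false;
    bool schedd_reachable = false;  // assumption: simulates qmgmt outcome
    int userlog_writes = 0;         // counters so retries are observable
    int schedd_attempts = 0;

    bool notifyJobExit() {
        // Tasks already completed are skipped on a retry.
        if (!userlog_written) {
            ++userlog_writes;
            userlog_written = true;
        }
        if (!schedd_updated) {
            ++schedd_attempts;
            if (!schedd_reachable) {
                return false;  // caller sets a timer and retries later
            }
            schedd_updated = true;
        }
        return true;
    }
};
```

Calling notifyJobExit() again after a failure repeats only the schedd update, so the userlog event is written exactly once no matter how many retries it takes.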