HTCondor Project List Archives


Re: [Condor-devel] starter wisdom: what happens when a job exits?

Date: Tue, 25 Mar 2008 14:07:03 -0700
From: Derek Wright <wright@xxxxxxxxxxx>
Subject: Re: [Condor-devel] starter wisdom: what happens when a job exits?


On Mar 25, 2008, at 11:51 AM, Derek Wright wrote:

If you're interested in this topic and have any questions

By request, here's a copy/paste of the text, in case it's hard for you to get to a fresh checkout from HEAD. ;)

Of course, this is just a snapshot; the copy in the source will be the canonical copy going forward, but the text is pasted here for your reading convenience.


Enjoy,
-Derek


excerpt from src/condor_starter.V6.1/WISDOM:
======================================================================
What happens when a job exits?  [last updated 2008-03-01]
======================================================================

The starter's logic for when a job exits is quite complex, involving a bunch of different inter-related pieces. It's probably difficult to get the picture just from reading individual function descriptions in the doc++ comments, since you really need to know how everything flows and works together.

The fundamental stages are: A) reaping and B) cleanup. The cleanup stage is further broken up into a number of steps (described below). First, we'll look at reaping.



--------------------
Reaping
--------------------

Since the starter is a "multi" starter, in the CStarter class there's a list of all the UserProc objects it ever spawned, called m_job_list. Any time the starter reaps a child process (via CStarter::Reaper()), it walks through the m_job_list of active UserProc objects and invokes the JobReaper() method on each one. This gives each UserProc a chance to take any actions once a given process exits. Normally, the only time a UserProc would take action is if the pid that just exited matched the pid of the UserProc, but in rare cases such as the ToolDaemonProtocol, a given UserProc might care that another one exited. If UserProc::JobReaper() returns true, it means that UserProc object is no longer active and CStarter::Reaper() moves the UserProc from the m_job_list to the m_reaped_job_list.

CStarter::Reaper() also has the logic to know if it should spawn additional processes after a given UserProc has exited (this is for pre/post script support).

Finally, if CStarter::Reaper() sees that there are no more active UserProc objects, it initiates the "cleanup" process...
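
To make the list shuffling concrete, here is a toy model of the reaping stage in Python. It is only a sketch of the flow described above, not the starter's C++ code: the class and method names (Starter, UserProc, job_reaper, and so on) are made-up stand-ins for CStarter, UserProc and JobReaper(), and the pre/post script spawning is left out.

```python
class UserProc:
    """Toy stand-in for a starter UserProc; it only tracks a pid."""

    def __init__(self, pid):
        self.pid = pid

    def job_reaper(self, exited_pid):
        # Return True when this UserProc is no longer active.
        # In this toy that only happens when our own pid exited.
        return exited_pid == self.pid


class Starter:
    """Toy stand-in for CStarter, holding the two lists described above."""

    def __init__(self, procs):
        self.job_list = list(procs)   # models m_job_list (still active)
        self.reaped_job_list = []     # models m_reaped_job_list
        self.cleanup_started = False

    def reaper(self, exited_pid):
        # Give every active UserProc a chance to react to the exit.
        still_active = []
        for proc in self.job_list:
            if proc.job_reaper(exited_pid):
                self.reaped_job_list.append(proc)
            else:
                still_active.append(proc)
        self.job_list = still_active
        # Nothing active any more: kick off the cleanup stage.
        if not self.job_list:
            self.all_jobs_done()

    def all_jobs_done(self):
        # Placeholder for the start of the cleanup stage.
        self.cleanup_started = True


starter = Starter([UserProc(101), UserProc(102)])
starter.reaper(101)   # one proc moves to the reaped list, cleanup not started
starter.reaper(102)   # last proc reaped, cleanup begins
```

The point of the two lists is that a reaped UserProc is not thrown away immediately; it has to stick around until the cleanup stage has dealt with it.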



--------------------
Cleanup
--------------------

Once all the UserProc objects have been reaped, the starter moves into the final stages of a job exiting. There are a few different steps to this part of the process:

1) [optional] Invoke HOOK_JOB_EXIT
2) [optional] Starter-driven output file transfer
3) Local cleanup/exit tasks:
   - write local userlog event
   - write final job classad to a local file
   - email notification for local universe jobs, etc.
4) Final notification to our controller that the job is gone:
   - RSC to the shadow
   - qmgmt to the schedd
5) Starter finally exits (phew)

Here's how the code paths work for these steps:

When CStarter::Reaper() sees the last UserProc gone, it invokes CStarter::allJobsDone() to begin the cleanup process [step 1].

CStarter::allJobsDone() invokes JIC::allJobsDone(). The JIC takes a few actions now that the last UserProc is gone (e.g. canceling the timer for periodic updates), and then decides whether to invoke HOOK_JOB_EXIT. If it invokes the hook, the JIC returns false to CStarter::allJobsDone() to halt progress on the cleanup until the hook exits. If there's no hook, the JIC returns true so that CStarter moves on.

The next step after allJobsDone() is transferOutput() [step 2]. So, whenever allJobsDone() is finally done (immediately if there's no HOOK_JOB_EXIT, or once that hook completes), CStarter::transferOutput() is invoked. Again, this just turns around and calls JIC::transferOutput(). Only JICShadow does anything at this stage, so in all other cases, JIC::transferOutput() immediately returns true. JICShadow::transferOutput() does the output file transfer (if needed given the job classad) in the foreground. Only if the transfer fails (e.g. transient network error and we're now disconnected) will JICShadow::transferOutput() return false. As with allJobsDone(), if JIC::transferOutput() returns true, the CStarter is ready to move on; otherwise, we stop the cleanup process and wait for external events (in this case, the startd giving up and killing us, or a shadow reconnect).
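
The "return true to continue, return false to halt" convention is easier to see in a small model. The Python below is a hedged sketch only: ToyJIC, ToyStarter and their attributes (has_exit_hook, hook_done, transfer_ok) are invented for illustration, and chaining each step directly into the next is an assumption about structure, not a copy of the real CStarter code.

```python
class ToyJIC:
    """Hypothetical JIC: each step returns True to continue, False to halt."""

    def __init__(self, has_exit_hook=False, transfer_ok=True):
        self.has_exit_hook = has_exit_hook
        self.hook_done = False
        self.transfer_ok = transfer_ok
        self.transfer_done = False

    def all_jobs_done(self):
        if self.has_exit_hook and not self.hook_done:
            return False          # hook is running; wait for it to exit
        return True

    def transfer_output(self):
        if self.transfer_done:
            return True           # already finished earlier
        if not self.transfer_ok:
            return False          # e.g. disconnected from the shadow
        self.transfer_done = True
        return True


class ToyStarter:
    """Hypothetical CStarter: walks the steps until one returns False."""

    def __init__(self, jic):
        self.jic = jic
        self.reached = []         # records which steps were entered

    def all_jobs_done(self):
        self.reached.append("allJobsDone")
        if not self.jic.all_jobs_done():
            return False
        return self.transfer_output()

    def transfer_output(self):
        self.reached.append("transferOutput")
        if not self.jic.transfer_output():
            return False
        return self.cleanup_jobs()

    def cleanup_jobs(self):
        self.reached.append("cleanupJobs")
        return True


# A job with an exit hook: cleanup halts until the hook has exited.
jic = ToyJIC(has_exit_hook=True)
starter = ToyStarter(jic)
halted = starter.all_jobs_done()          # False: waiting on the hook
reached_after_halt = list(starter.reached)
jic.hook_done = True                      # the hook exits...
resumed = starter.transfer_output()       # ...and the next step is invoked
```

With transfer_ok=False the same chain stops inside transfer_output() instead, which models the disconnected-shadow case: nothing further happens until some external event restarts the process.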

After transferOutput() comes CStarter::cleanupJobs(). This iterates over all the UserProc objects in the m_reaped_job_list and invokes UserProc::JobExit() on each one. Depending on what kind of UserProc it is, JobExit() will turn around and invoke JIC::notifyJobExit() passing in a pointer to the UserProc that exited. This is how all of the remaining steps are handled. For reference, here's a summary of what each kind of JIC does in its notifyJobExit() implementation:

JICShadow:
- write local userlog event
- send update classad to the startd (this should move higher in the JIC)
- invokes shadow RSC: REMOTE_CONDOR_job_exit()

JICLocal: (nothing special for JICLocalFile or JICLocalConfig)
- write local userlog event
- write output ad to local file

JICLocalSchedd:
- evaluates starter user job policy
- write local userlog event
- queue management update to the schedd to save final status
- email notification (if requested by the job -- normally handled by the shadow but we have to do it ourselves for local universe).
- write output ad to local file

Note that if there are any failures, JIC::notifyJobExit() will return false, which indicates to CStarter::cleanupJobs() that the UserProc wasn't safely cleaned up (due to schedd timeout, disconnected shadow, etc) and the CStarter will leave the UserProc in the m_reaped_job_list and wait for external events (a timer to retry the schedd update, a shadow reconnect, the startd giving up and hard-killing, etc). If JIC::notifyJobExit() returns true, UserProc::JobExit() returns true, which means the starter is truly done with that UserProc. At this point, the CStarter will delete the UserProc object and remove it from the m_reaped_job_list. Once m_reaped_job_list is empty, CStarter::cleanupJobs() calls JIC::allJobsGone(). In the case of JICLocal*, the starter finally exits at this stage. In the case of JICShadow, the starter waits around for the shadow to deactivate the claim (I guess on the theory that the shadow might decide to send it another job or something instead, but that never happens currently).
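
As a rough illustration of that bookkeeping, here is a toy cleanupJobs() in Python. The names (ToyUserProc, notify_ok, the all_jobs_gone callback) are invented, and whether the real loop keeps going past a failed UserProc or stops at the first one is not stated above; this sketch assumes it tries each one.

```python
class ToyUserProc:
    """Hypothetical reaped UserProc whose exit notification may fail."""

    def __init__(self, name, notify_ok=True):
        self.name = name
        self.notify_ok = notify_ok   # does the simulated notifyJobExit() succeed?

    def job_exit(self):
        # In the starter this would call JIC::notifyJobExit() and
        # propagate its return value.
        return self.notify_ok


def cleanup_jobs(reaped_job_list, all_jobs_gone):
    """Finish each reaped UserProc; keep the ones that failed for a retry."""
    remaining = [proc for proc in reaped_job_list if not proc.job_exit()]
    reaped_job_list[:] = remaining   # drop the procs we are truly done with
    if not reaped_job_list:
        all_jobs_gone()              # models JIC::allJobsGone()
        return True
    return False                     # wait for an external event


gone = []
reaped = [ToyUserProc("pre"), ToyUserProc("main", notify_ok=False)]
first = cleanup_jobs(reaped, lambda: gone.append(True))
left_after_first = [proc.name for proc in reaped]   # only "main" is left
reaped[0].notify_ok = True                          # e.g. the shadow reconnected
second = cleanup_jobs(reaped, lambda: gone.append(True))
```

The failed UserProc simply stays in the list, so the next pass through cleanup_jobs() finds it again without any extra state.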



--------------------
Retrying steps
--------------------

In general, the starter attempts to retry any steps that fail, since it will never move on to another phase in the cleanup if one step fails. The basic approach here is that whenever the right external event comes in after an aborted step, the external event should always invoke CStarter::allJobsDone() to restart the process. Everything in the JIC that's invoked as part of this process remembers how far it got, and will immediately return success (to move on to the next step) if a certain cleanup task is already done. Since the return values are always propagated from each step, the CStarter will quickly hit the spot that needs to be retried, and continue the process until it completes or hits another failure it can't recover from.

For example, if file transfer failed, the shadow reconnect handler calls CStarter::allJobsDone(). JobInfoCommunicator::allJobsDone() knows that stage was already completed, so it returns true. CStarter::allJobsDone() sees the true and moves on to invoke JIC::transferOutput() so we retry the output.

For another example, let's say it's a local universe job and the starter failed to do the qmgmt update, so it sets a timer to retry a few minutes later. The timer calls CStarter::allJobsDone(). JIC::allJobsDone() has nothing to do and returns true. CStarter then calls CStarter::transferOutput(), which calls JIC::transferOutput(). That immediately returns success for JICLocal*, so CStarter turns around and calls CStarter::cleanupJobs(). This iterates over the UserProc objects left in the m_reaped_job_list, and finds the one that failed to talk to the schedd, invokes UserProc::JobExit() again, which calls JIC::notifyJobExit(). JIC::notifyJobExit() is smart about remembering what tasks it already completed (e.g. evaluating the starter user job policy, writing the local userlog event, etc) and sees it still hasn't successfully updated the schedd, so it tries the qmgmt operation again. Assuming that works, JIC::notifyJobExit() will finish its notification tasks by generating the job notification email (if needed), writing the final job classad to a local file (if configured), and finally returns true. Once JIC::notifyJobExit() returns true, UserProc::JobExit() propagates that, the CStarter removes the UserProc from m_reaped_job_list and deletes the UserProc object. Assuming that's the last UserProc in m_reaped_job_list, CStarter::cleanupJobs() will call JIC::allJobsGone() and the starter exits with the appropriate exit status to tell the schedd what to do with the job.
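
The "remembers how far it got" part of that example can be sketched as follows. This is a toy in Python, not the JICLocalSchedd code: the class name, the task names and the schedd_failures knob are all made up, and the task order just follows the JICLocalSchedd list given earlier.

```python
class ToyJICLocalSchedd:
    """Hypothetical JIC whose notify_job_exit() can be safely re-invoked."""

    TASKS = ["evaluate_policy", "write_userlog", "update_schedd",
             "send_email", "write_output_ad"]

    def __init__(self, schedd_failures=1):
        self.completed = []                        # tasks finished, in order
        self.attempts = {task: 0 for task in self.TASKS}
        self.schedd_failures = schedd_failures     # simulated qmgmt failures

    def _run(self, task):
        self.attempts[task] += 1
        if task == "update_schedd" and self.schedd_failures > 0:
            self.schedd_failures -= 1
            return False                           # schedd timed out
        return True

    def notify_job_exit(self):
        for task in self.TASKS:
            if task in self.completed:
                continue                           # done on an earlier try
            if not self._run(task):
                return False                       # caller must retry later
            self.completed.append(task)
        return True


jic = ToyJICLocalSchedd(schedd_failures=1)
first = jic.notify_job_exit()     # fails at the schedd update
done_after_first = list(jic.completed)
second = jic.notify_job_exit()    # e.g. driven by the retry timer
```

On the second call the policy evaluation and userlog write are skipped, so each of those still has exactly one attempt recorded, while the schedd update has two. That is what makes it safe for every external event to just restart the whole process from the top.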
