
Re: [HTCondor-users] HTCondor diagram of daemons



Hi Greg & Todd,

The startd failure makes perfect sense and after doing some reading of my own, I better understand daemons and child processes.

For a schedd failure, I am still unsure. The job may finish its work and produce the appropriate exit code. But since the schedd is down, there is no shadow process to verify on the submit side that the output has arrived properly. Therefore results never make it back to the submit machine. Then, following your response, if the schedd doesn't come back, the starter initiates clean-up. Do I have this right? If so, this definitely makes the case for making submit machines robust and not over-stressing them.

When the schedd restarts, how does that daemon regenerate shadows from the previous schedd instance that failed? Does the starter ask the startd to go find its shadow's parent schedd?

--------

I don't expect, and kind of don't want, regular submitters to care about the guts of the system. I am asking these sorts of questions because I am moving from an RSE to a SysAdmin role here at Exeter, and these are the bits of lore that aren't easy to fit in online docs. Also, if I want to convince my line manager to use HTCondor on our next system, I have to be able to answer technical questions from folks who aren't familiar with the tool suite.

Cheers,

Matt

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Greg Thain
Sent: 06 April 2022 05:52 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] HTCondor diagram of daemons

 


 

On 4/5/22 16:36, West, Matthew wrote:

Because I will be talking to RSEs who might be skeptical that the extra process steps have tangible benefits, I'd like to be able to explain some of the robustness features enabled by this design.

 

  1. If the Schedd goes down, what happens to the work on execute machines when it finishes? Would the Shadow still be running so the job output would be transferred?
  2. Similarly, if the Startd stops, would work carry on as normal for the jobs currently running?

 

I will definitely test out these and other daemon questions on my local minicondor instance, but I figured I'd ask on these two first.

 

Matthew:

I'm really enjoying these questions, as they get at the heart of what HTCondor does. One of the goals of HTCondor is to reliably run workflows of jobs to completion in the presence of networks, operating systems and machines that appear to be out to get us. At first, one might think that this means that HTCondor should take extreme measures to keep running any job that it has started, but it turns out that this isn't quite our prime directive. The more important property to maintain is that, in the presence of errors and crashes, we leave the machine in a state where we can continue to operate after a reboot or restart. So, we seek to "manage" everything we create, so that we can measure its usage and clean up after it when needed. We try to never have a job or process that is running without supervision.

 

So, to maintain these properties: the startd manages the starter, and the starter manages the job. If the startd goes away, the starter notices, kills the job, cleans up and exits, leaving a "clean" machine behind it.
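
For explaining this to others, the execute-side behaviour described above can be sketched as a toy model. This is only an illustration of the idea, not HTCondor code: the `Starter` class, its attributes and the `check_parent` method are hypothetical names, and how the real starter detects that the startd is gone is not modelled here.

```python
from typing import List


class Starter:
    """Toy stand-in for a starter that supervises exactly one job.

    Hypothetical model for illustration; not the HTCondor implementation.
    """

    def __init__(self) -> None:
        self.running = True          # the starter process itself
        self.job_running = True      # the job it supervises
        self.scratch_present = True  # files the job left on the machine

    def check_parent(self, startd_alive: bool) -> List[str]:
        """Return the actions taken after checking on the parent startd."""
        actions: List[str] = []
        if startd_alive or not self.running:
            # Parent is fine, or we have already shut down: nothing to do.
            return actions
        # Parent is gone: never leave an unsupervised job behind.
        self.job_running = False
        actions.append("kill job")
        self.scratch_present = False
        actions.append("clean up")
        self.running = False
        actions.append("exit")
        return actions
```

Calling `check_parent(True)` returns an empty list, while the first `check_parent(False)` returns the kill, clean up, exit sequence in that order, which is the "clean machine" property in miniature.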

In a similar way, on the submit side, the schedd manages shadows.  If the schedd dies, the shadows exit, but because the starter is managing the job, the job continues running.  Now, if the schedd never comes back, we don't want the job to run forever, so in this case, there is a lease, a timeout whereby the starter will only continue to run the job for some fixed amount of time before it gives up hope that the schedd is returning.  At that point, it kills the job.
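
The lease idea above can also be sketched as a small toy model. Again this is an illustration under stated assumptions, not HTCondor code: `StarterLease`, `tick` and the status strings are hypothetical, the lease length is whatever the caller passes in, and the assumption that regaining contact resets the clock is mine rather than something stated above.

```python
from typing import Optional


class StarterLease:
    """Toy model of a starter that tolerates losing its submit side for a while.

    Hypothetical names and behaviour; not the HTCondor implementation.
    """

    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        self.lost_contact_at: Optional[float] = None
        self.job_running = True

    def tick(self, now: float, shadow_connected: bool) -> str:
        """Advance the model to time `now` (seconds) and report the job state."""
        if not self.job_running:
            return "job killed"
        if shadow_connected:
            # Assumption: regaining contact resets the lease clock.
            self.lost_contact_at = None
            return "running"
        if self.lost_contact_at is None:
            # First moment we notice the submit side is gone.
            self.lost_contact_at = now
        if now - self.lost_contact_at >= self.lease_seconds:
            # Lease expired: give up hope and kill the job.
            self.job_running = False
            return "job killed"
        return "running, waiting for schedd"
```

With a 100-second lease, a job that loses contact at t=60 keeps running through t=159 and is killed at t=160; once killed, it stays killed even if contact later returns.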

I hope this helps with your questions and your quest,

 

-greg