Re: [condor-users] condor_shadow timeout when loosing contact with startd
- Date: 27 Jan 2004 10:28:06 -0600
- From: Geoff Lovett <geoff.lovett@xxxxxxxxxxxxxxxxxxx>
- Subject: Re: [condor-users] condor_shadow timeout when loosing contact with startd
Thanks for the info... I may try decreasing the system-wide TCP keepalive
on my test box, but I'm not sure I want to set it that low on my
production boxes. A Condor-internal keepalive would be most useful!
Thanks,
Geoff
On Tue, 2004-01-27 at 01:59, Derek Wright wrote:
> On 26 Jan 2004 16:26:16 -0600 Geoff Lovett wrote:
>
> > So I'd like to get the two hours condor takes to requeue a job onto a
> > new box when there's a failure down to maybe 20 minutes. To reproduce
> > the 2 hour timeout behaviour, I'm simply running a job then turning off
> > the execute box (to simulate a crash).
> >
> > Indeed, the STARTER_UPDATE_INTERVAL hasn't decreased the timeout.
>
> sorry i didn't chime in sooner. zach's been misleading everyone. :) i
> think he's thinking of the keep alive messages that the schedd sends
> to the startd. in that case, if the startd misses a few keep alive
> messages, the startd will consider the schedd dead, kill the job,
> and advertise itself as available for another job. in all public
> versions of condor, the startd will give up after missing 2 keep alive
> messages. in 6.6.1, you'll be able to configure how many keep alives
> the startd will miss before it gives up on the schedd and kills the
> job.
>
> unfortunately, there's no keep alive message sent in the other
> direction, nor any acknowledgement of the keep alive (it's just a UDP
> packet). the reason for this is that the shadow has a TCP connection
> open to the starter running on the execute machine. the assumption is
> that if anything goes wrong with the execute machine, this socket will
> be closed, the shadow will notice, and it can exit right away. this
> is true if the starter crashes, if the starter is killed, the machine
> is rebooted, etc.
>
> however, as you've noticed, if the machine is simply powered off or
> the kernel crashes, the socket won't necessarily be closed (at least,
> the submit machine's end of it won't see a close). in this case, the
> shadow won't notice that the connection is dead until the TCP stack's
> internal keep alives expire, usually after about 2 hours. we do open
> this socket with SO_KEEPALIVE enabled, so at least it times out
> eventually. :)
>
> the good news is that because of some other changes we've made for the
> 6.7.x development series, we're starting to reconsider this. so, it
> might not be too long before there's a version of condor that will
> have keep alives in the other direction, and you'd be able to
> configure the timeout that the submit machine uses before it gives up
> on a given execute machine. for now, you're out of luck. :( our
> apologies, and sorry for the potential confusion this thread might
> have caused...
>
> -derek
>
>
>
> Condor Support Information:
> http://www.cs.wisc.edu/condor/condor-support/
> To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
> unsubscribe condor-users <your_email_address>
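
For reference, the two-hour delay described above matches the usual TCP
keepalive defaults (on Linux: 7200 seconds idle, then 9 probes 75 seconds
apart, so roughly 7875 seconds before a dead peer is noticed). The Python
sketch below shows how a program can shorten those timers on a single
socket instead of lowering the system-wide setting. It is not Condor code,
and per the thread the Condor versions discussed have no setting for this.
The function name and the 600/60/10 values are made up for illustration,
and the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are assumed to
be available (they are on Linux).

```python
import socket


def enable_fast_keepalive(sock, idle=600, interval=60, probes=10):
    """Enable TCP keepalive on one socket and shorten its timers.

    Hypothetical helper for illustration only. Returns the approximate
    number of seconds before a silently dead peer is detected, assuming
    the per-socket options below were actually applied.
    """
    # SO_KEEPALIVE is the option the thread says the shadow's socket uses.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides of the system-wide timers. These constants are
    # platform-specific, so skip them where Python does not expose them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        # Seconds between unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        # Unanswered probes allowed before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

    # Idle period plus the time needed for every probe to go unanswered.
    return idle + interval * probes


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    seconds = enable_fast_keepalive(s)
    print("dead peer detected after about", seconds // 60, "minutes")
    s.close()
```

With the illustrative values the detection time works out to
600 + 60 * 10 = 1200 seconds, i.e. the 20 minutes asked for earlier in the
thread. The calls only affect sockets owned by the program that makes them,
so this would have to be done inside the application holding the connection.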