Mailing List Archives
Re: [condor-users] condor_shadow timeout when losing contact with startd
- Date: Tue, 27 Jan 2004 10:13:44 -0600
- From: Erik Paulson <epaulson@xxxxxxxxxxx>
- Subject: Re: [condor-users] condor_shadow timeout when losing contact with startd
On Tue, Jan 27, 2004 at 01:59:22AM -0600, Derek Wright wrote:
>
> however, as you've noticed, if the machine is simply powered off or
> the kernel crashes, the socket won't necessarily be closed (at least
> the submit machine end of it won't see it). in this case, the shadow
> won't notice that the connection has been closed until the TCP stack's
> internal keep alives expire, usually 2 hours. we do open this socket
> with SO_KEEPALIVE enabled, so at least it times out eventually. :)
>
> the good news is that because of some other changes we've made for the
> 6.7.x development series, we're starting to reconsider this. so, it
> might not be too long before there's a version of condor that will
> have keep alives in the other direction, and you'd be able to
> configure the timeout that the submit machine uses before it gives up
> on a given execute machine. for now, you're out of luck. :( our
> apologies, and sorry for the potential confusion this thread might
> have caused...
>
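For readers who haven't used the option Derek mentions, here is a minimal Python sketch (not Condor's actual code, which is C++) of opening a TCP socket with SO_KEEPALIVE enabled. On Linux, the per-socket options TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT override the system-wide keepalive defaults for that one socket, which is roughly the kind of knob a configurable timeout would need. The function name make_keepalive_socket and the idle/interval/count values are hypothetical, chosen only for illustration.

```python
import socket

def make_keepalive_socket(idle=600, interval=60, count=5):
    """Create a TCP socket with keepalives enabled.

    Hypothetical helper: idle (seconds before the first probe),
    interval (seconds between probes) and count (unanswered probes
    before giving up) are illustrative values, not Condor settings.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to probe an idle connection rather than wait forever.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides; where a constant isn't available on this
    # platform, the system-wide setting (e.g. tcp_keepalive_time) applies.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            s.setsockopt(socket.IPPROTO_TCP, opt, value)
    return s
```

With these example values, a dead peer on an otherwise idle connection would be noticed after roughly 600 + 60 × 5 = 900 seconds instead of about two hours.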
You're not entirely out of luck: you can decrease the system-wide TCP
keepalive timer to something smaller than 2 hours. On Linux, it's
controlled by the value (in seconds; the default is 7200) in
/proc/sys/net/ipv4/tcp_keepalive_time
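To estimate what this buys you: on Linux, an idle connection to a dead peer is only torn down after the idle time plus all the unanswered probes, i.e. tcp_keepalive_time + tcp_keepalive_intvl × tcp_keepalive_probes. The sketch below reads those three values from /proc, falls back to the usual kernel defaults (7200, 75 and 9) if a file can't be read, and reports the worst-case detection time. The helper names are hypothetical; changing the value itself needs root, for example with sysctl -w net.ipv4.tcp_keepalive_time=600 (600 being just an example).

```python
from pathlib import Path

# Usual Linux defaults, used only if /proc can't be read (e.g. non-Linux).
DEFAULTS = {
    "tcp_keepalive_time": 7200,  # seconds idle before the first probe
    "tcp_keepalive_intvl": 75,   # seconds between unanswered probes
    "tcp_keepalive_probes": 9,   # unanswered probes before the connection is dropped
}

def read_keepalive_settings(base="/proc/sys/net/ipv4"):
    """Read the system-wide keepalive sysctls, falling back to DEFAULTS."""
    settings = {}
    for name, default in DEFAULTS.items():
        try:
            settings[name] = int(Path(base, name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = default
    return settings

def worst_case_detection(settings):
    """Seconds before an idle connection to a dead peer is torn down."""
    return (settings["tcp_keepalive_time"]
            + settings["tcp_keepalive_intvl"] * settings["tcp_keepalive_probes"])

if __name__ == "__main__":
    current = read_keepalive_settings()
    total = worst_case_detection(current)
    print(f"{current} -> about {total} s ({total / 60:.0f} min)")
```

Note that lowering only tcp_keepalive_time to 600 still leaves 75 × 9 = 675 seconds of probing at the default interval and probe count, so detection would take about 1275 seconds (roughly 21 minutes).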
-Erik
Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>