Re: [HTCondor-users] VM Suspend/Resume
- Date: Thu, 26 May 2016 09:59:54 +0200
- From: Laurence <lfield@xxxxxxx>
- Subject: Re: [HTCondor-users] VM Suspend/Resume
Hi Todd,
On 26/05/16 00:05, Todd Tannenbaum wrote:
On 5/24/2016 4:13 PM, Laurence Field wrote:
Hi Todd,
Hi Laurence, thank you very much for this writeup. We are definitely thinking
about how we can make improvements and address the shortcomings you
identified. Meanwhile, some questions below....
The CCB_HEARTBEAT_INTERVAL needs to be set to repair closed connections
and we use 300s rather than the default of 20mins to speed up the
repair.
^^^ The need for this setting is a little surprising... HTCondor
should notice asynchronously, at any time, if the TCP connection to the CCB
server is closed, so I am guessing that when your VM is resumed the
kernel does not know the TCP socket is dead? I.e., are TCP sockets that
were open at the time of suspending the VM still considered open upon
VM resume?
Yes. From what I could see, if the VM is paused the TCP sockets remain
open and everything is fine. However, if the VM is suspended to disk the
server receives a signal that the connection has been closed but the VM
thinks the connection is still open, hence the need for the heartbeat.
The SEC_DEFAULT_SESSION_LEASE was set to 86400s so that the same ccbid
will be used on reconnection.
^^^ This one is also surprising. The only way I can think of that changing
SEC_DEFAULT_SESSION_LEASE should matter is if, upon resuming the VM, the
system clock still has the incorrect (old) time for some number of
seconds, and during that time HTCondor attempted to reconnect to CCB.
In other words, if your VM is suspended for 12 hours, when resuming
your VM, is the system clock immediately updated to the correct time
before processes can run, or is it possible the VM runs for a bit with
a clock showing 12 hours in the past? I don't know if the hypervisor
typically takes care of this or it is up to something in the VM
(ntpd?) to eventually resync the clock.
Yes, in my tests the clock was not updated automatically, so it was off by
however long the VM had been suspended.
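For reference, the two VM-side settings discussed so far could be written roughly as follows. This is only a sketch: the values are the ones quoted in this thread, while the choice of file (for example a local configuration file on the VM) is an assumption and is not stated here.

```
# VM (execute node) side; values as quoted in this thread.

# Send a CCB heartbeat every 300 s instead of the default of 20 minutes
# mentioned above, so a connection closed during suspend-to-disk is
# noticed and repaired sooner.
CCB_HEARTBEAT_INTERVAL = 300

# Keep security sessions for 24 h so that the same ccbid is used on
# reconnection.
SEC_DEFAULT_SESSION_LEASE = 86400
```

After changing these, the daemons would need to re-read their configuration (for example with condor_reconfig, or a restart if a setting is only read at startup).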
On the schedd and the schedd of the CE which we use, the
SEC_DEFAULT_SESSION_LEASE is also set to 86400s (24h).
^^^ This seems like it is only required if
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION is not being used, which it
is by default starting with v8.5.1 of HTCondor. What version of
HTCondor were you using for your tests?
Authentication is being done with a user proxy. The version in the VM is
v8.0.6 while the servers are v8.3.8.
The ALIVE_INTERVAL on the schedd was increased from the default of 5 mins to
86400s (24h).
^^^ This one does not seem like it should matter at all if you are
running HTCondor v8.4.0 or above... Did you observe problems without
this setting or did you just guess it was needed?
We are not running v8.4.0 or above.
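For reference, the submit-side settings described above could be written roughly as follows. This is only a sketch: the values are the ones quoted in this thread for the pre-8.4 servers used in these tests, and placing them in the local configuration of each schedd host is an assumption.

```
# Schedd side (also applied on the CE's schedd); values as quoted in
# this thread.

# Keep security sessions for 24 h, matching the value used in the VM.
SEC_DEFAULT_SESSION_LEASE = 86400

# Raised from the default of 5 minutes to 24 h.
ALIVE_INTERVAL = 86400
```

According to the replies above, neither setting should be needed on newer versions (match password authentication is the default from v8.5.1, and ALIVE_INTERVAL should not matter from v8.4.0), so this fragment applies only to the older setup described here.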
We also had to
remove the TimerRemove attribute to work around this bug
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5470.
^^^ The bit with TimerRemove should not be necessary if using HTCondor
v8.4.4 or above, so hoping your tests were with an earlier version
(else we may have another bug to fix).
Again, we are not running v8.4.0 or above.
Cheers,
Laurence