Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] VM Suspend/Resume

Date: Thu, 26 May 2016 09:59:54 +0200
From: Laurence <lfield@xxxxxxx>
Subject: Re: [HTCondor-users] VM Suspend/Resume

Hi Todd,

On 26/05/16 00:05, Todd Tannenbaum wrote:

On 5/24/2016 4:13 PM, Laurence Field wrote:
Hi Todd,
Hi Laurence, thank you much for this writeup. Definitely thinking on how we can make improvements and address the shortcomings you identified. Meanwhile some questions below....
The CCB_HEARTBEAT_INTERVAL needs to be set to repair closed connections
and we use 300s rather than the default of 20 mins to speed up the repair.
^^^ The need for this setting is a little surprising... HTCondor should notice asynchronously, at any time, if the TCP connection to the CCB server is closed, so I am guessing that when your VM is resumed the kernel does not know the TCP socket is dead? I.e. are TCP sockets that were open at the time of suspending the VM still considered open upon VM resume?

Yes. From what I could see, if the VM is paused the TCP sockets remain open and everything is fine. However, if the VM is suspended to disk the server receives a signal that the connection has been closed but the VM thinks the connection is still open, hence the need for the heartbeat.

The SEC_DEFAULT_SESSION_LEASE was set to 86400s so that the same ccbid
will be used on reconnection.
^^^ This one is also surprising. The only way I can think changing the SEC_DEFAULT_SESSION_LEASE should matter is if upon resuming the VM the system clock still has the incorrect (old) time for some number of seconds, and during that time HTCondor attempted to reconnect to CCB. In other words, if your VM is suspended for 12 hours, when resuming your VM, is the system clock immediately updated to the correct time before processes can run, or is it possible the VM runs for a bit with a clock showing 12 hours in the past? I don't know if the hypervisor typically takes care of this or it is up to something in the VM (ntpd?) to eventually resync the clock.

Yes, in my tests the clock was not automatically updating so was off by however long the VM was suspended.
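
For reference, the two VM-side settings discussed so far could be written in the HTCondor configuration roughly as follows. This is only a sketch using the values quoted above; which configuration file it belongs in on the VM is an assumption and depends on the local setup.

```
# Sketch only: VM-side settings, values taken from this thread.
# Send a heartbeat to the CCB server after 300 s of silence
# (the default cited above is 20 mins), so a connection that was
# dropped during suspend-to-disk is noticed and repaired sooner.
CCB_HEARTBEAT_INTERVAL = 300

# Keep security sessions for 24 h so that the same ccbid
# is used on reconnection.
SEC_DEFAULT_SESSION_LEASE = 86400
```

Comments are kept on their own lines rather than trailing the values, since a trailing comment can end up being read as part of the value.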

On the Sched and Sched of the CE which we use, the
SEC_DEFAULT_SESSION_LEASE is also set to 86400s (24h).
^^^ This seems like it is only required if SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION is not being used, which it is by default starting with v8.5.1 of HTCondor. What version of HTCondor were you using for your tests?

Authentication is being done with a user proxy. The version in the VM is v8.0.6 while the servers are v8.3.8.

The
ALIVE_INTERVAL on the sched was increased from the default of 5 mins to
86400s (24h).
^^^ This one does not seem like it should matter at all if you are running HTCondor v8.4.0 or above... Did you observe problems without this setting or did you just guess it was needed?

We are not running v8.4.0 or above.
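
The scheduler-side settings described above could be sketched in the same way. Again, the values are the ones quoted in this thread, the placement in a particular configuration file is an assumption, and per Todd's comments these may be unnecessary on newer HTCondor versions.

```
# Sketch only: settings on the scheduler hosts, values taken from this thread.
# Match the 24 h session lease used on the VM side.
SEC_DEFAULT_SESSION_LEASE = 86400

# Raised from the default of 5 mins to 24 h so that a suspended VM
# is not given up on while it cannot send keepalives.
ALIVE_INTERVAL = 86400
```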

We also had to
remove the TimerRemove attribute to work around this bug
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5470.
^^^ The bit with TimerRemove should not be necessary if using HTCondor v8.4.4 or above, so hoping your tests were with an earlier version (else we may have another bug to fix).

Again, we are not running v8.4.0 or above.

Cheers,

Laurence

