[HTCondor-users] VM Suspend/Resume
- Date: Tue, 24 May 2016 23:13:22 +0200
- From: Laurence Field <Laurence.Field@xxxxxxx>
- Subject: [HTCondor-users] VM Suspend/Resume
Hi Todd,
Here is the summary of how we configured HTCondor so that jobs could
survive a VM being suspended for up to 24 hours. The context for this is
the vLHC@home volunteer computing project where machines are used
opportunistically and the VMs they run are suspended, either in memory
or to disk, when the machine is in use by the volunteer.
The Startd runs in the VM that is to be suspended. First, the default
value of MAX_TIME_SKIP needs to be increased from (60*20)s (20 minutes)
to 86400s (24h). This is currently hard-coded, so we had to patch the
library. Without this, the daemons are restarted if the time skips by
more than 20 minutes, and the jobs are lost.
https://github.com/htcondor/htcondor/blob/master/src/condor_daemon_core.V6/daemon_core.cpp#L52
Next, the NOT_RESPONDING_TIMEOUT value needs to be increased from the
default of 1h to 86400s (24h), again to stop the daemons being restarted
after a time skip. If this is not done, the master concludes that the
child has hung.
The CCB_HEARTBEAT_INTERVAL needs to be set to repair closed connections
and we use 300s rather than the default of 20mins to speed up the repair.
The SEC_DEFAULT_SESSION_LEASE was set to 86400s so that the same ccbid
will be used on reconnection.
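As a rough sketch, the Startd-side settings above might look like this in the HTCondor configuration inside the VM. The macro names and values are those given in this message; which config file they go in is left open, and MAX_TIME_SKIP is not included because it is hard-coded rather than configurable.

```
# Execute node (inside the VM): sketch of the settings described above
NOT_RESPONDING_TIMEOUT    = 86400
CCB_HEARTBEAT_INTERVAL    = 300
SEC_DEFAULT_SESSION_LEASE = 86400
```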
On the Collector machine we had to increase the CLASSAD_LIFETIME from
the default of 15 minutes to 86400s (24h) so that the Collector would
not forget about the VM. The CCB_SWEEP_INTERVAL was increased to
86400s (24h) so that connections which may have been closed are not
cleaned up prematurely. The SEC_DEFAULT_SESSION_LEASE is also set here
to 86400s (24h).
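A minimal sketch of the corresponding Collector-side configuration, using only the macro names and values given above:

```
# Collector machine: sketch of the settings described above
CLASSAD_LIFETIME          = 86400
CCB_SWEEP_INTERVAL        = 86400
SEC_DEFAULT_SESSION_LEASE = 86400
```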
On the Schedd, and on the Schedd of the CE which we use, the
SEC_DEFAULT_SESSION_LEASE is also set to 86400s (24h). The
ALIVE_INTERVAL on the Schedd was increased from the default of 5 minutes
to 86400s (24h).
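Sketched as configuration for the Schedd side, again using only the names and values from this message:

```
# Schedd: sketch of the settings described above
# (SEC_DEFAULT_SESSION_LEASE is set on both the Schedd and the CE's Schedd)
SEC_DEFAULT_SESSION_LEASE = 86400
ALIVE_INTERVAL            = 86400
```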
Finally in the job route on the CE the JobLeaseDuration was set to
86400s (24h) so that the shadow does not die prematurely. We also had to
remove the TimerRemove attribute to work around this bug
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5470.
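For illustration only, a route carrying these two changes might look roughly like the following. This assumes the ClassAd-style JOB_ROUTER_ENTRIES syntax with set_ and delete_ rules; the route name and TargetUniverse are hypothetical placeholders and not our actual route, and only the last two attributes correspond to what is described above.

```
# Hypothetical route: name and TargetUniverse are placeholders.
JOB_ROUTER_ENTRIES = \
   [ \
     name = "Volunteer_Pool"; \
     TargetUniverse = 5; \
     set_JobLeaseDuration = 86400; \
     delete_TimerRemove = true; \
   ]
```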
By doing all the above we managed to resume a job after suspending (both
in memory and to disk) a VM for nearly 24 hours. There are two caveats
that we have spotted so far. If for any reason the Startd loses contact
with the Shadow, it will not be able to send the final exit code of the
job and will then wait until the JobLease has expired, which in this
case is a long time! The other is that if the network environment
changes during a suspend, e.g. you take your laptop home, the CCB
connection will not be re-established with the same id and the
connection with the Shadow will be lost.
This is by no means a final perfect solution, just one that we managed
to get working so comments and suggestions are welcome.
Regards,
Laurence