From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Birkett - STFC UKRI via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Date: Friday, 1 December 2023 at 16:02
To: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Cc: Birkett, Thomas (STFC,RAL,SC) <thomas.birkett@xxxxxxxxxx>
Subject: [HTCondor-users] Condor and Docker Live-Restore
Dear Condor community,
I hope everyone is keeping well. At our site we depend on Docker as our containerisation technology, and this layer regularly needs patching with new version updates. Updating Docker usually involves draining the execution point of jobs, patching, then returning the node to a production state. To try to reduce downtime, we’ve recently been experimenting with Docker’s Live Restore functionality (https://docs.docker.com/config/containers/live-restore/). The outcome of this testing has been mostly positive: containers remain running with no service impact while Docker is updated or restarted.
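For context, live restore is switched on in the Docker daemon configuration. The fragment below is a minimal sketch, assuming the default configuration path /etc/docker/daemon.json and no other settings; a real file would typically carry further options alongside this key.

```json
{
  "live-restore": true
}
```

On systemd hosts the Docker documentation describes applying this with `systemctl reload docker`, which avoids restarting the daemon.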
However, I found that the startd loses all running jobs on the execution point if Docker is restarted or updated in this live-restore fashion. This leaves the environment in a state where all containers are still running and functioning, while commands such as `condor_who` return no results. Is there a way within Condor to make the startd “live-restore” aware, so that it keeps its list of running jobs/containers rather than losing them all?
Any help in this area will be gratefully received and many thanks in advance.
Best wishes,
Thomas Birkett
Senior Systems Administrator
Scientific Computing Department
Science and Technology Facilities Council (STFC)
Rutherford Appleton Laboratory, Chilton, Didcot
OX11 0QX
