Dear Condor community,

I hope everyone is keeping well.

At our site we depend on Docker as our containerisation technology, and this layer regularly needs patching with new versions. Updating Docker usually involves draining the execution point of jobs, patching, and then returning the node to a production state. To try to reduce downtime, we have recently been experimenting with Docker's Live Restore functionality (https://docs.docker.com/config/containers/live-restore/).
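
For context, here is a minimal sketch of how live restore is typically switched on, assuming the daemon reads its settings from a JSON file such as /etc/docker/daemon.json. The helper name `enable_live_restore` and the sample `log-driver` entry are illustrative only and are not our actual configuration.

```python
import json


def enable_live_restore(config_text: str) -> str:
    """Return daemon.json text with "live-restore" set to true.

    Hypothetical helper: any existing keys in the configuration are kept.
    """
    # An empty or missing file is treated as an empty configuration.
    config = json.loads(config_text) if config_text.strip() else {}
    config["live-restore"] = True
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Illustrative input only; a real daemon.json will differ per site.
    print(enable_live_restore('{"log-driver": "journald"}'))
```

The resulting text would be written back to the daemon configuration file, after which the Docker daemon needs to reload its configuration (for example with `systemctl reload docker` on a systemd host) for the setting to take effect.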
The outcome of this testing has been mostly positive: containers remain running with no service impact while Docker is updated or restarted.
However, I found that the startd loses all running jobs on the execution point if Docker is restarted or updated in this live-restore fashion. This leaves the environment in a state where all containers are still running and functioning, while commands such as `condor_who` return no results.

Is there a function within Condor that would make the startd “live-restore” aware, so that it maintains its list of running jobs/containers instead of losing them when Docker restarts?

Any help in this area will be gratefully received, and many thanks in advance.

Best wishes,
Thomas Birkett
Senior Systems Administrator
Scientific Computing Department
Science and Technology Facilities Council (STFC)
Rutherford Appleton Laboratory, Chilton, Didcot