The important data is in the spool directory, and the two daemons that keep persistent state there are the negotiator (accounting data of past resource usage by users) and the schedd (job queue, history of past jobs, and sometimes job data files).
Shutting down the daemons during a backup is always going to be the safest option.
The accounting, job queue, and history files are written as transactional append-only logs to allow for graceful recovery from a failure. Backing up these files live should work pretty well. Depending on the order of copying the job queue and history files,
you may end up with jobs that appear in the history twice or not at all after a restore.
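As a rough illustration of the live-backup approach described above, here is a minimal Python sketch that copies the whole spool directory into a fresh timestamped folder. The function name snapshot_spool and the backup layout are hypothetical and not part of HTCondor. The sketch does not try to control the order in which the job queue and history files are copied, so the caveat about duplicated or missing history entries still applies.

```python
import shutil
import time
from pathlib import Path


def snapshot_spool(spool_dir, backup_root):
    """Copy the whole spool directory into a new timestamped folder.

    Hypothetical helper for illustration only; not an HTCondor tool.
    """
    spool = Path(spool_dir)
    if not spool.is_dir():
        raise FileNotFoundError(f"spool directory not found: {spool}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"spool-{stamp}"
    # copytree refuses to overwrite an existing destination, so two
    # snapshots taken in the same second fail instead of mixing files.
    shutil.copytree(spool, dest)
    return dest
```

The spool path can be looked up with `condor_config_val SPOOL`. On a live system, files may disappear while the copy is running, in which case copytree raises shutil.Error after copying what it could. Shutting the daemons down first avoids that.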
With any backup scheme, things will get funny if the system runs for a while between the backup and the restore. Jobs that previously completed and left the queue can come back to life. Jobs submitted after the backup no longer exist and new jobs with
the same job ids will appear. This can quickly confuse job-related state that lives under the user's control (job data files, job event log, etc).
Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project