==========================================
  Simple Incremental Backups (Encrypted)
==========================================

:Author: Martin Blais <blais@furius.ca>
:Date: 2005-07-23
:Abstract:

   A solution for my offsite backups.


Forces
======

We want:

- Networked backup: the size of backup data must be kept somewhat low;

- Incremental backups: we do not want to copy ALL the files every time.  While we
  do not need to optimize the backup size, we want to keep it relatively
  low. "As low as possible" would mean doing file diffs.

- We want a simple solution.  I really mean *simple*.  You should be able to
  understand the algorithm and read the source code within 30 mins.

- It should be possible to store the backup files in encrypted form and on a
  remote platform.  Thus the backed-up data is assumed to be inaccessible while
  we're carrying out the backup.

- We want to reuse existing open source tools as much as possible: tar,
  bzip2, gzip, diff, gpg.

- Our code should be tested: there are people out there providing open source
  backup solutions which are buggy and not tested.  They don't understand that
  untested backups are no better than no backups at all.


Solution
========

Some definitions:

  Client
    The machine that needs to be backed up.

  Server
    The machine where backed up data archives are to be sent.

The algorithms
--------------

Backup
~~~~~~

A file listing the files that were last backed up, including a checksum for
each file (e.g. an MD5 sum), is kept and maintained on the client.  If this
list file is not available, a full backup should be run.
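
As a rough illustration, the file list could be kept as a mapping from path to
checksum and timestamp.  This is only a sketch: the JSON format and the function
names ``build_manifest``, ``load_manifest`` and ``save_manifest`` are
assumptions made for the example, not part of the design.

.. code-block:: python

   import hashlib
   import json
   import os


   def file_md5(path, chunk_size=1 << 20):
       """Return the hex MD5 digest of a file, reading it in chunks."""
       digest = hashlib.md5()
       with open(path, "rb") as f:
           for chunk in iter(lambda: f.read(chunk_size), b""):
               digest.update(chunk)
       return digest.hexdigest()


   def build_manifest(root):
       """Map each file path (relative to root) to its MD5 sum and mtime."""
       manifest = {}
       for dirpath, _dirnames, filenames in os.walk(root):
           for name in filenames:
               full = os.path.join(dirpath, name)
               rel = os.path.relpath(full, root)
               manifest[rel] = {"md5": file_md5(full),
                                "mtime": int(os.path.getmtime(full))}
       return manifest


   def load_manifest(path):
       """Return the saved list, or None if it is missing (full backup)."""
       if not os.path.exists(path):
           return None
       with open(path) as f:
           return json.load(f)


   def save_manifest(manifest, path):
       """Write the list where the next backup run will look for it."""
       with open(path, "w") as f:
           json.dump(manifest, f, indent=2, sort_keys=True)

Returning ``None`` for a missing list gives the caller a simple way to fall back
to a full backup.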
 
1. We load the last-backed-up file list;

2. We compare against the current list of files to be backed up:

   - Files that have been added are marked to be backed up;
   - Files that have changed (their MD5 sums or timestamps differ) are marked to
     be backed up.

3. We tar/zip (.tar.bz2) the list of files that have been added or changed into
   a backup archive file;

4. We also save a full listing of the files that are present at the moment of
   backup, in a special file in the backup archive;

5. The backup archive is optionally encrypted with a key, named with the current
   date/time, and sent to the server, to be put in a directory along with the
   other archives.
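
Steps 2 to 4 can be sketched as follows.  This is a minimal sketch under stated
assumptions: the old and new file lists are dictionaries mapping a relative path
to its checksum and timestamp, the special listing file is stored as JSON under
the hypothetical name ``.backup-manifest.json``, and the archive naming scheme
is invented for the example.  Encryption and transfer to the server (step 5)
are left out.

.. code-block:: python

   import io
   import json
   import os
   import tarfile
   import time

   MANIFEST_NAME = ".backup-manifest.json"  # hypothetical name, an assumption


   def files_to_back_up(old, new):
       """Return the sorted paths that were added or changed since `old`.

       `old` is None when no previous list exists, which means a full backup.
       """
       if old is None:
           return sorted(new)
       return sorted(path for path, info in new.items()
                     if path not in old or old[path] != info)


   def write_archive(root, old, new, outdir, now=None):
       """Write a .tar.bz2 holding the changed files plus the full listing."""
       stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
       archive = os.path.join(outdir, "backup-%s.tar.bz2" % stamp)
       with tarfile.open(archive, "w:bz2") as tar:
           for rel in files_to_back_up(old, new):
               tar.add(os.path.join(root, rel), arcname=rel)
           # Step 4: store the full listing so a restore knows what must exist.
           data = json.dumps(new, sort_keys=True).encode("utf-8")
           member = tarfile.TarInfo(MANIFEST_NAME)
           member.size = len(data)
           tar.addfile(member, io.BytesIO(data))
       return archive

Comparing whole entries means a file is re-archived when either its checksum or
its timestamp differs, which matches step 2.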

Restore
~~~~~~~

We assume that we have a list of backup archives whose union contains at least
one copy of every file in the file list stored in the archive we restore from.

1. We open the latest archive file for the date at which we want to restore the
   backup and fetch the special file list for that date;

2. We fetch the list of files stored within all the archives that precede the
   date for which we want to restore.  We do this lazily--we only open the files
   if we need to (see below).

3. For each of the files in the list of files to restore: look for it from
   latest to earliest archive file.  Open the archives as needed (and keep their
   file lists cached).

   a. When a file is found, untar it into the restored output directory;

   b. If we reach the end of the list of archives without finding the file,
      output an error and continue;
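
The lookup above might be sketched like this.  It assumes unencrypted
``.tar.bz2`` archives that each carry the full listing as a JSON member named
``.backup-manifest.json``; that name and the ``restore`` function are
hypothetical.  As a small deviation from step 3a, the sketch first decides which
archive supplies each file and then extracts per archive, so that each archive
is opened for extraction only once.

.. code-block:: python

   import json
   import sys
   import tarfile

   MANIFEST_NAME = ".backup-manifest.json"  # hypothetical name, an assumption


   def restore(archives, outdir):
       """Restore the state recorded in the last archive of `archives`.

       `archives` lists archive paths from earliest to latest, ending with the
       one for the date to restore.  Returns the paths that were not found.
       """
       with tarfile.open(archives[-1], "r:bz2") as tar:
           wanted = json.load(tar.extractfile(MANIFEST_NAME))

       names_cache = {}  # archive path -> set of member names, filled lazily

       def names(archive):
           if archive not in names_cache:
               with tarfile.open(archive, "r:bz2") as tar:
                   names_cache[archive] = set(tar.getnames())
           return names_cache[archive]

       plan = {}      # archive path -> files to extract from it
       missing = []
       for rel in sorted(wanted):
           # Search from the latest archive back to the earliest one.
           for archive in reversed(archives):
               if rel in names(archive):
                   plan.setdefault(archive, []).append(rel)
                   break
           else:
               missing.append(rel)
               print("error: %s not found in any archive" % rel,
                     file=sys.stderr)

       for archive, members in plan.items():
           with tarfile.open(archive, "r:bz2") as tar:
               for rel in members:
                   tar.extract(rel, path=outdir)
       return missing

Older archives are only opened when a file is not found in a more recent one,
which keeps the lazy behaviour described in step 2.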


Another idea for restore would be to untar all the archives in order over each
other and then to remove the files that are not supposed to be present.  This
might actually be faster than the lookup I was planning to do.
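
A sketch of this alternative, under the same kind of assumptions: unencrypted
``.tar.bz2`` archives, each with the full listing stored as a JSON member under
the hypothetical name ``.backup-manifest.json``.  The function name
``restore_by_overlay`` is also invented for the example.

.. code-block:: python

   import json
   import os
   import tarfile

   MANIFEST_NAME = ".backup-manifest.json"  # hypothetical name, an assumption


   def restore_by_overlay(archives, outdir):
       """Extract archives earliest to latest, then prune unlisted files."""
       for archive in archives:
           with tarfile.open(archive, "r:bz2") as tar:
               # Later archives overwrite files extracted from earlier ones.
               tar.extractall(outdir)

       # The listing left on disk comes from the latest archive.
       with open(os.path.join(outdir, MANIFEST_NAME)) as f:
           wanted = json.load(f)

       # Remove everything the listing does not mention, including the
       # listing file itself.
       for dirpath, _dirnames, filenames in os.walk(outdir):
           for name in filenames:
               full = os.path.join(dirpath, name)
               if os.path.relpath(full, outdir) not in wanted:
                   os.remove(full)
       return wanted

This version reads every archive in full and may leave empty directories
behind, so whether it is actually faster would have to be measured.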


Notes
=====

- We do not perform diffs.  Backups could be smaller with diffs, but diffs
  would complicate the algorithm.  This is in the interest of simplicity.

- We do not optimize file moves.  Tough luck.



Links
=====

See this thread:
http://ask.slashdot.org/article.pl?sid=05/07/22/187217&tid=198&tid=95&tid=4


