Silhouette Clone, rdiff-backup style: Part III

David A. Harding

[This blog is offered as an alternative to Chris Ingram's blog of almost the same name. If I understand him correctly, Chris describes using the Subversion (svn) Revision Control System (RCS) to maintain filesystem snapshots. I use a different tool, rdiff-backup, for the same purpose, and I'll attempt to describe how to do with rdiff-backup what Chris does with svn.]

I once wrote a script that converted FLAC-encoded files to Ogg Vorbis-encoded files. That was before the command oggenc did the job itself. So a few weeks ago, while cleaning out my $HOME/bin, I deleted $HOME/bin/flac2ogg. If I have the rdiff-backup tool installed on the local computer and access to the filesystem with the backup data, I can use rdiff-backup to find the revision I want:

rdiff-backup -l /shares/data/perm/backups/home.rdiff/bin/flac2ogg
Found 1 increments:
    flac2ogg.2007-05-03T16:10:01-04:00.snapshot.gz Thu May 3 16:10:01 2007

Restoring using rdiff-backup is easy:

rdiff-backup -r 2007-05-03T16:10:01-04:00  \
        /shares/data/perm/backups/home.rdiff/bin/flac2ogg f2o.restore

The filename f2o.restore is arbitrary; this file will be created (or overwritten) with the contents of the second argument (.../bin/flac2ogg) as they were at the time of the first argument (2007-05-03T16:10:01-04:00).
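If overwriting an existing file would be a problem, a small wrapper can check the target first. The following is only a sketch: the function name restore_as_of and the RDIFF_BACKUP variable are my own inventions for illustration, not part of rdiff-backup. The only rdiff-backup feature it relies on is the -r option used above.

```shell
#!/bin/bash
# Hypothetical helper (not part of rdiff-backup): restore SOURCE as it was
# at TIME into TARGET, but refuse to touch a TARGET that already exists.
# RDIFF_BACKUP is an assumed override so the command can be swapped or stubbed.
RDIFF_BACKUP=${RDIFF_BACKUP:-rdiff-backup}

restore_as_of() {
        local time="$1" source="$2" target="$3"
        # -L also catches a dangling symlink, which -e alone would miss
        if [ -e "$target" ] || [ -L "$target" ]
        then
                echo "refusing to overwrite existing $target" >&2
                return 1
        fi
        "$RDIFF_BACKUP" -r "$time" "$source" "$target"
}
```

With that defined, the restore above becomes restore_as_of 2007-05-03T16:10:01-04:00 /shares/data/perm/backups/home.rdiff/bin/flac2ogg f2o.restore, and a second run with the same target name stops with an error instead of replacing the file.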

Reverting a file that is changed, but not deleted, works in the same way. For example, I can compare the file I'm writing this blog in with yesterday's version:

rdiff-backup -r 1D \
        /shares/data/perm/backups/home.rdiff/doc/mine/blogs/newblogs \
        yesterdays-blog.txt
diff -u yesterdays-blog.txt doc/mine/blogs/newblogs | tail -n3
+Reverting a file that is changed, but not deleted, works in the same
+way.  For example, I can compare the file I'm writing this blog in with
+yesterdays version:

The time specifier, 1D, means 1 day ago. Everything else is similar to the previous example.

Restoring a directory also works the same way. I can create a snapshot of my home directory as it was one hour ago with the following command:

rdiff-backup -r 1h \
        /shares/data/perm/backups/home.rdiff/ /shares/srv/home/1h-ago

The time specifier, 1h, means 1 hour ago. I put the snapshot in the /shares/srv/home/1h-ago directory to share it over the network filesystem (see my previous blog). If we run the command above, with some modifications, in the loop below, we can create a snapshot for each of the last 10 hours.

for i in `seq 1 10`
do
  rdiff-backup -r ${i}h \
        /shares/data/perm/backups/home.rdiff/ /shares/srv/home/${i}h-ago
done

Each snapshot created using this command will use the same amount of disk space the original directory used when the temporally nearest backup was made. In this case, ten copies of my home directory take up 8.1GiB of disk space.

We can trim this down significantly by using filesystem (hard) links for files with the same contents. A script that does that for all the files beneath /shares/srv/home is below:

#!/bin/bash -eu
MD5LIST=md5list

## Create a md5sum to filename pair for every file in the
## top-level snapshot directory
find /shares/srv/home/ -type f -exec md5sum '{}' \; > $MD5LIST

## Find all the files with matching md5sums
#
## d41d8cd98f00b204e9800998ecf8427e is the md5 of an empty file --
##   we don't need those
grep -v ^d41d8cd98f00b204e9800998ecf8427e $MD5LIST | sort | uniq -d -w32 \
  | sed -e 's/ .*//' | while read md5
do
        ## link files with matching md5sums together
        oldfile=''
        for file in $( grep $md5 $MD5LIST | sed -e 's/.\{34\}//' )
        do
                ln -f ${oldfile:-$file} $file || true
                oldfile=$file
        done
done

Two serious flaws in the script above are that it doesn't handle whitespace (spaces, tabs, and newlines) in filenames, and it also doesn't check, and therefore may clobber, file permissions. Both flaws are potential security problems, but can be worked around with the addition of a few tests. Another, less serious, flaw is that the code above replaces some symlinks with hardlinks.
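One way those tests might look is sketched below. This is not a drop-in replacement for the script above, just an illustration under some assumptions: bash 4 or later (for associative arrays), GNU find, md5sum and stat, and a function name, link_duplicates, that I made up. It passes filenames around NUL-terminated so whitespace survives, links two files only when mode, owner and group all match, and confirms with cmp that the contents really are identical before linking.

```shell
#!/bin/bash
# Sketch only: hard-link identical regular files beneath a directory.
# link_duplicates and its variable names are illustrative, not from the
# original script. Assumes bash >= 4 and GNU find/md5sum/stat.
link_duplicates() {
        local top="$1" file sum meta key
        declare -A first        # "md5:mode:uid:gid" -> first file seen with it

        # -print0 with read -d '' keeps spaces, tabs and newlines in names
        # intact; -type f never matches symlinks; ! -empty skips empty files.
        while IFS= read -r -d '' file
        do
                sum=$(md5sum < "$file" | cut -c1-32)
                meta=$(stat -c '%a:%u:%g' -- "$file")
                key="$sum:$meta"
                if [ -z "${first[$key]:-}" ]
                then
                        first[$key]=$file
                elif [ ! "$file" -ef "${first[$key]}" ] &&
                     cmp -s -- "${first[$key]}" "$file"
                then
                        # Same bytes, same mode, same owner and group:
                        # replace this copy with a link to the first one.
                        ln -f -- "${first[$key]}" "$file"
                fi
        done < <(find "$top" -type f ! -empty -print0)
}
```

Running link_duplicates /shares/srv/home would then cover the same tree as the original script. Hard links also share timestamps, which this sketch does not compare, so a linked file takes on the modification time of the copy it was linked to.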

As I suggested earlier, the space savings are significant; a side-by-side comparison of directory sizes follows:

paste <(du -h --max-depth=1 /shares/srv/home.bak ) \
        <(du -h --max-depth=1 /shares/srv/home )
617M    /shares/srv/home.bak/1h-ago     592M    /shares/srv/home/1h-ago
846M    /shares/srv/home.bak/2h-ago     300M    /shares/srv/home/2h-ago
846M    /shares/srv/home.bak/3h-ago     65M     /shares/srv/home/3h-ago
846M    /shares/srv/home.bak/4h-ago     68M     /shares/srv/home/4h-ago
845M    /shares/srv/home.bak/5h-ago     65M     /shares/srv/home/5h-ago
845M    /shares/srv/home.bak/6h-ago     68M     /shares/srv/home/6h-ago
845M    /shares/srv/home.bak/7h-ago     68M     /shares/srv/home/7h-ago
845M    /shares/srv/home.bak/8h-ago     12M     /shares/srv/home/8h-ago
845M    /shares/srv/home.bak/9h-ago     68M     /shares/srv/home/9h-ago
845M    /shares/srv/home.bak/10h-ago    68M     /shares/srv/home/10h-ago
8.1G    /shares/srv/home.bak            1.4G    /shares/srv/home
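GNU du counts a file with several hard links only once per run, which is why the first directory listed carries most of the size and the later ones look so small. To spot-check that two snapshot copies of a file really do share storage, something like the following sketch can help. The function names are mine, and stat -c is the GNU form of the command.

```shell
#!/bin/bash
# Illustrative helpers, not part of any tool mentioned in this post.

# Succeeds if both paths name the same inode, so the data is stored once.
shares_storage() {
        [ "$1" -ef "$2" ]
}

# Prints how many directory entries (hard links) point at the file.
link_count() {
        stat -c %h -- "$1"
}
```

For example, shares_storage /shares/srv/home/3h-ago/somefile /shares/srv/home/4h-ago/somefile (somefile being a placeholder for any file present in both snapshots) returns success only when the two names are links to one file, and link_count on either name reports how many snapshots share it.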

The snapshot directories may now be shared using any common network filesystem. Maintaining the snapshot directories isn't as resource-intensive as creating them. Every hour, add one hour to the directory names and create a new 1h-ago snapshot:

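cd /shares/srv/home
## Rename the oldest snapshot first so that no move lands on (or inside)
## an existing directory
for d in $( ls -d *h-ago | sort -rn )
do
        mv $d $(( ${d/h-ago/} + 1 ))h-ago
done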

rdiff-backup -r 1h \
        /shares/data/perm/backups/home.rdiff/ /shares/srv/home/1h-ago

(Note: ${var/match/substitution} is a bashism)
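To see the expansion in isolation, here is a small sketch that computes the next directory name for a snapshot. The helper name next_name is made up for this example. It also strips any leading directories first, because the arithmetic only works on a bare number, and a comment shows a POSIX expansion that gives the same result in shells other than bash.

```shell
#!/bin/bash
# Sketch: next_name is an illustrative helper, not from the original post.
next_name() {
        local name="${1##*/}"           # drop leading directories (POSIX)
        local hours="${name/h-ago/}"    # bash-only pattern substitution
        # POSIX equivalent of the line above: hours="${name%h-ago}"
        echo "$(( hours + 1 ))h-ago"
}
```

Calling next_name /shares/srv/home/9h-ago prints 10h-ago.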

The linking (file space saving) script used earlier doesn't need to be changed, and can be run at any time and with any frequency.

In Conclusion
I didn't know where Chris was going when I started replying to his series of posts, but I've had fun tagging along. I hope we haven't confused you too much. There are several tools that will do snapshots better than either svn or rdiff-backup, and as Chris noted, snapshots are increasingly supported in free software filesystems. Yet svn and rdiff-backup, as non-snapshot-specific programs, each provide many features that generic snapshot programs don't, at the cost of some extra work.

If you're still interested in rdiff-backup, I do suggest you attend my friend Daniel Zuckerman's talk at the Linux User's Group in Princeton (LUG/IP) on 11 July 2007.

rdiff-backup: Part 1, Part 2, Part 3
Subversion: Part 1, Part 2, Part 3