David A. Harding
Friday, 18 May 2007
[This blog is offered as an alternative to Chris Ingram's blog of almost the same name. If I understand him correctly, Chris describes using the Subversion (svn) Revision Control System (RCS) to maintain filesystem snapshots. I use a different tool, rdiff-backup, to do the same thing. I'll attempt to describe how to use rdiff-backup to do the same thing Chris does with svn.]
I once wrote a script that converted
FLAC
encoded files to
Ogg Vorbis
encoded files. That was before the command oggenc
did the job itself. So a few weeks ago, while cleaning out my $HOME/bin,
I deleted $HOME/bin/flac2ogg. If I have the
rdiff-backup tool installed
on the local computer and access to the filesystem with the backup data,
I can use rdiff-backup to find the revision I
want:
rdiff-backup -l /shares/data/perm/backups/home.rdiff/bin/flac2ogg
Found 1 increments:
flac2ogg.2007-05-03T16:10:01-04:00.snapshot.gz Thu May 3 16:10:01 2007
Restoring using rdiff-backup is easy:
rdiff-backup -r 2007-05-03T16:10:01-04:00 \
/shares/data/perm/backups/home.rdiff/bin/flac2ogg f2o.restore
The filename f2o.restore is arbitrary; this
file will be created (or overwritten) with the contents of the second
argument (.../bin/flac2ogg) as they were
at the time of the first argument (2007-05-03T16:10:01-04:00).
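The timestamp passed to -r is the one embedded in the increment's file name, so it can be cut out of the listing instead of being retyped. Here is a minimal sketch, assuming increment names follow the name.TIMESTAMP.type[.gz] pattern seen in the listing above; increment_time is a hypothetical helper of my own naming, not part of rdiff-backup:

```shell
# Hypothetical helper: print the timestamp embedded in an increment file name.
# Assumes names of the form  name.TIMESTAMP.type  or  name.TIMESTAMP.type.gz
increment_time() {
    printf '%s\n' "$1" | sed -n \
        's/^.*\.\([0-9]\{4\}-[0-9][0-9]-[0-9][0-9]T[0-9:]\{8\}[-+Z][0-9:]*\)\.[a-z]*\(\.gz\)\{0,1\}$/\1/p'
}

# Prints nothing when the name does not look like an increment.
increment_time flac2ogg.2007-05-03T16:10:01-04:00.snapshot.gz
```

The printed value can then be handed to rdiff-backup -r in place of the hand-typed timestamp.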
Reverting a file that is changed, but not deleted, works in the same way. For example, I can compare the file I'm writing this blog in with yesterday's version:
rdiff-backup -r 1D \
/shares/data/perm/backups/home.rdiff/doc/mine/blogs/newblogs \
yesterdays-blog.txt
diff -u yesterdays-blog.txt doc/mine/blogs/newblogs | tail -n3
+Reverting a file that is changed, but not deleted, works in the same
+way. For example, I can compare the file I'm writing this blog in with
+yesterdays version:
The time specifier, 1D, means 1 day ago. Everything else is similar to the previous example.
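As far as I understand the rdiff-backup man page, an interval is a number followed by one of the letters s, m, h, D, W, M or Y, with a month counted as 30 days and a year as 365 days. The sketch below turns a single number-and-letter pair into seconds to make the units concrete. interval_seconds is a hypothetical helper, and it does not handle the chained intervals that rdiff-backup itself accepts:

```shell
# Hypothetical helper: convert one interval such as 1D or 10h into seconds.
interval_seconds() {
    n=${1%?}            # everything except the last character
    unit=${1#"$n"}      # the last character
    case $n in
        ''|*[!0-9]*) echo "not a number: $n" >&2; return 1 ;;
    esac
    case $unit in
        s) mult=1 ;;
        m) mult=60 ;;
        h) mult=3600 ;;
        D) mult=86400 ;;
        W) mult=604800 ;;
        M) mult=2592000 ;;     # 30 days
        Y) mult=31536000 ;;    # 365 days
        *) echo "unknown unit: $unit" >&2; return 1 ;;
    esac
    echo $(( n * mult ))
}

interval_seconds 1D
```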
Restoring a directory also works the same way. I can create a snapshot of my home directory as it was one hour ago with the following command:
rdiff-backup -r 1h \
/shares/data/perm/backups/home.rdiff/ /shares/srv/home/1h-ago
The time specifier, 1h, means 1 hour ago. I put the snapshot in
the /shares/srv/home/1h-ago directory to share
it over the network filesystem (see my
previous blog).
If we run the command above, with some modifications, in the loop below,
we can create a snapshot for each of the last 10 hours.
for i in `seq 1 10`
do
  rdiff-backup -r ${i}h \
    /shares/data/perm/backups/home.rdiff/ /shares/srv/home/${i}h-ago
done
Each snapshot created using this command will use the same amount of disk space the original directory used when the temporally nearest backup was made. In this case, ten copies of my home directory take up 8.1GiB of disk space.
We can trim this down significantly by using filesystem (hard) links for
files with the same contents. A script that does that for all the
files beneath /shares/srv/home is below:
#!/bin/bash -eu

MD5LIST=md5list

## Create a md5sum to filename pair for every file in the
## top-level snapshot directory
find /shares/srv/home/ -type f -exec md5sum '{}' \; \
  > $MD5LIST

## Find all the files with matching md5sums
#
## d41d8cd98f00b204e9800998ecf8427e is the md5 of an empty file --
## we don't need those
grep -v ^d41d8cd98f00b204e9800998ecf8427e $MD5LIST | sort | uniq -d -w32 \
| sed -e 's/ .*//' | while read md5
do
  ## link files with matching md5sums together
  oldfile=''
  for file in $( grep $md5 $MD5LIST | sed -e 's/.\{34\}//' )
  do
    ln -f ${oldfile:-$file} $file || true
    oldfile=$file
  done
done
Two serious flaws in the script above are that it doesn't handle whitespace (spaces, tabs, and newlines) in filenames, and it also doesn't check, and therefore may clobber, file permissions. Both flaws are potential security problems, but can be worked around with the addition of a few tests. Another, less serious, flaw is that the code above replaces some symlinks with hardlinks.
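One way the two serious flaws might be worked around is sketched below. It is an illustration and not a drop-in replacement: link_duplicates is a hypothetical function name, and the code assumes bash 4 or later (for associative arrays) and GNU find, stat and md5sum. File names are passed NUL-separated so whitespace is harmless, and two files are linked only when their contents, permission bits, owner and group all match. Timestamps are not compared, so linked files end up sharing one modification time.

```shell
#!/bin/bash
# Hypothetical sketch: hard-link duplicate regular files beneath a directory.
link_duplicates() {
    local dir=$1 key file
    declare -A first_seen      # key -> first file seen with that key

    while IFS= read -r -d '' file
    do
        # The key joins the content hash with mode, owner and group, so
        # files that differ in permissions are never merged.
        key="$(md5sum < "$file" | cut -d' ' -f1):$(stat -c '%a:%u:%g' "$file")"

        if [[ -n ${first_seen[$key]:-} ]]
        then
            # Skip names that already share an inode; cmp guards against
            # two different files that happen to have the same md5sum.
            if ! [[ $file -ef ${first_seen[$key]} ]] &&
               cmp -s -- "${first_seen[$key]}" "$file"
            then
                ln -f -- "${first_seen[$key]}" "$file"
            fi
        else
            first_seen[$key]=$file
        fi
    done < <(find "$dir" -type f ! -empty -print0)
}
```

Calling link_duplicates /shares/srv/home would cover the same tree as the script above. Because find is restricted to regular files, symlinks are left alone.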
As I suggested earlier, the space savings are significant; a side-by-side comparison of directory sizes follows:
paste <(du -h --max-depth=1 /shares/srv/home.bak ) \
<(du -h --max-depth=1 /shares/srv/home )
617M /shares/srv/home.bak/1h-ago 592M /shares/srv/home/1h-ago
846M /shares/srv/home.bak/2h-ago 300M /shares/srv/home/2h-ago
846M /shares/srv/home.bak/3h-ago 65M /shares/srv/home/3h-ago
846M /shares/srv/home.bak/4h-ago 68M /shares/srv/home/4h-ago
845M /shares/srv/home.bak/5h-ago 65M /shares/srv/home/5h-ago
845M /shares/srv/home.bak/6h-ago 68M /shares/srv/home/6h-ago
845M /shares/srv/home.bak/7h-ago 68M /shares/srv/home/7h-ago
845M /shares/srv/home.bak/8h-ago 12M /shares/srv/home/8h-ago
845M /shares/srv/home.bak/9h-ago 68M /shares/srv/home/9h-ago
845M /shares/srv/home.bak/10h-ago 68M /shares/srv/home/10h-ago
8.1G /shares/srv/home.bak 1.4G /shares/srv/home
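The uneven numbers in the right-hand column are probably an accounting effect: within one run, du counts a file with several hard links only once and charges it to the first directory in which it meets the file. The small demonstration below uses a throwaway directory (the names are illustrative) and assumes GNU stat:

```shell
demo=$(mktemp -d)                      # scratch area, removed by hand afterwards
mkdir "$demo/1h-ago" "$demo/2h-ago"

# 1 MiB of random data, so the file really occupies disk blocks
dd if=/dev/urandom of="$demo/1h-ago/big" bs=1024 count=1024 2>/dev/null
ln "$demo/1h-ago/big" "$demo/2h-ago/big"

stat -c '%h links: %n' "$demo"/*/big   # both names report 2 links
du -sk "$demo/1h-ago" "$demo/2h-ago"   # the second directory is charged almost nothing
```

Both directories hold the complete file; only the accounting differs, which is why the grand total is the more meaningful figure.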
The snapshot directories may now be shared using any common network filesystem. Maintaining the snapshot directories isn't as resource-intensive as creating them. Every hour, add one hour to the directory names and create a new 1h-ago snapshot:
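## Rename the oldest snapshot first so nothing is moved into a directory
## that still exists; bare names keep the arithmetic to the hour count.
cd /shares/srv/home
for d in $( ls -d *h-ago | sort -rn )
do
  mv $d $(( ${d/h-ago/} + 1 ))h-ago
done
rdiff-backup -r 1h \
  /shares/data/perm/backups/home.rdiff/ /shares/srv/home/1h-ago

(Note: ${var/match/substitution} is a bashism)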
The linking (file space saving) script used earlier doesn't need to be changed, and can be run at any time and with any frequency.
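The hourly rotation can also be wrapped in a function that renames the oldest snapshot first and discards anything older than a fixed limit. This is a sketch under assumptions of my own: rotate_snapshots and its two arguments are hypothetical, and deleting old snapshots is a policy the steps above do not include.

```shell
#!/bin/bash
# Hypothetical helper: shift every "<N>h-ago" directory in $1 to "<N+1>h-ago",
# deleting snapshots whose age would exceed $2 hours (assumed policy).
rotate_snapshots() {
    local dir=$1 max=$2 n

    # Oldest first, so a rename never lands on a directory that still exists.
    for n in $(cd "$dir" && ls -d -- *h-ago 2>/dev/null |
               sed -n 's/^\([0-9][0-9]*\)h-ago$/\1/p' | sort -rn)
    do
        if (( n >= max ))
        then
            rm -rf -- "$dir/${n}h-ago"
        else
            mv -- "$dir/${n}h-ago" "$dir/$(( n + 1 ))h-ago"
        fi
    done
}
```

After rotate_snapshots /shares/srv/home 10, the 1h-ago name is free again, so the rdiff-backup -r 1h restore can recreate it.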
In Conclusion
I didn't know where Chris was going when I started replying to his
series of posts, but I've had fun tagging along. I hope we haven't
confused you too much. There are several tools that will do snapshots
better than either svn or rdiff-backup, and as Chris noted, snapshots
are increasingly supported in free software filesystems. Yet
svn and rdiff-backup,
as non-snapshot-specific programs, each provide many features that generic
snapshot programs don't, at the cost of some extra work.
If you're still interested in rdiff-backup, I do
suggest you attend my friend
Daniel Zuckerman's
talk at the Linux User's Group in Princeton
(LUG/IP) on
11 July 2007.