David A. Harding
Sunday, 19 Mar 2006
Yesterday, Greg Haase and I discussed burning copies of the LUG/IP GNU/Linux with my shell script, hell. Without thinking, I said that sum(1) is quicker at generating checksums than cryptographic checksum commands like md5sum. It wasn't until I suggested running a test to compare the running times of the two commands that I realized they were probably both I/O-bound.
Greg agreed with my reasoning, but said he was still curious to see the result if I wanted to test it. It isn't often that Greg and I agree on anything, and sometimes we both go to great lengths to prove our points, so it's nice to be in the same boat for once. I commenced testing.
Abstract
On a modern IA32 computer under minimal load, the commands sum,
cksum, md5sum, and sha1sum all require the same amount of
effective real time to checksum a file.
Method
I,
A script,
The commands were run in round-robin order (sum, cksum, md5sum, sha1sum, sum, cksum, and so on) until each command had run 10 times.
The checksums and the effective real time were logged to file.
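A minimal sketch of how such a round-robin loop might look. This is a hypothetical reconstruction, not the script used for the test: the function names run_rounds, flush_cache, and now are made up for illustration, and it times each run with GNU date (date +%s.%N) plus awk rather than time -f %e, so that it needs nothing beyond coreutils and awk. The log file names follow the ones that appear in the Results section.

```shell
#!/bin/sh
# Hypothetical reconstruction of the test loop; not the original script.

# flush_cache DEVICE
# Assumption: ejecting and reloading the disc discards cached data.
# A real run would also need to wait for the drive to become ready.
flush_cache() {
    eject "$1" && eject -t "$1"
}

# now: current time in seconds with a fractional part (GNU date).
now() {
    date +%s.%N
}

# run_rounds ROUNDS TARGET CMD...
# Runs every CMD once per round, in the order given, so the commands
# interleave: sum, cksum, md5sum, sha1sum, sum, cksum, ...
run_rounds() {
    rounds=$1
    target=$2
    shift 2
    round=1
    while [ "$round" -le "$rounds" ]; do
        for cmd in "$@"; do
            flush_cache "$target"
            start=$(now)
            result=$("$cmd" "$target")
            end=$(now)
            # One line per run in the shared checksum log.
            echo "$cmd: $result" >> checksums.txt
            # One elapsed time per run in a per-command log.
            awk -v s="$start" -v e="$end" \
                'BEGIN { printf "%.2f\n", e - s }' >> "$cmd.times"
        done
        round=$((round + 1))
    done
}
```

Under these assumptions, the full test would be started with something like run_rounds 10 /dev/dvd sum cksum md5sum sha1sum. Interleaving the commands, rather than running one command 10 times in a row, spreads any drift in drive or system behaviour evenly across all four.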
Reasoning
I used the CD drive for two reasons: it was the subject of my discussion
with Greg, and it is easy to flush (mark as dirty?) the data cache for
the CD drive by ejecting the disc. Checksumming without flushing the
data cache wouldn't be fair, because every run after the first is served
mostly from memory:
for i in `seq 1 5`
do
command time -f %e md5sum foo.dd > /dev/null
done
9.99
1.61
1.10
1.10
1.10
I waited for the system load to drop to 0.05 or lower so each command started with a clean slate.
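Waiting for the load to drop could be automated along these lines. This is only a sketch under assumptions: it reads the 1-minute load average from /proc/loadavg (the text does not say which load figure was watched), the function names load_is_low and wait_for_idle are hypothetical, and the 5-second polling interval is arbitrary.

```shell
#!/bin/sh
# load_is_low THRESHOLD [LOADAVG_FILE]
# Succeeds when the first field of LOADAVG_FILE (the 1-minute load
# average on Linux) is at or below THRESHOLD.
load_is_low() {
    awk -v max="$1" '{ exit !($1 <= max) }' "${2:-/proc/loadavg}"
}

# wait_for_idle THRESHOLD
# Polls until the load is low enough; the interval is an assumption.
wait_for_idle() {
    until load_is_low "$1"; do
        sleep 5
    done
}
```

Calling wait_for_idle 0.05 before each checksum run would then block until the system is quiet.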
I used a nice value of -1 to indicate the job wanted more than its fair share of resources. I hoped this would minimise the impact of other running commands.
Results
I checked that each command reported the same checksum on every run. The log holds 40 lines (4 commands, 10 runs each), which collapse to 4 unique lines:
wc -l checksums.txt
40 checksums.txt
sort -u checksums.txt
cksum: 3857808950 647129088 /dev/dvd
md5sum: 126751a2dc5528c2f9044d9e4ee36d61 /dev/dvd
sha1sum: 01e7e5f6142f6e5b1f1f5581aac53dc30fcc5d65 /dev/dvd
sum: 62766 631962
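The same check can be made mechanical instead of read by eye. The sketch below assumes the log format shown above, where each line starts with the command name followed by a colon; the function name checksums_consistent is hypothetical.

```shell
#!/bin/sh
# checksums_consistent LOG
# Succeeds when no command label (the text before the first colon)
# appears with more than one distinct line in LOG.
checksums_consistent() {
    dupes=$(sort -u "$1" | cut -d: -f1 | sort | uniq -d)
    [ -z "$dupes" ]
}
```

A call such as checksums_consistent checksums.txt || echo "mismatch" would flag any run whose checksum differed from the others for the same command.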
I looked for the min and max effective real time values for the complete data:
sort *.times | head -n 1
138.08
sort *.times | tail -n 1
140.92
Finally, I computed the mean effective running time for each command:
for f in *.times
do
echo -n "$f: " ; echo $( cat $f ) | \
echo "scale=2; ( $( sed 's/ / + /g' - ) ) / 10" | bc
done | sort -n -k2
md5sum.times: 139.23
cksum.times: 139.32
sum.times: 139.54
sha1sum.times: 140.20
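For comparison, the minimum, maximum, and mean can be computed in one pass with awk. This is an alternative sketch, not the method used above: it assumes each .times file holds one number per line, the function name summarize is hypothetical, and it rounds to two decimals where bc with scale=2 truncates, so the last digit may differ.

```shell
#!/bin/sh
# summarize FILE...
# Prints "name: min max mean" for each file of one-number-per-line timings.
summarize() {
    for f in "$@"; do
        awk -v name="$f" '
            {
                v = $1 + 0          # force numeric interpretation
                sum += v
                if (NR == 1 || v < min) min = v
                if (NR == 1 || v > max) max = v
            }
            END {
                if (NR) printf "%s: %.2f %.2f %.2f\n", name, min, max, sum / NR
            }
        ' "$f"
    done
}
```

Running summarize *.times | sort -n -k4 would list the commands ordered by mean time, and dividing by the line count rather than a fixed 10 keeps the result correct if a run is missing.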
Resources
The code and the test results are available as a
gzip'd tarball.
The CD used for testing was an Ubuntu Breezy Badger install CD. I chose
it because, if something horrible happened during 40 full-disc reads,
I wouldn't miss it.