
Check. Sum. Mate.

David A. Harding

Yesterday, Greg Haase and I discussed burning copies of the LUG/IP GNU/Linux with my shell script, hell. Without thinking, I said that sum(1) is quicker at generating checksums than cryptographic checksum commands like md5sum. It wasn't until I suggested running a test to compare the running times of the two commands that I realized they were probably both I/O-bound.

Greg agreed with my reasoning, but he said he was still curious if I wanted to test it. It isn't often that Greg and I agree on anything, and sometimes we both go to great lengths to prove our points—it's nice to be in the same boat for once. I commenced testing.

Abstract
On a modern IA32 computer under minimal load, the commands sum, cksum, md5sum, and sha1sum all require the same amount of effective real time to checksum a file.

Method
I,

A script,

The commands were run in order: sum, cksum, md5sum, sha1sum, sum, cksum, etc... Each command was run 10 times.

The checksums and the effective real time were logged to file.
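
The round-robin loop described above could be sketched roughly as follows. This is an illustration, not the script used for the test: the function name run_rounds and its arguments are hypothetical, the timing uses GNU date and awk instead of the time command, and the cache flushing and load checks are left out.

```shell
#!/bin/sh
# Illustrative sketch only, not the original test script.
# run_rounds TARGET ROUNDS (hypothetical helper): checksum TARGET with
# each command in turn, ROUNDS times over. Checksums are appended to
# checksums.txt and elapsed seconds to <command>.times.
run_rounds() {
    target=$1
    rounds=$2
    i=0
    while [ "$i" -lt "$rounds" ]
    do
        for cmd in sum cksum md5sum sha1sum
        do
            # Assumption: GNU date, which supports %N (nanoseconds).
            start=$(date +%s.%N)
            result=$("$cmd" "$target")
            end=$(date +%s.%N)
            echo "$cmd: $result" >> checksums.txt
            awk -v s="$start" -v e="$end" \
                'BEGIN { printf "%.2f\n", e - s }' >> "$cmd.times"
        done
        i=$((i + 1))
    done
}

# Example: run_rounds /dev/dvd 10
```

Interleaving the commands, instead of running one command 10 times before moving on to the next, spreads any slow drift in system conditions across all four commands.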

Reasoning
I used the CD drive for two reasons. It was the subject of my discussion with Greg, and it is easy to flush (invalidate) the data cache for the CD drive by ejecting the disc. Checksumming without flushing the data cache wouldn't be fair, because every run after the first is served largely from memory:

        for i in `seq 1 5` 
        do 
                command time -f %e md5sum foo.dd > /dev/null
        done
                9.99
                1.61
                1.10
                1.10
                1.10

I waited for the system load to drop to 0.05 or lower so each command started with a clean slate.
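
One way to automate that wait is sketched below. It assumes a Linux system where /proc/loadavg exists and that the 1-minute load average is the figure of interest; the function names and the optional file argument are hypothetical, added only to make the check easy to try out.

```shell
#!/bin/sh
# Illustrative sketch, assuming Linux's /proc/loadavg.
# load_at_most MAX [FILE] (hypothetical helper): succeed when the
# 1-minute load average (first field) is at or below MAX.
load_at_most() {
    awk -v max="$1" '{ exit !($1 + 0 <= max + 0) }' "${2:-/proc/loadavg}"
}

# wait_for_idle MAX: poll every 5 seconds until the load is low enough.
wait_for_idle() {
    until load_at_most "$1"
    do
        sleep 5
    done
}

# Example: wait_for_idle 0.05
```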

I used a nice value of -1 to indicate the job wanted more than its fair share of resources. I hoped this would minimise the impact of other running commands.

Results
I ensured each time the commands ran, they reported the same checksums:

        wc -l checksums.txt 
                40 checksums.txt

        sort -u checksums.txt 
                cksum: 3857808950 647129088 /dev/dvd
                md5sum: 126751a2dc5528c2f9044d9e4ee36d61  /dev/dvd
                sha1sum: 01e7e5f6142f6e5b1f1f5581aac53dc30fcc5d65  /dev/dvd
                sum: 62766 631962
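
The same check can be made mechanical instead of being read off by eye. The sketch below assumes each line of the log starts with the command name followed by a colon, as in the output above; the function name is hypothetical.

```shell
#!/bin/sh
# Illustrative sketch. inconsistent_commands FILE (hypothetical helper):
# print the name of every command that logged more than one distinct
# checksum line. No output means every run agreed.
inconsistent_commands() {
    sort -u "$1" | cut -d: -f1 | uniq -d
}

# Example: inconsistent_commands checksums.txt
```

After sort -u, a command that always reported the same checksum contributes exactly one line, so any name that uniq -d finds repeated must have produced differing results.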

I looked for the min and max effective real time values for the complete data:

        sort -n *.times | head -n 1
                138.08
        sort -n *.times | tail -n 1
                140.92

Finally, I computed the mean effective running time for each command:

        for f in *.times
        do 
                echo -n "$f: " ; echo $( cat $f ) | \
                echo "scale=2; ( $( sed 's/ / + /g' - ) ) / 10" | bc
        done | sort -n -k2 

                md5sum.times: 139.23
                cksum.times: 139.32
                sum.times: 139.54
                sha1sum.times: 140.20
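
For comparison, the same mean can be computed with awk alone. This is an alternative sketch, not the method used above: the function name is hypothetical, it assumes one time per line in each file, and printf rounds to two decimal places where bc with scale=2 truncates, so the last digit may differ.

```shell
#!/bin/sh
# Illustrative alternative using awk in place of sed and bc.
# mean_of FILE (hypothetical helper): print the arithmetic mean of the
# numbers in FILE, one number per line, to two decimal places.
mean_of() {
    awk '{ total += $1; n++ } END { if (n) printf "%.2f\n", total / n }' "$1"
}

# Example:
#   for f in *.times
#   do
#           printf '%s: ' "$f" ; mean_of "$f"
#   done | sort -n -k2
```

Dividing by the number of lines read, instead of a fixed 10, keeps the result correct if a run is added or lost.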

Resources
The code and the test results are available as a gzip'd tarball. The CD used for testing was an Ubuntu Breezy Badger install CD. I chose it because, if something horrible happened in 40 full-disk reads, I wouldn't miss it.