File Deduplication for Single Instance Storage

Information Technology news of the past two years has been full of “data deduplication,” also called “single instance storage.” I mentioned this in “Remote Backup for Data Protection”, along with some of the offsite backup services that use it. Mainframe systems replace all but one instance of an actual file with pointers to that one copy. To be meaningful, dupes must be defined by actual content, not merely by file name and time stamp. Some backup companies say that 80 percent savings of storage space are typical with dedupe, which helps on both costs and retrieval times.
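As a rough illustration of the pointer idea, here is a minimal Python sketch of a single instance store. Everything in it is hypothetical: the class name `SingleInstanceStore`, the in-memory dictionaries, and the choice of a SHA-256 digest as the content key are my assumptions, not a description of how any mainframe or backup product actually works.

```python
import hashlib


class SingleInstanceStore:
    """Toy in-memory store (hypothetical): one copy per distinct content."""

    def __init__(self):
        self.blobs = {}     # content digest -> bytes (the single stored instance)
        self.pointers = {}  # file name -> content digest (the "pointer")

    def put(self, name, data):
        # Identity is decided by content, not by file name or time stamp.
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # keep the bytes only the first time
        self.pointers[name] = digest

    def get(self, name):
        # Follow the pointer back to the one stored copy.
        return self.blobs[self.pointers[name]]

    def stored_bytes(self):
        # Space actually used, counting each distinct content once.
        return sum(len(blob) for blob in self.blobs.values())
```

Storing the same content under two names adds a second pointer but no second copy, which is where the claimed space savings would come from.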

[I do wonder what provisions are made for a disk defect in that single instance copy. I would have two or three versions of the overall backup.]

Personal computer users are now offered tools to help achieve similar results. Not all offer grand catalogs with pointers, but they do help users identify true duplicates before deleting anything. The best use true byte-by-byte comparisons. Less reliable systems compare calculated file signatures, such as a cyclic redundancy check (CRC), which can in rare cases match for files whose contents differ.
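To make the two approaches concrete, here is a small Python sketch using only the standard library. The helper names `crc32_of_file` and `same_bytes` are mine, chosen for illustration; they are not taken from any of the products discussed here.

```python
import filecmp
import zlib


def crc32_of_file(path, chunk_size=1 << 20):
    """Return the CRC32 signature of a file, read in chunks to limit memory use."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)  # fold each chunk into the running CRC
    return crc


def same_bytes(path_a, path_b):
    """Return True only if the two files have identical contents."""
    # shallow=False makes filecmp compare file contents, not just os.stat() data.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Equal CRC32 values only suggest that two files match, since a 32-bit signature can collide. A `True` result from `same_bytes` means the contents really are identical.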

My first experience with such programs was with the no longer offered Dup Detector by www.prismaticsoftware.com. It examined several common image file formats in a directory, and its tolerance could be adjusted to find similar views of the same subject. Persons doing a lot of photo editing, or image downloading, could find this tool to be valuable, as did I. A setting of 90 percent identical found related edit versions.

Here are some currently available tools for use under MS Windows, all with trial versions. None can do the 90 percent identical search.

Dupehunter Professional claims a lot of use in commercial and government environments, so it is oriented more toward command line execution than fancy GUIs. I can’t accept their stated exclusion, “One example is ZIP compression, which is not compatible with Dupehunter’s high standards,” and neither should you. Frankly, that reads as what we technical folks call “BS”, and I don’t mean Boy Scouts.

Duplicate Finder v3.4 (DF) lets one choose between CRC32 and byte-by-byte comparisons, or the faster file names mode, which I consider to be obsolete. CRC32 is probably faster than byte mode, and could be used as a first pass before byte mode. Unless time is really an issue, I would probably always use byte-by-byte comparisons. The user interface is easy, and the choices seem to cover all possibilities. For example, one can select drives and folders to scan or ignore, choose how to handle hidden items and Windows folders, and select by wildcards and extensions. DF deliberately does NOT auto delete anything, insisting that the user make the final decision. For a system that does not retain pointers to all original file locations, this seems an excellent policy.
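The idea of running CRC32 before byte mode can be sketched in Python as a three-stage filter: group files by size, then by CRC32, then confirm each candidate group byte by byte. This is only my illustration of the general technique, not DF's actual algorithm. The function names are hypothetical, and like DF the sketch only reports duplicates and never deletes anything.

```python
import filecmp
import os
import zlib
from collections import defaultdict


def file_crc32(path, chunk_size=1 << 20):
    """CRC32 of a file, computed in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def find_duplicates(root):
    """Return a list of groups (sorted path lists) of byte-identical files."""
    # Stage 1: files of different sizes cannot be duplicates.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Stage 2: cheap CRC32 signature narrows the candidates further.
        by_crc = defaultdict(list)
        for path in paths:
            by_crc[file_crc32(path)].append(path)
        # Stage 3: confirm byte by byte, since CRC32 values can collide.
        for remaining in by_crc.values():
            while len(remaining) > 1:
                first, rest = remaining[0], remaining[1:]
                same = [first] + [
                    p for p in rest if filecmp.cmp(first, p, shallow=False)
                ]
                if len(same) > 1:
                    groups.append(sorted(same))
                remaining = [p for p in rest if p not in same]
    return groups
```

Calling `find_duplicates` on a folder returns the groups for the user to review, leaving the final decision about deletion to a person.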

NoClone 4.x (NC) also offers both filename and exact byte comparisons, and allows defining master folders to compare against and paths to exclude. Users can choose file types to be “all, image, movie, music” and wildcards, such as “*.jpg”. NC says it is “Accurate, no false duplicates.” NC distinguishes between files below or above 1 MB in size in the setup, but DF does not bother. Defining master folders can be useful.

I have worked with both DF and NC, and think both are worth serious consideration. They don’t work the same as those mainframe solutions, but they also don’t cost the same. Both are under $35 for home editions. I have not tried to benchmark relative speeds. After all, it can take days to compare a million files.

Copyright 2008 by Donald A. Miller, PhD / SoftWareProgs.com,
See “S/W Store” and “Specials, Limited” for good deals on software.
