Simply put, data de-duplication is the process of eliminating redundant bits in a storage system. But the market is still very much in its growth stage, and the multitude of approaches taken by different vendors and their products can make investigating data de-duplication anything but simple.
Among the vendors there are two essential categories: those that perform data de-duplication "in-line" and those that perform it "post-process." In-line data de-duplication is performed as data flows into the secondary storage system; post-process de-duplication is performed once data is already stored.
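The difference between the two categories can be illustrated with a minimal sketch. It is not any vendor's implementation: the class names, the fixed four-byte chunks and the use of SHA-1 as the chunk fingerprint are all assumptions chosen for brevity. The in-line store fingerprints each chunk before writing it, while the post-process store lands the raw backup first and de-duplicates afterwards, so its peak disk usage is the full size of the backup.

```python
import hashlib


def chunk(data: bytes, size: int = 4):
    """Split data into fixed-size chunks (a simplifying assumption)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedupe_into(chunks_by_digest: dict, data: bytes):
    """Store unseen chunks and return the list of digests that rebuilds data."""
    recipe = []
    for piece in chunk(data):
        digest = hashlib.sha1(piece).hexdigest()
        chunks_by_digest.setdefault(digest, piece)
        recipe.append(digest)
    return recipe


class InlineStore:
    """Hypothetical target that de-duplicates as data flows in."""

    def __init__(self):
        self.chunks = {}
        self.peak_bytes = 0

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())

    def ingest(self, data: bytes):
        recipe = dedupe_into(self.chunks, data)
        self.peak_bytes = max(self.peak_bytes, self.stored_bytes())
        return recipe


class PostProcessStore:
    """Hypothetical target that stores raw data, then de-duplicates later."""

    def __init__(self):
        self.landing = []   # raw backups waiting to be processed
        self.chunks = {}
        self.peak_bytes = 0

    def stored_bytes(self):
        raw = sum(len(d) for d in self.landing)
        return raw + sum(len(c) for c in self.chunks.values())

    def ingest(self, data: bytes):
        self.landing.append(data)
        self.peak_bytes = max(self.peak_bytes, self.stored_bytes())

    def deduplicate(self):
        recipes = [dedupe_into(self.chunks, data) for data in self.landing]
        self.landing.clear()
        return recipes


backup = b"AAAABBBBAAAAAAAA"   # 16 bytes, only two distinct chunks

inline = InlineStore()
inline_recipe = inline.ingest(backup)

post = PostProcessStore()
post.ingest(backup)
post_recipes = post.deduplicate()

print(inline.peak_bytes, post.peak_bytes, post.stored_bytes())
```

In this toy model both stores end up holding the same two unique chunks; what differs is when the hashing work happens and how much disk is occupied in the meantime.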
The advantage of in-line de-duplication is that the process is performed only once. Some in-line vendors argue that, at high enough capacities, post-process de-duplication can exceed backup windows. However, the advantage of post-process de-duplication is that there are no worries about the CPU-intensive de-duplication process creating a bottleneck between the backup server and the secondary storage target.
In both cases, experts warn that users shouldn't be too cavalier with disk purchases, especially not in the beginning. "A common misunderstanding is that users will hear that they only need, say, a terabyte to store 10 terabytes (TB) of backups," said W. Curtis Preston, vice president of data protection services at GlassHouse Technologies. "Then they go out and buy a terabyte of disk, only to realise that by definition they need 10 TB for the initial backup," since it's only after that initial backup that bit-level comparisons can be made.
The vendors
Beyond the in-line vs. post-process debate, there's no shortage of differences -- and further debates to be had -- among different vendors and their approaches to de-duplication.
Data Domain has been shipping product longest and has the largest install base at 250 customers. Its appliances, which can be accessed through either a virtual tape library (VTL) or network-attached storage (NAS) interface, range from the branch office-sized DD410 model to the multipetabyte DDX series array. Data Domain performs in-line de-duplication and uses the SHA-1 algorithm and a proprietary algorithm as a secondary check. It keeps the comparison index cached in nonvolatile RAM. With Data Domain, an individual data stream is limited to 110 megabytes per second (MBps). The company says it's working on moving to a clustered architecture to aggregate performance, which should be out next year.
Diligent Technologies offers data de-duplication within its ProtecTier VTL product, which is also resold by Hitachi Data Systems (HDS). Diligent performs in-line de-duplication by keeping the comparison index in cache on Fibre Channel disk, which it claims makes the process go faster, but could also get expensive. Also in contrast to Data Domain, Diligent uses a proprietary hashing algorithm throughout its de-duplication process. Diligent claims better performance numbers than Data Domain, at 400 MBps throughput. Diligent and Data Domain largely target different market segments -- Diligent at the high end and Data Domain in the midrange. Diligent claims 150 customers.
Avamar, founded in 1999, was picked up by EMC last year for US$165 million. It was the first data de-duplication company to be acquired by a major vendor. Avamar also performs data de-duplication in-band using SHA-1, but does so at the source (the backup server), rather than at the backup target. It uses a central management node to keep track of data for comparison over the whole environment, but does the de-duplication in small chunks at each server before it's sent over the network to the backup target. As such, Avamar's de-duplication can also reduce network congestion in addition to reducing data at the secondary storage target. Avamar's de-duplication product requires the replacement of the backup environment. EMC has stated plans to incorporate it into its Legato portfolio and its VTL by next year.
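A rough sketch shows why de-duplicating at the source also saves network bandwidth. Everything here is hypothetical: `CentralIndex`, `backup_from_source` and the four-byte chunk size are illustrative names and values rather than Avamar's design, and the small cost of sending the digests themselves is ignored. The idea is that each server fingerprints its chunks locally, asks a central index which fingerprints are new, and sends only those chunks.

```python
import hashlib


class CentralIndex:
    """Hypothetical stand-in for a central node that tracks known chunks."""

    def __init__(self):
        self.store = {}   # digest -> chunk bytes

    def missing(self, digests):
        """Return the digests the index has not seen yet."""
        return {d for d in digests if d not in self.store}

    def upload(self, chunks_by_digest):
        self.store.update(chunks_by_digest)


def backup_from_source(data: bytes, index: CentralIndex, chunk_size: int = 4):
    """Send only unseen chunks; return the number of chunk bytes sent."""
    pieces = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    by_digest = {hashlib.sha1(p).hexdigest(): p for p in pieces}
    needed = index.missing(by_digest.keys())
    payload = {d: by_digest[d] for d in needed}
    index.upload(payload)
    return sum(len(p) for p in payload.values())


index = CentralIndex()
sent_first = backup_from_source(b"AAAABBBBCCCC", index)    # nothing known yet
sent_second = backup_from_source(b"AAAABBBBDDDD", index)   # shares two chunks
print(sent_first, sent_second)
```

The second server's backup shares two of its three chunks with data the index already holds, so only the one new chunk crosses the network.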
ExaGrid Systems' post-process data de-duplication comes as part of its NAS backup appliance. Unlike other data de-duplication products, ExaGrid does comparisons at the byte level rather than the bit level, claiming this makes for simpler hash tables and better scalability, and leaves less room for bit-level fragmentation errors. ExaGrid's product is also "content aware," which means it understands the common data patterns in major backup software products and can find duplicates accordingly.
FalconStor Software's Single-Instance Repository (SIR) feature on its VTL and IPStor product lines has yet to make a full-fledged appearance on the market. The post-process product uses the IPStor virtualisation engine and the SHA-1 algorithm (with a secondary check using the MD5 algorithm) to create a separate de-duplicated repository for long-term archive data after it is backed up to the VTL. IBM and Sun Microsystems both OEM the VTL product, though IBM does not offer SIR, and Sun will not offer it until later this year.
Quantum folded IP acquired with Advanced Digital Information (ADIC) last year into the DXi3500 and DXi5500 appliances in December. The in-line VTL-based de-duplication product uses a patented algorithm belonging to ADIC subsidiary RockSoft. That de-duplication has also recently been added as a feature within Quantum's StorNext filesystem, also from the ADIC acquisition, which claims to be an all-in-one data migration and management engine.
NEC of America, a subsidiary of NEC Japan, offers data de-duplication as a feature within its HydraStor grid backup appliance, released in March. HydraStor's proprietary de-duplication technology, dubbed DataRedux, eliminates data duplication at the subfile level across and within incoming data streams. With HydraStor's grid architecture, controllers are added as capacity is added and every node is aware of every other node, easing performance and management issues sometimes associated with in-line products. NEC claims it reduces storage capacity by up to 75% without interrupting performance.
NetApp announced general availability of block-level data de-duplication within its NearStore R200 and FAS storage systems on May 15 after beta testing it in customer environments for the first quarter of this year. The data de-duplication development is based on NetApp's Advanced Single Instance Storage (A-SIS), from its SnapLock product. NetApp used a feature of its Write Anywhere File Layout (WAFL) to add A-SIS to its filers. WAFL already calculates a 16-bit checksum for each block of data it stores. For data de-duplication, these checksums are pulled into a database and "redundancy candidates" that look similar are identified. Those blocks are then compared bit by bit, and if they are identical, the new block is discarded. The license key is free for NearStore users and will de-duplicate data at the block level on primary storage, which makes it unique among data de-duplication schemes. However, NetApp has yet to add the capability to its VTL, citing performance concerns.
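The candidate-then-verify approach described for A-SIS can be sketched in a few lines. This is an illustration only: `checksum16` uses the low 16 bits of a CRC-32 as a stand-in for whatever checksum WAFL actually computes, and `deduplicate_blocks` is a hypothetical function, not NetApp code. A short checksum will collide for different blocks, which is why a full comparison is needed before a block is discarded.

```python
import zlib
from collections import defaultdict


def checksum16(block: bytes) -> int:
    """16-bit checksum; an assumed stand-in, not WAFL's real checksum."""
    return zlib.crc32(block) & 0xFFFF


def deduplicate_blocks(blocks, checksum=checksum16):
    """Return (kept, refs): unique blocks and, per input block, its index in kept."""
    kept = []
    candidates = defaultdict(list)   # checksum -> indexes into kept
    refs = []
    for block in blocks:
        key = checksum(block)
        for idx in candidates[key]:
            # Matching checksums only make a candidate; compare full contents.
            if kept[idx] == block:
                refs.append(idx)     # identical, so the new block is discarded
                break
        else:
            kept.append(block)
            candidates[key].append(len(kept) - 1)
            refs.append(len(kept) - 1)
    return kept, refs


blocks = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
kept, refs = deduplicate_blocks(blocks)
print(kept, refs)

# Force every block to collide to show the full comparison still protects data.
kept_collide, refs_collide = deduplicate_blocks(blocks, checksum=lambda b: 0)
```

Even when every block shares the same checksum, the result is unchanged, because no block is dropped on the strength of the checksum alone.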
Sepaton offers data de-duplication on its S2100-ES2 VTL through a software option called DeltaStor. The post-process de-duplication uses a proprietary "content-aware" algorithm. Sepaton's claim to fame so far in the data de-duplication world is the fact that it uses a process called forward referencing, while other products use reverse referencing. Reverse referencing creates a pointer to the original data if there are further occurrences of the original; forward referencing writes the latest version of the data and makes the previous occurrences a pointer to the most recent version. Sepaton claims this method makes restores quicker by keeping the most recent backups intact, since more recent backups are the ones most likely to be restored as a general rule.
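The distinction between the two referencing schemes is easier to see in a toy model. The sketch below is an assumption-laden simplification rather than Sepaton's DeltaStor: each backup is treated as a single item, entries are either stored data or a pointer to another entry, and the function names are hypothetical. `restore` also reports how many pointers had to be followed.

```python
from collections import defaultdict


def reverse_reference(backups):
    """The first occurrence keeps the data; later occurrences point back to it."""
    entries = []
    first_seen = {}   # content -> index of the entry holding the data
    for content in backups:
        if content in first_seen:
            entries.append(("ptr", first_seen[content]))
        else:
            first_seen[content] = len(entries)
            entries.append(("data", content))
    return entries


def forward_reference(backups):
    """The latest occurrence keeps the data; earlier ones point forward to it."""
    entries = []
    occurrences = defaultdict(list)   # content -> indexes of earlier entries
    for content in backups:
        new_index = len(entries)
        entries.append(("data", content))
        for old_index in occurrences[content]:
            entries[old_index] = ("ptr", new_index)
        occurrences[content].append(new_index)
    return entries


def restore(entries, index):
    """Return (content, hops), where hops counts the pointers followed."""
    hops = 0
    kind, value = entries[index]
    while kind == "ptr":
        kind, value = entries[value]
        hops += 1
    return value, hops


backups = [b"A", b"B", b"A"]   # the third backup repeats the first
rev = reverse_reference(backups)
fwd = forward_reference(backups)
print(restore(rev, 2), restore(fwd, 2))
```

In this model the most recent backup is read directly under forward referencing, while under reverse referencing it is resolved through a pointer to the older copy.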
Symantec's product most comparable to Avamar is PureDisk, a software feature it is currently integrating with its NetBackup software. Like Avamar, the product de-duplicates data in-line and at the source; it uses a proprietary algorithm to do so. The latest version of NetBackup, 6.2, supports PureDisk to tape targets and integrates PureDisk into the Backup Reporter backup monitoring tool. Version 6.2 also supports failover between multiple PureDisk servers. The next big release for NetBackup, version 6.5, slated for announcement in June, will offer even more integration between NetBackup and PureDisk, according to early reports.
