Storage Management

Posted Dec 17, 2009 | By: Carol Sliwa

How NetApp and EMC implement data reduction

This story continues from yesterday's piece on data reduction.

NetApp offers data deduplication as a feature of its Data Ontap operating system on its FAS and V-series systems. The company cites post-process deduplication as a major reason it's able to limit the deduplication performance penalty to 10% to 20% for average workloads. Writes are stored first, to minimize interference with application throughput. Deduplication runs later, either on a schedule (typically during off-peak hours) or automatically, based on the growth of the storage volume.

"It's always done in the background, and it's always done after the write occurs," said Larry Freeman, senior marketing manager for storage efficiency at NetApp. "If you run it more frequently, it's going to run faster because we're going to catch the duplicate blocks before there's too many of them."

Inline vs. post-process deduplication

NetApp's post-process deduplication approach contrasts with the inline, or real-time, method used by some of the popular backup dedupe products such as EMC Corp.'s Data Domain. (Some of the other backup systems use post-process deduplication.) Inline dedupe removes the duplicates as they appear and wastes little space. But Freeman claimed the performance impact on CPU resources is too high for primary storage.

"They're intercepting [the data] at the storage controller, and they have to make an immediate real-time decision: Do I store this or do I reference it?" Freeman said of inline dedupe products. "You have to compare that data object to every other object that's been stored previously. They do this with some sophisticated look-up tables and hash comparisons, but the more data is in the system, the more extensive the look-up has to be, and the slower the system becomes."
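The store-or-reference decision Freeman describes can be sketched in a few lines of Python. This is a minimal illustration, not how Data Domain or any other product implements it: the class name `InlineDedupStore`, the use of SHA-256 as the fingerprint and the in-memory dictionary index are all assumptions made for the example. The sketch also does not model how look-up cost grows as the index gets larger.

```python
import hashlib


class InlineDedupStore:
    """Toy inline deduplicating block store (hypothetical, for illustration)."""

    def __init__(self):
        self.index = {}   # fingerprint -> id of the block already stored
        self.blocks = []  # block id -> bytes actually kept

    def write(self, data: bytes) -> int:
        """Decide at write time: store the block, or reference an existing copy."""
        fingerprint = hashlib.sha256(data).digest()  # assumed hash function
        block_id = self.index.get(fingerprint)
        if block_id is not None and self.blocks[block_id] == data:
            return block_id  # duplicate: hand back a reference, store nothing
        self.blocks.append(data)  # new data: store it and index it
        block_id = len(self.blocks) - 1
        self.index[fingerprint] = block_id
        return block_id
```

Every call to `write` pays for the hash and the index look-up before the write can complete, which is the real-time cost the inline approach accepts in exchange for never storing a duplicate.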

Freeman said the vendor originally expected its dedupe to be used for backup and archiving, but customers found it especially valuable for reducing VMware virtual machine disk (VMDK) files. "We promoted that and it really just took off," he said. "There was no turning back. Deduplication became the focus of primary storage."

NetApp's post-process deduplication system uses a fingerprint catalog to identify candidates for data deduplication. Each 32-byte, algorithm-created fingerprint, which is also referred to as a digital signature or hash, references a larger 4 KB data block. When the system finds two fingerprints that match, it pulls the blocks into memory and does a byte-level validation to guard against false positives caused by hash collisions.

Multiple-block referencing technology then kicks in. Each of the data blocks has a pointer going to it. If two blocks validate as identical, the system moves one of the data pointers to point to the same block as the first pointer and releases the duplicate block back to the free pool on the storage system.
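The sequence described above (fingerprint catalog, byte-level validation, pointer redirection, release to the free pool) can be sketched roughly as follows. This is an illustrative model only, not NetApp's code: the `Volume` class and its fields are hypothetical, and SHA-256 is used simply because it yields a 32-byte digest; the article does not name the fingerprint algorithm. Overwrites of an existing address are not handled.

```python
import hashlib

BLOCK_SIZE = 4096  # 4 KB blocks, as in the article


class Volume:
    """Toy volume with per-address pointers to data blocks (hypothetical)."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes
        self.pointers = {}   # logical address -> block id
        self.free_pool = []  # block ids released by deduplication
        self._next_id = 0

    def write(self, addr: int, data: bytes) -> None:
        """Store the block as-is; no deduplication happens on the write path."""
        assert len(data) == BLOCK_SIZE
        self.blocks[self._next_id] = data
        self.pointers[addr] = self._next_id
        self._next_id += 1

    def read(self, addr: int) -> bytes:
        return self.blocks[self.pointers[addr]]

    def deduplicate(self) -> int:
        """Post-process pass; returns the number of blocks freed."""
        catalog = {}  # fingerprint -> first block id seen with that fingerprint
        remap = {}    # duplicate block id -> block id to keep
        for block_id in sorted(self.blocks):
            data = self.blocks[block_id]
            fingerprint = hashlib.sha256(data).digest()  # 32 bytes (assumed algorithm)
            keeper = catalog.setdefault(fingerprint, block_id)
            # Matching fingerprints are only a hint: compare the bytes too.
            if keeper != block_id and self.blocks[keeper] == data:
                remap[block_id] = keeper
        # Redirect every pointer that referenced a duplicate block.
        for addr, block_id in list(self.pointers.items()):
            if block_id in remap:
                self.pointers[addr] = remap[block_id]
        # Release the now-unreferenced duplicates to the free pool.
        for block_id in remap:
            del self.blocks[block_id]
            self.free_pool.append(block_id)
        return len(remap)
```

After a pass, two addresses that held identical data read back the same bytes but share one stored block, and the duplicate's id sits in `free_pool` for reuse.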

Freeman said NetApp's Data Ontap operating system is especially conducive to data deduplication because it includes a file system with data pointers to facilitate the multiple-block referencing. "All we needed to do to add deduplication was create a catalog of fingerprints to identify duplicate data," he said.

NetApp deduplicates any raw data on the system, whether storage-area network (SAN) or network-attached storage (NAS). The system supports deduplication on a per-volume basis, with a volume limit of 16 TB. Future plans include addressing customer requests for increased volume sizes as well as deduplication across volumes.

Space savings average out at 30% across all storage tiers, performance workloads and applications, according to Freeman. He said the company doesn't break down the storage savings by tier. But with its leading use case, VMware Inc. VMDK files, space savings are in the range of 70%, he said.

The American Association of Airport Executives claimed initial space savings of approximately 30% on 1 TB of CIFS-based shared drives and 22% on 600 GB of NFS-based data using deduplication with the NetApp FAS 3140 it rolled out in February.
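For a sense of what those two figures add up to, the short calculation below combines them. It is a back-of-the-envelope sketch that assumes 1 TB equals 1,000 GB and treats the reported percentages as exact; the function name `combined_savings` is made up for the example.

```python
def combined_savings(datasets):
    """Return (GB saved, overall percent saved) for (size_gb, percent_saved) pairs."""
    total_gb = sum(size for size, _ in datasets)
    saved_gb = sum(size * pct / 100 for size, pct in datasets)
    return saved_gb, 100 * saved_gb / total_gb


# 30% of 1 TB (taken as 1,000 GB) of CIFS data, 22% of 600 GB of NFS data
saved, overall = combined_savings([(1000, 30), (600, 22)])
print(f"{saved:.0f} GB saved, {overall:.0f}% overall")  # prints "432 GB saved, 27% overall"
```

On those assumptions, the association recovered roughly 432 GB out of 1.6 TB, or about 27% across both data sets.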

"If I don't have to keep growing that volume out but I can put more on it because of dedupe, I can not only store more locally but I can replicate more and have a better disaster recovery plan. And it doesn't take up any more bandwidth," said Patrick Osborne, senior vice president of IT at the Alexandria, Va.-based association.

But Osborne wasn't comfortable performing deduplication on all of his data. The association elected not to deduplicate its training videos and highly sensitive biometric files out of fear of corrupting the data, he said.

"I brought it to my users and said, 'Hey, we can do this [on the NetApp FAS 3140]. We might save space, but we don't know how it's going to work.' They said no," Osborne said. "Since I was saving space in those other areas where I was really looking to save space, I was OK."

EMC Celerra

Celerra is currently the only primary storage subsystem in the EMC product family to provide primary storage data reduction. Celerra's data deduplication/compression service integrates a number of technologies that EMC acquired, including an extensible policy engine from Avamar and the compression algorithms of RecoverPoint.

A free operating system feature, Celerra Data Deduplication, works at a file level with CIFS and NFS data, and only on a per-file-system basis (file-level deduplication is also referred to as single-instance storage). That means that if the same file is located in multiple file systems, the dedupe technology can't reduce it to a single copy. Compression also works only on a per-file-system basis.

Using the default settings, the policy engine scans production files once per week to look for data that hasn't been accessed in 30 days. The system compresses whichever files it can and creates a unique hash for each file. It then compares the hashes to see which complete files are redundant and removes the duplicate copies. Stubs point to the files in a hidden deduplication store.
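The policy described above can be approximated in a short Python sketch. It is not EMC's implementation: `FileEntry`, `reduce_file_system` and `read_file` are hypothetical names, SHA-256 and zlib stand in for whatever hashing and compression Celerra actually uses, and the sketch hashes the uncompressed content so that identical files match regardless of how they compress.

```python
import hashlib
import zlib
from dataclasses import dataclass

MIN_IDLE_DAYS = 30  # article's default: files not accessed in 30 days


@dataclass
class FileEntry:
    data: bytes
    idle_days: int  # days since last access (hypothetical field)


def reduce_file_system(files):
    """Scan one file system; return (store, stubs).

    store: content hash -> (is_compressed, payload), the hidden store
    stubs: file name -> content hash of the copy it points to
    """
    store, stubs = {}, {}
    for name, entry in files.items():
        if entry.idle_days < MIN_IDLE_DAYS:
            continue  # recently accessed: leave the file alone
        digest = hashlib.sha256(entry.data).hexdigest()
        if digest not in store:
            packed = zlib.compress(entry.data)
            # Keep the compressed form only if it is actually smaller.
            if len(packed) < len(entry.data):
                store[digest] = (True, packed)
            else:
                store[digest] = (False, entry.data)
        stubs[name] = digest  # whole-file duplicates share one stored copy
    return store, stubs


def read_file(name, store, stubs):
    """Follow a stub back to the file's original bytes."""
    is_compressed, payload = store[stubs[name]]
    return zlib.decompress(payload) if is_compressed else payload
```

Because the store and the stubs are built per call, two identical files in different file systems would each get their own stored copy, which mirrors the per-file-system limit noted earlier.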

Brad Bunce, director of unified storage marketing at EMC, said the most typical and beneficial use case is general-purpose Microsoft Corp. Office shares/files and home directories. Compression generally brings 40% to 50% space savings, and its file-level deduplication produces approximately 10%, he said.

NetApp's more granular fixed-block deduplication produces at least twice the space savings, if not more. But NetApp doesn't offer compression, choosing to leave that to partners such as Storwize Inc.

"If you want to look at block-based deduplication, or deduplication of virtual machine files, for example, that's an area today that we don't compete with them at," Bunce acknowledged.

EMC's lower file-level deduplication rate is somewhat mitigated by the fact that it uses fewer system resources than fixed-block and variable-block deduplication. The resource impact of compression lies somewhere between file- and block-level deduplication, Bunce added.

Bunce said future plans for primary storage data reduction call for greater efficiency for all types of storage, whether file or block, and more granular controls for end users to selectively deduplicate and compress their own data.




Copyright © 2010 Westwick-Farrow Pty Ltd. All rights reserved.