Explaining deduplication rates and single-instance storage to clients


George Crump, Contributor
11.10.2008

Solution provider takeaway: Learn how deduplication rates could influence your clients' deduplication purchasing choice, as well as how to clarify single-instance storage and perform proper testing of deduplication systems.

Deduplication almost doesn't need to be defined anymore, but just to make sure we're all on the same page, I'll define it here: It's the process of identifying redundant segments of data and storing only the unique instances of that data. The results are most beneficial in repetitive data copy processes like backup and archiving.

But exactly how beneficial will deduplication be? While storage efficiencies of 20X are not uncommon (that is, only one-twentieth of the data will need to be stored), the actual rate you see might be lower. Since you're out there on the front lines, it's critical that you set accurate customer expectations. To do that, you'll need to understand all the factors that play into deduplication rates and educate customers before they commit to a product. Those rates will vary depending on factors such as the deduplication technique being used, data types and data sources. Also, in the testing phase, it's key to do real-world testing.
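
When talking through ratios with a customer, it can help to translate an "NX" figure into the fraction of data actually stored and the percentage saved. The snippet below is a small illustrative sketch of that arithmetic; the function names are made up for this example and do not come from any product.

```python
def stored_fraction(ratio: float) -> float:
    """Fraction of the original data that still has to be stored at a given ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 / ratio


def space_savings_percent(ratio: float) -> float:
    """Percentage of the original capacity that is saved at a given ratio."""
    return (1.0 - stored_fraction(ratio)) * 100.0


if __name__ == "__main__":
    # Sample ratios chosen for illustration only.
    for ratio in (2, 4, 7, 20, 60):
        print(f"{ratio}X: store {stored_fraction(ratio):.1%} of the data, "
              f"save {space_savings_percent(ratio):.1f}%")
```

Seen this way, the jump from 20X to 60X adds only a few percentage points of savings, which is worth pointing out when a customer is comparing headline ratios.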

Realistic deduplication rates

The first question that a customer will ask about deduplication is, "How much space will I save?" The only right answer is that it depends on a number of factors.

In addition to factors such as data type and source, which I'll discuss in more detail below, deduplication rates will vary depending on change rate, length and retention period. From a data perspective, there are two processes common to deduplication: backup and archiving.

The first full backup will generate some level of deduplication as redundancy is identified across files, volumes and servers within the enterprise. Be careful here, though -- some deduplication systems are not global; they only dedupe on a single server or volume. That said, typical rates for the first full backup can be 2X to 4X efficiencies.

Subsequent incremental backup jobs will typically capture efficiencies of 6X or 7X. Most of the data in an incremental backup consists of either new or modified documents or updated database or email stores. Even if the documents are new, a comparison can be made to similar files for redundant patterns. The data segments that together represent the modified files will be compared to data segments of the original copy, and only the changed segments need to be stored. Because they tend to be very large files, databases will be the big gainer. For example, a 200 GB Oracle database that only had a 1% change during the course of the day will only require the storing of 2 GB of new data rather than the entire 200 GB that would be stored without deduplication.
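
The Oracle example works out as follows. This is only a back-of-the-envelope sketch: it assumes the 1% of changed data maps cleanly onto 1% of the stored segments, which a real system may not achieve if the changes are scattered across many segments. The function name is hypothetical.

```python
def new_data_gb(size_gb: float, change_rate: float) -> float:
    """Data that must be stored if only the changed portion of a file is kept.

    Assumes changed data maps one-to-one onto changed segments (an idealization).
    """
    if not 0.0 <= change_rate <= 1.0:
        raise ValueError("change_rate must be between 0 and 1")
    return size_gb * change_rate


if __name__ == "__main__":
    database_gb = 200.0   # size of the database from the example
    daily_change = 0.01   # 1% change during the day

    stored = new_data_gb(database_gb, daily_change)
    print(f"New data to store: {stored:.1f} GB instead of {database_gb:.1f} GB")
    print(f"Effective ratio for this one file: {database_gb / stored:.0f}X")
```

Note that the ratio for this single database is far higher than the 6X or 7X quoted for a whole incremental job, because the job also contains new files that have little or nothing to deduplicate against.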

Subsequent full backups will see a 50X to 60X reduction in data stored. This is because, as a percentage, there is not much changed data between two full backups, and in the case of deduplication a high percentage of those changes were captured during the incremental jobs; essentially, from a storage perspective, subsequent fulls require no more space than the prior incremental.
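
To show a customer how the per-job figures combine over a retention period, you can model a simple backup schedule. The sketch below assumes one full and six incrementals per week and uses midpoints of the ranges quoted above (3X for the first full, 6.5X for incrementals, 55X for later fulls). The backup sizes and the schedule are invented for illustration and should be replaced with the customer's own numbers.

```python
def cumulative_ratio(weeks: int,
                     full_gb: float = 1000.0,      # assumed size of a full backup
                     incr_gb: float = 50.0,        # assumed size of one incremental
                     incr_per_week: int = 6,       # assumed schedule
                     first_full_x: float = 3.0,    # midpoint of 2X to 4X
                     incr_x: float = 6.5,          # midpoint of 6X to 7X
                     later_full_x: float = 55.0    # midpoint of 50X to 60X
                     ) -> float:
    """Overall ratio (data sent / data stored) after a number of weeks of retention."""
    if weeks < 1:
        raise ValueError("weeks must be at least 1")
    logical = 0.0  # data sent to the deduplication system
    stored = 0.0   # data actually written to disk
    for week in range(weeks):
        full_x = first_full_x if week == 0 else later_full_x
        logical += full_gb
        stored += full_gb / full_x
        logical += incr_per_week * incr_gb
        stored += incr_per_week * incr_gb / incr_x
    return logical / stored


if __name__ == "__main__":
    for weeks in (1, 4, 12, 26, 52):
        print(f"{weeks:>2} weeks retained: about {cumulative_ratio(weeks):.1f}X overall")
```

Under these assumptions the overall ratio starts near the first-full figure and climbs as more generations are retained, which is one reason retention period matters so much when setting expectations.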

Single-instance storage vs. deduplication

In the channel, you'll often hear the term "single-instance storage" used synonymously with deduplication. They're different, but it's easy for customers to get confused.

Single-instance storage (SIS) is a form of data reduction, but it's not data deduplication. The difference between SIS and deduplication lies in the level of granularity that can be applied. As explained above, data deduplication works at a segment or sub-block level; SIS works at the file level and eliminates redundant copies of files.

Here's an illustration of the difference between SIS and deduplication: Say there's a PowerPoint file that has been stored in the home directory of each member of the marketing department. As long as those copies are identical, single-instance storage would store only one copy of the file, and so would data deduplication. If the company changed its logo and each marketing person updated their own copy of the presentation with the new logo, the copies would no longer be identical files, so SIS would save every new version in full; data deduplication would store only the segments that changed in each file.

An even better example of the difference between SIS and deduplication is seen with databases. Since changes are made every day to databases, at every backup, the database appears to the backup application as a new version of a file and is sent to the backup target as such. A SIS-based system would also see this as a new version of the file and store a new copy of the file each day. A data deduplication system would store only the blocks of the database that had changed from the previous night's backup.
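
The PowerPoint illustration can be reproduced with a toy model. The sketch below is not how any particular product works: it assumes fixed 4 KB segments, SHA-256 fingerprints, random bytes standing in for file content, and a one-segment personal note added by each user so that the updated copies differ from one another. All names are hypothetical.

```python
import hashlib
import random

SEGMENT_SIZE = 4096  # assumed segment size for this toy model


def sis_stored_bytes(files):
    """File-level reduction: keep one copy of each distinct file."""
    unique = {}
    for data in files:
        unique[hashlib.sha256(data).digest()] = len(data)
    return sum(unique.values())


def dedup_stored_bytes(files, segment_size=SEGMENT_SIZE):
    """Segment-level reduction: keep one copy of each distinct segment."""
    unique = {}
    for data in files:
        for offset in range(0, len(data), segment_size):
            segment = data[offset:offset + segment_size]
            unique[hashlib.sha256(segment).digest()] = len(segment)
    return sum(unique.values())


rng = random.Random(42)  # fixed seed so the example is repeatable


def random_bytes(n):
    """Deterministic pseudo-random bytes standing in for file content."""
    return bytes(rng.getrandbits(8) for _ in range(n))


body = random_bytes(40 * SEGMENT_SIZE)   # slides shared by every copy
old_logo = random_bytes(SEGMENT_SIZE)
new_logo = random_bytes(SEGMENT_SIZE)

original = old_logo + body
copies = [original] * 10                 # ten identical copies in home directories
# Each user swaps in the new logo and appends a personal one-segment note.
updated = [new_logo + body + random_bytes(SEGMENT_SIZE) for _ in range(10)]

if __name__ == "__main__":
    print("Identical copies  SIS:", sis_stored_bytes(copies),
          "dedupe:", dedup_stored_bytes(copies))
    print("After logo change SIS:", sis_stored_bytes(copies + updated),
          "dedupe:", dedup_stored_bytes(copies + updated))
```

In this model both approaches store a single copy while the files are identical. Once each user's version differs, SIS keeps every new version whole, while the segment-level store adds only the new logo once plus each user's note.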

Single-instance storage is typically implemented by the backup or archive software, whereas deduplication is typically performed at a standalone storage appliance. Software-based SIS operates only on the duplicate data that is actually processed by the backup or archive application. In practice, however, redundant copies usually come from a variety of sources. With databases, those sources can be the backup application, the built-in database backup utility or an external third-party application; in many data centers, all three are used. Deduplication systems, as a target for all of these sources, work across all of the data, yielding a much higher level of efficiency than a SIS implementation.

Block-level deduplication vs. variable segment deduplication

Any conversation with a customer about block-level deduplication vs. variable segment deduplication means they're interested in pretty technical aspects of the technology. The main challenge for fixed block-level data deduplication comes from block shifting. Block shifting occurs when all the data in a file is rewritten on Save or Save As, so that existing content lands at different block offsets than before. The challenge is that some fixed block-level systems may identify this shifted data as unique, even though most of it has not changed.

Systems that deduplicate on variable-length segments, on the other hand, anchor segments based on smaller data patterns and as a result are less sensitive to block shifting because they can pick up commonality within the file even after the file has been rewritten.
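
A small experiment makes block shifting concrete. The sketch below inserts a few bytes at the front of a file and counts how much data each approach would treat as new. It is a simplified stand-in, not any vendor's algorithm: segment boundaries are placed wherever a CRC-32 of the previous 16 bytes matches a bit pattern, whereas real products typically use tuned rolling hashes and minimum and maximum segment sizes. All names and parameters are assumptions made for this example.

```python
import hashlib
import random
import zlib


def fixed_blocks(data, size=1024):
    """Split data into fixed-size blocks at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_segments(data, window=16, mask=0x3FF):
    """Split data where the content itself signals a boundary.

    A boundary falls after any position whose preceding `window` bytes hash
    to a value with the low bits cleared, so boundaries move with the content.
    """
    segments, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            segments.append(data[start:end])
            start = end
    if start < len(data):
        segments.append(data[start:])
    return segments


def new_bytes(old_pieces, new_pieces):
    """Bytes in new_pieces whose fingerprint is not already stored."""
    known = {hashlib.sha256(p).digest() for p in old_pieces}
    total = 0
    for piece in new_pieces:
        digest = hashlib.sha256(piece).digest()
        if digest not in known:
            known.add(digest)
            total += len(piece)
    return total


rng = random.Random(7)  # fixed seed so the example is repeatable
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"NEW" + original  # three inserted bytes shift everything after them

fixed_new = new_bytes(fixed_blocks(original), fixed_blocks(shifted))
variable_new = new_bytes(variable_segments(original), variable_segments(shifted))

if __name__ == "__main__":
    print("Fixed blocks, new bytes to store:     ", fixed_new)
    print("Variable segments, new bytes to store:", variable_new)
```

With fixed blocks, every block after the insertion point has different content, so nearly the whole file looks new. With content-defined boundaries, the segments realign after the first boundary and only the data near the insertion has to be stored.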

Be realistic in your testing

A common mistake resellers and their customers make when performing tests or evaluations is to evaluate for a short period of time and not run real-world simulations. When testing a deduplication system, you should test both multiple-stream backup -- a lot of data from a lot of computers -- and single-stream backup -- a large database or file from a single server. Make sure that performance is acceptable under both conditions. Test all types of data in the environment: large files, images, data from the backup applications, and direct copies from operating system or database utilities.

Most importantly, you should test recovery performance from older generations of backup data. Without proper built-in intelligence, it is quite easy for data deduplication systems to become fragmented, significantly affecting restore performance of older files and backup sets. Restore speeds can drop well below what they should be, guaranteeing issues down the road.

Be realistic about OEM relationships

You know how this works: OEM relationships are almost always a matter of convenience. They come about when a market is legitimized before a major vendor has had a chance to respond; the vendor will OEM the technology to get a toe in the market. In some cases, the vendor is not adding any value to the relationship and is merely trying to cash in on the revenue stream.

Typically, when a vendor OEMs a product, they have to handle the bulk of the support calls; because they didn't produce the product, they're not as well-equipped to handle those calls. Even organizations that are known for excellent service on their own products can be slow to understand products they OEM. Many times these OEM relationships are not designed to last, leaving customers in limbo.

In the deduplication market, a storage vendor might OEM a data deduplication application to accompany its storage product. If you recommend an OEMed deduplication system to a customer, they might end up unhappy with the recommendation due to poor support. For more on this topic, see my Channel Marker blog entry on manufacturer innovation.

References are key

As with any other data center technology gaining in popularity, suppliers are rushing to catch up with the deduplication market, and sometimes hastily so. Customers are yearning for the technology. You as the trusted advisor need to be the voice of reason. Develop a set of your own references that can speak to potential customers about their actual experience with the technology. Be prepared with these. It amazes me how often a channel representative gets that "deer in the headlights" look when asked for references. Being able to answer without hesitation will give the customer greater confidence that you're the go-to guy.

About the author

George Crump is president and founder of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. With 25 years of experience designing storage solutions for data centers across the United States, he has seen the birth of such technologies as RAID, NAS and SAN. Prior to founding Storage Switzerland, George was chief technology officer at one of the nation's largest storage integrators, where he was in charge of technology testing, integration and product selection. Find Storage Switzerland's disclosure policy here.


All Rights Reserved, Copyright 2006 - 2009, TechTarget