Managing the information that drives the enterprise

STORAGE

ESSENTIAL GUIDE TO
Data Deduplication and Backup

INSIDE
Data dedupe tutorial
Source-side deduplication
How to restore deduped data
Solving common dedupe problems
What you need to know about global dedupe

Whether you're new to data deduplication or want to brush up on the fundamentals, we have all the answers in our essential guide on data dedupe and backup.
Quantum's DXi-Series Appliances with deduplication provide higher performance at lower cost than the leading competitor.

Preserving The World's Most Important Data. Yours.

Contact us to learn more at (866) 809-5230 or visit www.quantum.com/dxi
© 2011 Quantum Corporation. All rights reserved.
Quantum has helped some of the largest organizations in the world integrate deduplication into their backup process. The benefits they report are immediate and significant: faster backup and restore, 90%+ reduction in disk needs, automated DR using remote replication, and reduced administration time, all while lowering overall costs and improving the bottom line.

Our award-winning DXi-Series appliances deliver a smart, time-saving approach to disk backup. They are acknowledged technical leaders. In fact, our DXi6500 was just nominated as a Best Backup Hardware finalist in Storage Magazine's Best Product of the Year Awards; it's both faster and up to 45% less expensive than the leading competitor.

Get more bang for your backup today. Faster performance. Easier deployment. Lower cost.
Dedupe debate isn't why, but how

Data deduplication for backup might not be a brand-new technology anymore, but there are plenty of new developments and capabilities, and a lot of decisions to make before deploying dedupe.

EDITORIAL | RICH CASTAGNA

If you ask storage managers what keeps them up at night, the answer will be backup. But over the past few years, data deduplication for backup has emerged, making backup a little less formidable while focusing more attention on data protection than ever before.

The premise of backup data deduplication is simple: save less redundant data and you'll be able to keep more on disk for a longer period of time. That's important because, as any backup admin knows, backing up isn't the hard part; it's restoring lost or damaged data that's tough. By having a week's or even a month's worth of data on nearline storage, those restores are a lot easier and faster compared to restores from tape.

According to our survey research, about 27% of storage shops are already using data dedupe in their backup operations. That number has been climbing steadily since we started tracking it in 2006, when less than 5% were using dedupe. It's likely, too, that by the end of the year there will be even more dedupe deployments to tally, as 30% said they're planning implementations.

As with most other storage technologies, opting for data deduplication in backup is a lot more complex than making a simple go/no-go decision. There are a number of key variations on the technology that can make choosing somewhat difficult, but that variety can also help to ensure that you get the specific capabilities and features your environment requires.

And as dedupe technology matures, the list of options grows. For instance, source-side dedupe, where your backup app handles deduplication chores, now offers solid alternatives to the dedupe appliances that had dominated the market.

You can also expect that your tried-and-true backup processes may need to be significantly altered. A lot of that alteration will involve the elimination of burdensome, time-consuming tasks, but other tasks may have to be modified to accommodate the new deduped environment. Restoring data is a good example: deduplicated data must be rehydrated before it can be restored to production.

Data deduplication is probably the biggest thing to happen to backup since disk was inserted into the process, but to reap the greatest benefit from this technology, you'll need to be armed with the facts. This guide addresses all of these issues and more. It's designed to help you make more informed choices and to be aware of any potential pitfalls on the road to backup efficiency.

Rich Castagna is the editorial director of TechTarget's Storage Media Group.

Copyright 2011, TechTarget. No part of this publication may be transmitted or reproduced in any form, or by any means, without permission in writing from the publisher. For permissions or reprint information, please contact Mike Kelly, VP and Group Publisher (mkelly@techtarget.com).
Deduplication in data backup environments tutorial

Learn everything you need to know about data deduplication and backup in this tutorial. By W. Curtis Preston

Data deduplication is one of the biggest game-changers in data backup and data storage in the past several years, and it is important to have a firm understanding of its basics if you're considering using it in your environment.

When the term deduplication, also referred to as data dedupe or data deduping, is used without any qualifiers (e.g., file-level dedupe), we are typically referring to subfile-level deduplication. This means that individual files are broken down into segments and those segments are examined for commonality. If two segments are deemed to be the same (even if they are in different files), then one of the segments is deleted and replaced with a pointer to the other segment. Segments that are deemed to be new or unique are, of course, stored as well.

Different files, even if they are in different filesystems or on different servers, may have segments in common for a number of reasons. In backups,
duplicate segments between files might indicate that the same exact file exists in multiple places. Duplicates are also created when performing repeated full backups of the same servers. Finally, duplicate segments are created when performing incremental backups of files. Even if only a few bytes of a file have changed, the entire file is usually backed up by the backup system. If you break that file down into segments, most of the segments between different versions of the same file will be the same, and only the new, unique segments need to be stored.
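To make the mechanics concrete, here is a minimal Python sketch of subfile-level deduplication. The fixed 8K segment size, the in-memory dictionaries, and the DedupeStore name are illustrative assumptions for this guide, not how any particular product works (real systems use variable-length segments and persistent indexes).

```python
import hashlib

SEGMENT_SIZE = 8192  # 8K segments; real products vary (assumption)

class DedupeStore:
    """Toy subfile-level dedupe: store each unique segment exactly once."""
    def __init__(self):
        self.segments = {}   # hash -> segment bytes (the unique data)
        self.recipes = {}    # backup name -> list of segment hashes (pointers)

    def ingest(self, name, data):
        recipe = []
        for i in range(0, len(data), SEGMENT_SIZE):
            seg = data[i:i + SEGMENT_SIZE]
            h = hashlib.sha256(seg).hexdigest()
            if h not in self.segments:   # new, unique segment: store it
                self.segments[h] = seg
            recipe.append(h)             # duplicate: keep only a pointer
        self.recipes[name] = recipe

    def restore(self, name):
        # Rebuild ("rehydrate") the backup by following the pointers.
        return b"".join(self.segments[h] for h in self.recipes[name])

store = DedupeStore()
payload = b"A" * 20000
store.ingest("monday.bak", payload)
store.ingest("tuesday.bak", payload + b"B" * 100)  # file barely changed
assert store.restore("monday.bak") == payload
print(len(store.segments), "unique segments stored")  # far fewer than ingested
```

Two backups of a nearly identical file share almost all of their segments; only the changed tail is stored a second time, which is exactly the incremental-backup effect described above.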
WHERE YOUR DATA IS DEDUPED: INLINE VS. POST-PROCESSING DEDUPLICATION

The two primary approaches (inline deduplication and post-processing deduplication) are roughly analogous to synchronous replication and asynchronous replication. Inline deduplication is roughly analogous to synchronous replication, as it does not acknowledge a write until a segment has been determined to be unique or redundant; the original, native data is never written to disk. In an inline system, only new, unique segments are written to disk. Post-process deduplication is roughly analogous to asynchronous replication, as it allows the original data to be written to disk and deduplicated at a later time. "Later" can be seconds, minutes, or hours, depending on which system we are talking about and how it has been configured.

Inline vendors claim to be more efficient and require less disk. Post-process vendors claim to allow for faster initial writes and faster read performance for more recent data, mainly because it is left stored in its native format. Both approaches have merits and limitations, and one should not select a product based on its position in this argument alone. One should select a product based on its price/performance numbers, which may or may not be affected by the choice to do inline or post-process.
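As a rough sketch of the two write paths, reusing the toy DedupeStore from the earlier example: the inline path dedupes before the write is acknowledged, while the post-process path lands raw data first and dedupes on its own schedule. The landing_zone list is an illustrative stand-in for the extra native-format disk a post-process system needs.

```python
def inline_write(store, name, data):
    # Inline: the write is not complete until dedupe has run;
    # native data never hits disk, only new unique segments do.
    store.ingest(name, data)

landing_zone = []  # post-process: backups land here in native format first

def post_process_write(name, data):
    # Fast initial write: append raw data, acknowledge immediately.
    landing_zone.append((name, data))

def run_post_process(store):
    # "Later" (seconds to hours, per configuration): dedupe the landed
    # backups and free the landing-zone space.
    while landing_zone:
        name, data = landing_zone.pop(0)
        store.ingest(name, data)
```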
HOW IS DUPLICATE DATA IDENTIFIED?

There are three primary approaches to this question: hash-based, modified hash-based and delta differential. Hash-based vendors take segments of files and run them through a cryptographic hashing algorithm, such as SHA-1, Tiger, or SHA-256, each of which creates a numeric value (such as 160 bits or 256 bits, depending on the algorithm) that can be compared against the numeric values of every other segment that the dedupe system has ever seen. Two segments that have the same hash are considered to be redundant. A modified hash-based approach typically uses a much smaller hash (e.g., a cyclic redundancy check, or CRC, of only 16 bits) to see if two segments might be the same; they are referred to as redundancy candidates. If two segments look like they might be the same, a binary-level comparison verifies that they are indeed the same before one of them is deleted and replaced with a pointer.

Delta differential systems attempt to associate larger segments to each other (e.g., two full backups of the same database) and do a block-level comparison of them against each other. The delta differential approach is only useful in backup systems, as it only works when comparing multiple versions of the same data to each other. This does not happen in primary storage; therefore, all primary storage deduplication systems use either the hash-based or the modified hash-based approach to identifying duplicate data.
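A minimal sketch of the modified hash-based approach, using the low 16 bits of Python's zlib.crc32 as a stand-in for a real 16-bit CRC (an assumption for illustration; vendors use their own checksums). The small hash only nominates candidates; the byte-for-byte comparison is what confirms redundancy.

```python
import zlib

def crc16(seg: bytes) -> int:
    # Stand-in: truncate CRC-32 to 16 bits to get a small, cheap checksum.
    return zlib.crc32(seg) & 0xFFFF

candidates = {}  # crc16 value -> stored segments that share that checksum

def is_duplicate(seg: bytes) -> bool:
    key = crc16(seg)
    for stored in candidates.get(key, []):
        if stored == seg:          # binary-level comparison confirms it
            return True            # redundant: replace with a pointer
    candidates.setdefault(key, []).append(seg)
    return False                   # new segment: store it

assert is_duplicate(b"hello world") is False
assert is_duplicate(b"hello world") is True  # second copy confirmed redundant
```

The tradeoff is visible in the code: a 16-bit checksum is far cheaper to compute and store than SHA-256, but collisions are expected, so the verify step is mandatory rather than optional.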
TARGET DEDUPE VS. SOURCE AND HYBRID APPROACHES

Where should duplicate data be identified? This question only applies to backup systems, and there are three possible answers: target, source, and hybrid. A target deduplication system is used as a target for regular (non-deduplicated) backups, and is typically presented to the backup server as a NAS share or virtual tape library (VTL). Once the backups arrive at the target, they are deduplicated (either inline or as a post-process) and written to disk. This is referred to as target dedupe, and its main advantage is that it allows you to keep your existing backup software.

If you're willing to change backup software, you can switch to source deduplication, where duplicate data is identified at the server being backed up, and before it is sent across the network. If a given segment or file has already been backed up, it is not sent across the LAN or WAN again; it has been deduped at the source. The biggest advantage to this approach is the savings in bandwidth, making source dedupe the perfect solution for remote and mobile data.

The hybrid approach requires a little bit more explanation. It is essentially a target deduplication system, as redundant data is not eliminated until it reaches the target; however, it is not as simple as that. Remember that to deduplicate data, the files must first be broken down into segments. In a hash-based approach, a numeric value, or hash, is then calculated on the segment, and then that value is looked up in the hash table to see if it has been seen before. Typically, all three of these steps are performed in the same place, either at the source or the target. In a hybrid system, the first one or two steps can be done on the client being backed up, and the final step can be done on the backup server. The advantage of this approach (over typical target approaches) is that data may be compressed or encrypted at the client. Compressing or encrypting data before it reaches a typical target deduplication system would significantly impact your dedupe ratio, possibly eliminating it altogether. But this approach allows for both compression and encryption before data is sent across the network.
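Here is a sketch of that division of labor: the client segments and hashes, asks the server which hashes are unknown, and ships only the unique segments. The function names and the in-memory "network" are illustrative assumptions, not any vendor's protocol.

```python
import hashlib

SEGMENT_SIZE = 8192
server_index = {}  # hash -> segment bytes, held at the backup target

def server_lookup(hashes):
    # Final step at the server: which of these has it never seen?
    return {h for h in hashes if h not in server_index}

def server_store(h, seg):
    server_index[h] = seg

def client_backup(data: bytes) -> list[str]:
    # Steps 1-2 on the client: segment and hash.
    segments = [data[i:i + SEGMENT_SIZE]
                for i in range(0, len(data), SEGMENT_SIZE)]
    hashes = [hashlib.sha256(s).hexdigest() for s in segments]
    # Only segments the server lacks cross the network: the bandwidth win.
    for h, s in zip(hashes, segments):
        if h in server_lookup([h]):
            server_store(h, s)
    return hashes  # the recipe needed to restore later

recipe = client_backup(b"X" * 30000)
client_backup(b"X" * 30000)  # a second full backup sends no segments at all
```

Because only hashes travel before the lookup, the client is free to compress or encrypt each unique segment before shipping it, which is the hybrid approach's advantage described above.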
PRIMARY STORAGE DEDUPLICATION

Deduplication is also used in primary data storage, where duplicate data is not as common, but it does exist. Just as in backups, the same exact file may reside in multiple places, or end users may save multiple versions of the same file as a way of protecting themselves against fat-finger incidents. One type of data that has a lot of commonality between different files is system images for virtualization systems. The C: (or root) drive for one system is almost exactly the same as the C: (root) drive for another system. A good deduplication system will identify all of those common files and segments and replace them with a single copy.

Whether we're talking backups or primary storage, the amount of disk saved is highly dependent on the type of data being stored and the amount of duplicate segments found within that data. Typical savings in backup range from 5:1 to 20:1 and average around 10:1, with users who do frequent full backups tending toward the higher ratios. Savings in primary storage are usually expressed in reduction percentages, such as "there was a 50% reduction," which sounds a lot better than a 2:1 deduplication ratio. Savings in primary storage range from 50% to 60% or more for typical data, and as much as 90% or more for things like virtual desktop images.
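The two ways of quoting savings are interchangeable: a 2:1 ratio is a 50% reduction, and 10:1 is 90%. A small illustrative helper makes the conversion explicit:

```python
def ratio_to_reduction(ratio: float) -> float:
    """Convert a dedupe ratio (e.g., 2.0 for 2:1) to percent reduction."""
    return (1 - 1 / ratio) * 100

print(ratio_to_reduction(2))   # 50.0 -> "50% reduction" in primary storage
print(ratio_to_reduction(10))  # 90.0 -> the typical backup average
print(ratio_to_reduction(20))  # 95.0 -> high end, frequent full backups
```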
YOUR MILEAGE WILL VARY

There is no "may" about it: your mileage will vary when data deduping. Your dedupe performance and data deduplication ratio will be significantly different
Source-based deduplication explained

Most of the leading data backup apps have source deduplication. Find out whether or not source dedupe can help your organization shorten your backup windows, and learn about the pros and cons of source vs. target deduplication. By Andrew Burton

Julian Cooper, senior IT administrator at Integrated Control Corp. (ICC), recently deployed source-based deduplication after an overhaul of the medium-sized business's backup strategy. The company was performing backup to tape but was struggling with slow restores and failed backups. After moving to a disk backup approach, Cooper began looking for ways to reduce backup data and considered both data deduplication and archiving.

Most of the leading data backup applications now include source deduplication, including CA Inc.'s ARCserve Backup, CommVault Systems Inc.'s Simpana, EMC Corp.'s Avamar, IBM Corp.'s Tivoli Storage Manager (TSM), and Symantec Corp.'s Backup Exec and NetBackup. ICC is a Symantec
Backup Exec shop, and they back up to a Dell PowerVault MD1000 direct-attached storage (DAS) array onsite and use the Symantec Online Backup Service for off-site backup. "We were backing up between 350 GB and 375 GB for full backups," Cooper said. "When we started deduping, we saved about 50 GB [on a full backup], so that was a huge benefit."

And, though they were using Backup Exec before they decided to run deduplication, Cooper took his time evaluating how running source deduplication would affect the rest of his environment. The decision to use it wasn't as simple as just turning it on. He said, "I wish it was that simple. I wish [products] always just worked. Then it'd be so easy, like 'oh, new product? This is great.'

"It was a combination of things. What's the cost? What's the learning curve? What's the cost of potentially expanding out my servers? In the end, it made more sense to use what we had more efficiently rather than buying more or bigger systems," he added.
BEST ENVIRONMENTS FOR SOURCE DEDUPE

As the name implies, source dedupe products process deduplication on the application, or client, server before sending data across a network to the backup target. This is a compelling benefit for users looking to alleviate bandwidth constraints. For example, a company might choose source-based dedupe when backing up a remote office to a central data center, reducing the amount of data they have to send across the WAN. This is a major driver for source dedupe today, according to Jeff Boles, senior analyst with the Taneja Group. "If you've only got a couple machines at a remote office, maybe you don't see a need to invest in an expensive Riverbed-type WAN optimization appliance," he said.

Within the data center, reducing data at the source can take a lot of strain off your local network. This can be particularly useful in virtualized environments. "If you look at the data within a virtual machine disk file, there is a lot of redundancy across virtual machines," said Lauren Whitehouse, senior analyst with Enterprise Strategy Group. "If you look at a physical system, you have your operating system running once and then whatever applications and data. On a host system running multiple virtual machines, that operating system exists multiple times."
However, there is a tradeoff, said Whitehouse. The I/O processing to perform the deduplication might put a strain on the physical server being shared by the virtual machines. She went on to say that while you might see resource contention while the backup is running, deduplicated backups take considerably less time to complete. So, it's a matter of weighing one against the other.

Cooper is currently evaluating just that in ICC's environment. He has deployed server virtualization using Microsoft's Hyper-V, but has not virtualized tier-one critical apps yet. "The main thing I want to see is the performance impact of deduping on the virtual system itself," he said. "Anything you add on to the virtual system takes resources, so you have to find the balance between the benefits you see and the strain you are putting on the server." On the flip side, "[source deduplication] reduces the horsepower you need in your storage destination, because the target is not responsible for processing the deduplication," said Boles.

Cooper also said that he uses Symantec's backup reporting tools to further optimize backups. "The reporting makes things so much easier when, say, we have an extra 40 gigs in a month," he said. "What's included in the backup? Is it an anomaly? Is it natural growth? Can we archive some of this? It helps us understand what's going on in our environment."
SOURCE VS. TARGET DEDUPE

Certain organizations may not be able to live with the performance hit on physical servers that comes with source dedupe. "If you have a high performance, [and are] processing [an] intensive environment that doesn't have a lot of downtime, you might be hard pressed to implement source dedupe gracefully," said Boles. That type of environment is generally better served by a target-based system such as Quantum Corp.'s DXi series, IBM's ProtecTIER, NEC Corp.'s HYDRAstor series, FalconStor Software Inc.'s File-interface Deduplication System (FDS), or EMC's Data Domain series.

Target deduplication may also be a better fit for an organization running a backup application that does not have built-in deduplication capabilities or running multiple backup applications. Some organizations may find that source makes sense for some backup jobs and target makes sense for others. Today, some backup software vendors, such as CommVault and Symantec,
are responding to this need in the market, offering products that can perform deduplication at the source or at the backup target. However, these products are just emerging.

Finally, there is a perception that source-based deduplication is cheaper than target-based deduplication. "The fact that it is built into a lot of backup software solutions may contribute to the perception that it is cheaper," said Whitehouse. "But most of those vendors typically charge extra for that feature. And, just because you are not buying a target deduplication system doesn't mean that you don't need storage. I think if you look at the total cost of ownership it might be cheaper, but not significantly so."

Andrew Burton is the site editor of SearchDataBackup.
Restoring deduped data in backup environments

Restoring deduped data in some deduplication systems can be slower than restores from tape. Find out how to analyze your dedupe system's performance and how to speed up your restores. By W. Curtis Preston

Restores from some data dedupe systems can be slower than modern tape drives, and restores from other deduplication systems can be faster than tape could ever dream of. In this column, you will learn what you need to do to make sure your dedupe system is on the right side of those two extremes.

To explain why there is what many call a "dedupe tax," we need to go back in time to when there was no data deduplication technology. Before dedupe, we wrote data to tape or disk in contiguous blocks. Writing contiguous blocks
of backup data to disk requires either an empty filesystem, or a disk system designed for that purpose, such as a virtual tape library (VTL). The blocks that comprised a given backup were all located in proximity to each other.

Data backup systems also occasionally perform a full backup, a synthetic full backup, or otherwise collocate files necessary for a complete restore (e.g., IBM Tivoli Storage Manager's active data pools). These typical behaviors of backup systems meant that the bulk of the blocks needed to restore a given system would all be placed contiguously on disk or tape, making a complete restore of that system very easy to accomplish.

Most would agree that the best thing for fast restores would be to have a recent full (or updated active data pool) on disk, and to have any subsequent backups also on disk. This is why it comes as a surprise to many when they read that restores from a deduplication system (which is almost always on disk) could be slower than what they're used to, even slower than from tape.

The reason for this is that no matter how you back up your data to a deduplication system, it is rarely stored contiguously on disk. Since the blocks that comprise the latest backup were actually created over time, those blocks will be stored all over the dedupe system, based on when they were backed up. A restore from deduped data is therefore a very fragmented read from disk. Instead of a single disk seek followed by a large disk read of hundreds or thousands of blocks, you could have hundreds of disk seeks and reads from hundreds of different disk drives.
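To see why fragmentation hurts, here is a back-of-the-envelope comparison of a contiguous read versus a seek-per-fragment read. The disk parameters (8 ms per seek, 150 MBps sequential throughput) are illustrative assumptions, not measurements of any product:

```python
SEEK_S = 0.008          # avg seek + rotational latency (assumption)
READ_MBPS = 150         # sequential throughput (assumption)
RESTORE_MB = 100_000    # a 100 GB restore

def restore_seconds(n_fragments: int) -> float:
    # One seek per contiguous fragment, plus the sequential transfer time.
    return n_fragments * SEEK_S + RESTORE_MB / READ_MBPS

print(restore_seconds(1))          # ~667 s: one big contiguous read
print(restore_seconds(1_000_000))  # ~8,667 s: same data in ~100 KB fragments
```

The transfer time is identical in both cases; the entire slowdown is seek overhead, which is exactly what the dedupe tax is.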
ANALYZING YOUR DEDUPLICATION SYSTEM'S PERFORMANCE

So how is it that some deduplication systems will have better read performance than other systems? It depends on a few things. The first factor is the manner in which the systems store data each day and the degree to which their data storage method fragments the data. The second factor is the degree to which they do things to mitigate the negative restore effects of their data storage methods. Finally, some dedupe systems may have a single stream limitation that can impact their restore speed.

The first thing that affects how dedupe systems store data is what they do when they find a match to an older segment of data. Do they leave the old segment in place and write a pointer to the older segment instead of writing the new segment to disk (reverse referencing)? Or do they write the new segment to disk, delete the older segment and replace it with a pointer to the new segment (forward referencing)? Deduplication systems that continually write pointers for newer, redundant segments may result in more fragmentation for newer data. Systems that always write newer data and delete older data may result in more contiguously stored data from newer backups. Forward referencing is only possible in post-processing deduplication systems. This method of storing data should result in faster restores of more recent data, which is where most restores come from.
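A toy model of the two referencing strategies, using an append-only "log" where position stands in for disk locality. The log, index, and fragment counter are illustrative assumptions; real systems manage physical containers, not Python lists.

```python
import hashlib

log = []     # (hash, data) in write order; adjacent slots ~ adjacent on disk
index = {}   # hash -> position of the live copy in the log

def write(seg: bytes, forward: bool) -> str:
    h = hashlib.sha256(seg).hexdigest()
    if h not in index:
        log.append((h, seg))
        index[h] = len(log) - 1
    elif forward:
        # Forward referencing: store the new copy with this backup and
        # retire the old one, so the *latest* backup stays contiguous.
        log[index[h]] = (h, None)   # old slot becomes garbage
        log.append((h, seg))
        index[h] = len(log) - 1
    # Reverse referencing: do nothing; the new backup just points backward.
    return h

def fragments(recipe: list[str]) -> int:
    # Count the contiguous runs of log positions a restore must read.
    pos = [index[h] for h in recipe]
    return 1 + sum(1 for a, b in zip(pos, pos[1:]) if b != a + 1)

for forward in (False, True):
    log.clear(); index.clear()
    monday = [write(bytes([i]) * 100, forward) for i in range(5)]
    tuesday = [write(bytes([i]) * 100, forward) for i in (0, 1, 9, 3, 4)]
    print("forward" if forward else "reverse",
          "-> monday fragments:", fragments(monday),
          "| tuesday fragments:", fragments(tuesday))
```

Running this shows the tradeoff the text describes: reverse referencing leaves Monday's backup contiguous and fragments Tuesday's, while forward referencing fragments Monday's and keeps the newer Tuesday backup contiguous.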
The next thing to consider is whether the system tries to collocate segments where it can. When a system is accepting a stream of backup data, it may have some knowledge that certain data segments are associated with other data segments, and do its best to store them in a way that allows for more contiguous storage of related segments.

Another thing that may impact how data is stored is how the dedupe system does what is commonly referred to as garbage collection. As data is expired, some data segments will no longer be needed and can be discarded. Some vendors delete such segments with no regard to how this increases the fragmentation of the remaining data. Other vendors consider such things during their garbage collection process, which is typically run on a daily basis.
MITIGATING THE DATA DEDUPE TAX

Vendors do a number of things to mitigate the dedupe tax. Some vendors do extra work during their garbage collection process to actually relocate related segments together, especially those segments that do not have anything in common with other backups. Think of this as a fancy defragmentation process. Other vendors keep last night's backup stored on disk in its native format so that if someone asks to restore (or copy) from last night's backup, they can satisfy that need without bringing dedupe into the picture. While this is very desirable from a restore speed perspective, it is another technology that requires a post-process architecture, which carries with it the concept of a landing zone where backups are stored in their native format before they are deduped. That landing zone will require extra disk that is not required in an inline configuration.
Finally, some vendors also suffer from a single stream limitation. That is, their architecture does not allow a single stream of data out of their product any faster than n MBps. If you plan on copying data from a deduplication system to tape, its single stream restore speed capabilities are paramount. This is because a tape copy is essentially a very demanding restore; not only does the device have to reassemble all the appropriate bits back to their native form, it must do so in a very fast single stream of data. There is no point in trying to copy data to an LTO-5 tape drive that wants 210 MBps (assuming 1.5:1 compression) if the fastest your deduplication system can supply is 90 MBps.

Suffice it to say that not all dedupe systems are created equal when it comes to restore speed. Make sure you do your homework.

W. Curtis Preston is an independent backup expert.
Solving common data deduplication system problems

Still not getting the deduplication ratios vendors are promising? Learn about the most common problems with deduplication systems and how to fix them. By W. Curtis Preston

It's been said that we never really solve any problems in IT; we just move them. Data deduplication is no exception to that rule. While deduplication systems have helped make data backup and recovery much easier, they also come with a number of challenges. The savvy storage or backup administrator will familiarize themselves with these challenges and do whatever they can to work around them.

Your backup system creates duplicate data in three different ways: repeated full backups of the same file system or application; repeated incremental backups of the same file or application; and backups of files that happen to reside in multiple places (e.g., the same OS/application on multiple machines). Hash-based deduplication systems (e.g., CommVault Systems Inc., EMC Corp., FalconStor Software, Quantum Corp., Symantec Corp.) will identify and eliminate all three types of duplicate data, but their level of granularity is limited to their chunk size, which is typically 8K or larger. Delta-differential-based deduplication systems (e.g., IBM Corp., ExaGrid Systems, Sepaton Inc.) will only identify and eliminate the first two types of duplicate data, but their level of granularity can be as
small as a few bytes. These differences typically result in a dedupe ratio draw, but can yield significant differences in certain environments, which is why most experts suggest you test multiple products.

Because roughly half of the duplicate data in most backup data comes from multiple full backups, people using IBM Tivoli Storage Manager (TSM) as their backup product will experience lower deduplication ratios than customers using other backup products. This is due to TSM's progressive incremental feature that allows users to never again do a full backup on file systems being backed up by TSM. However, because TSM users perform full backups on their databases and applications, and because full backups aren't the only place where duplicate data is found, TSM users can still benefit from deduplication systems; their dedupe ratios will simply be smaller.

The second type of duplicate data comes from incremental backups, which contain versions of files or applications that have changed since the most recent full backup. If a file is modified and backed up every day, and the backup system retains backups for 90 days, there will be 90 versions of that file in the backup system. A deduplication system will identify the segments of data that are unique and redundant among those 90 different versions and store only the unique segments. However, there are file types that do not have different versions (such as photos or imaging data and PDF files); every file is unique unto itself and is not a version of a previous iteration of the same file. An incremental backup that contains these types of files contains completely unique data, so there is nothing to deduplicate them against. Since there is a cost associated with deduplicated storage, customers with significant portions of such files should consider not storing them on a deduplication system, as they will gain no benefit and only increase their cost.
DATA DEDUPLICATION SYSTEMS AND ENCRYPTION: WHAT TO WATCH OUT FOR

Data deduplication systems work by finding and eliminating patterns; encryption systems work by eliminating patterns. Do not encrypt backup data before it is sent to the deduplication system; your deduplication ratio will be 1:1 if you do.
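A quick demonstration of why: encrypting the same segment twice with a fresh random nonce produces two different ciphertexts, so a hash-based dedupe engine never sees a match. This sketch assumes the third-party cryptography package (pip install cryptography); the point, not the particular cipher, is what matters.

```python
import hashlib, os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
segment = b"the same 8K backup segment" * 300

def encrypt(seg: bytes) -> bytes:
    nonce = os.urandom(12)  # fresh nonce on every backup run, as it should be
    return nonce + AESGCM(key).encrypt(nonce, seg, None)

h1 = hashlib.sha256(encrypt(segment)).hexdigest()
h2 = hashlib.sha256(encrypt(segment)).hexdigest()
print(h1 == h2)  # False: identical plaintext, zero duplicates, 1:1 ratio
```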
Compression works a little like encryption in that it also finds and eliminates patterns, but in a very different way. The way most compression systems work results in a scrambling of the data that has a similar effect as encryption; it can also completely remove all abilities of your deduplication system to deduplicate the data.

The compression challenge often results in a stalemate between database administrators who want their backups to go faster and backup admins who want their backups to get deduped. Since databases are often created with very large capacities and very small actual amounts of data, they tend to compress very well. This is why turning on the backup compression feature often results in database backups that go two to four times faster than they do without compression. The only way to get around this particular challenge is to use a backup software product that has integrated source dedupe and client compression, such as CommVault Simpana, IBM TSM or Symantec NetBackup.
MULTIPLEXING AND DEDUPLICATION SYSTEMS

The next dedupe challenge with backups only applies to companies using virtual tape libraries (VTLs) and backup software that supports multiplexing. Multiplexing several different backups to the same tape drive can also scramble the data and completely confound all dedupe. Even products that are able to decipher the different backup streams from a multiplexed image (e.g., FalconStor, Sepaton) tell you not to multiplex backups to their devices because it simply wastes time.
CONSIDER THE DEDUPE TAX

The final backup dedupe challenge has to do with the backup window. The way that some deduplication systems perform the dedupe task actually results in a slowdown of the incoming backup. Most people don't notice this because they are moving from tape to disk, and a dedupe system is still faster. However, users who are already using disk staging may notice a reduction in backup performance and an increase in the amount of time it takes to back up their data. Not all products have this particular characteristic, and the ones that do demonstrate it in varying degrees; only a proof-of-concept test in your environment will let you know for sure.

The restore challenge is much easier to understand; the way most deduplication systems store data results in the most recent backups being written in a fragmented way. Restoring deduplicated backups may therefore take longer than it would have taken if the backup had not been deduplicated. This phenomenon is referred to as the dedupe tax.
When considering the dedupe tax, think about whether or not you're planning to use the dedupe system as the source for tape copies, because it is during large restores and tape copies that the dedupe tax is most prevalent. Suppose, for example, that you plan on using LTO-5 drives that have a native speed of 140 MBps and a native capacity of 1.5 TB. Suppose also that you have examined your full backup tapes and have discovered that you consistently fit 2.25 TB of data on your 1.5 TB tapes, meaning that you're getting a 1.5:1 compression ratio. This means that your 140 MBps tape drive should be running at roughly 210 MBps during copies. Make sure that during your proof of concept you verify that the dedupe system is able to provide the required performance (210 MBps in this example). If it cannot, you may want to consider a different system.
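The arithmetic from that example, spelled out with the numbers given in the text:

```python
native_mbps = 140                 # LTO-5 native transfer rate
native_tb, actual_tb = 1.5, 2.25  # rated capacity vs. what fits in practice
compression = actual_tb / native_tb          # 1.5:1 observed on real tapes
required_mbps = native_mbps * compression    # dedupe read speed needed
print(required_mbps)              # 210.0 MBps to keep the drive streaming

copy_hours = actual_tb * 1_000_000 / required_mbps / 3600
print(round(copy_hours, 1))       # ~3.0 hours per full tape copy
```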
The final challenge with deduplicated restores is that they are still restores, which is why dedupe is not a panacea. A large system that must be restored still requires a bulk copy of data from the dedupe system to the production system. Only a total architectural change of your backup system from traditional backup to something like continuous data protection (CDP) or near-CDP can address this particular challenge, as they offer restore times measured in seconds, not hours.

Data deduplication systems offer the best hope for making significant enhancements to your current backup and recovery system without making wholesale architectural changes. Just be sure that you are aware of the challenges of dedupe before you sign a purchase order.

W. Curtis Preston is an independent backup expert.
Global deduplication implementation varies by product

Global deduplication has been integrated in a number of dedupe products. Discover what "global" really means and the benefits global dedupe can provide. By Eric Slack

Deduplication initially gained prominence in disk backup appliances almost 10 years ago. A number of improvements have occurred along the way that upgraded its overall performance and effectiveness in backup systems as well as in primary storage. Global deduplication, developed as a method to increase efficiency and enable greater scalability, has been integrated into products in a number of categories.

Briefly, the dedupe process parses an input data stream into (typically) sub-file-sized blocks, runs a hashing algorithm on them (somewhat like a checksum) and creates a unique identifier for each. These hash keys are
then stored in an index or hash table, which is used to compare subsequent data blocks and determine which blocks are duplicates. When a duplicate is encountered, a pointer to the existing block is created, instead of storing the block a second time. This way, only unique blocks and hash keys are stored and redundant blocks are eliminated, or deduplicated, from the data set.

A dedupe system's effectiveness is a function of its ability to find duplicate blocks, which, in turn, is directly related to the size of the pool of blocks it can store and represent in the hash table. In general, more blocks and a larger hash table mean better deduplication. Also, dedupe systems need to scale as storage growth continues, without impacting performance, further driving the need for a larger pool of data blocks to support the dedupe process. Global deduplication is the way many dedupe vendors are addressing this requirement.

But exactly how to make that pool larger depends on the implementation (where the dedupe engine sits) and the architecture of the storage system it's connected to. This has led to dedupe systems sharing hash keys in an effort to expand the number of blocks compared by different dedupe engines. It's also led to an expansion of block pools to support greatly scaled, clustered storage systems, and a method for sharing the correspondingly large hash tables resident on multiple storage modules.
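A sketch of the difference a shared index makes. Two nodes ingest the same block; with per-node local indexes the cross-node duplicate is stored twice, while with one shared global index it is stored once. The Node class and its dictionaries are illustrative assumptions, not any vendor's architecture.

```python
import hashlib

class Node:
    def __init__(self, shared_index=None):
        # Global dedupe: all nodes consult one shared hash table.
        # Local dedupe: each node keeps its own.
        self.index = shared_index if shared_index is not None else {}

    def ingest(self, block: bytes) -> bool:
        h = hashlib.sha256(block).hexdigest()
        if h in self.index:
            return False            # duplicate found: store only a pointer
        self.index[h] = block       # unique: store the block
        return True                 # one more copy consumed disk

block = b"same OS image block" * 400

local_a, local_b = Node(), Node()
stored_local = local_a.ingest(block) + local_b.ingest(block)

shared = {}
global_a, global_b = Node(shared), Node(shared)
stored_global = global_a.ingest(block) + global_b.ingest(block)

print("local copies stored:", stored_local)    # 2: duplicate missed across nodes
print("global copies stored:", stored_global)  # 1: the shared index finds it
```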
DEFINING "GLOBAL"

The term "global" doesn't refer to a consistent process or architecture when applied to deduplication. It's most commonly used as a relative term, meant to differentiate a process that makes use of an expanded or shared index (global dedupe) compared with one that has a single index (local dedupe). When implemented in backup software (most enterprise backup software applications do have some kind of global deduplication functionality), this index or hash table is shared among individual dedupe processing engines as a method for improving dedupe efficiency and reducing data handling. In hardware, it's more typically a method for scaling the dedupe system, by sharing a larger pool of common blocks and a larger hash table among multiple controllers or storage modules; EMC Corp.'s Data Domain takes this approach in global deduplication hardware.
Some manufacturers have called their systems "global" when all they've done is connect independent modules together with replication software, without actually sharing any blocks or an index, so these systems don't actually apply dedupe across the group of independent modules. While they certainly have the right to promote their products as they see fit, care must be taken to understand the functionality that they claim makes them global.
GLOBAL DEDUPLICATION IN SOFTWARE

In backup software, source-side dedupe runs on the client, or application, server, where data blocks are hashed and keys are created by the backup client. But each key is compared with a hash table that's stored on the backup server, media agent or dedicated backup storage hardware, not on the client server. Unique blocks and keys are sent to these same devices, and duplicates are referenced by the client. All clients share the same global index and pool of unique blocks, as opposed to earlier client-side local dedupe processes that only compared blocks within each client server's backup jobs. In general, source-side dedupe was made significantly more effective when a shared hash table was implemented. As a variant, some applications have the option to run software dedupe on the media agent or backup server instead of the client, a process called target-side dedupe, similar to the way hardware deduplication is implemented.
GLOBAL DEDUPLICATION IN HARDWARE

One of the earliest implementations of dedupe was within a dedicated backup appliance, which connected to the backup server and presented itself as a NAS device or virtual tape library (VTL). It was essentially a large local dedupe system, since the process was performed in a single box. In response to the need to scale, however, some of these target-side dedupe hardware vendors have also come out with their version of global dedupe. These systems essentially combine two separate dedupe processors with an expanded storage capacity and share the hash table between them, enabling them to scale into the low hundreds of terabytes. There are other hardware-based dedupe systems that have leveraged this same shared-controller/shared-hash-table design but expanded it to a multiple-node architecture to support even larger data sets.

Similar to the target-side dedupe appliance, there are backup systems that scale by adding modules or nodes. The dedupe processing is done by a dedicated node or nodes, which share the index across all the nodes storing data blocks. These systems, which are presented as a large NAS or VTL to the backup software, can scale into the petabyte range.
Some clustered storage systems, also called object-based storage, now offer deduplication but differ from clustered backup appliances. This node-based topology supports extremely large and distributed infrastructure, and its use of data objects instead of files is well-suited for distributed, global deduplication. These systems typically run the hash calculation on objects within each node, compiling a hash index for the node, but share its access with other nodes. Not specifically designed for backup, they represent one way that dedupe has moved into primary storage.

As a VAR, understanding various vendors' products in a space like deduplication is essential, since VARs are frequently called upon by customers to explain the differences between technologies that use the same identifier, like deduplication. Adding the label "global" to a dedupe system most often indicates that it's more efficient than a similar local system or that it can scale larger without degrading performance. To the extent that scalability is needed, a global data deduplication system would be more desirable than a local dedupe system, provided the cost is appropriate.

Eric Slack, a senior analyst for Storage Switzerland, has more than 20 years of experience in high-technology industries, holding technical management and marketing/sales positions in the computer storage, instrumentation, digital imaging and test equipment fields.
SPONSOR RESOURCES

SAN For Dummies, Chapter 13: Deduplication
Using FalconStor FDS as a Backup Target for Veritas NetBackup
How Data Deduplication Works
Data Deduplication for Dummies
Quantum Targets Disk-Based Backup Price-Performance Leadership with New DXi 2.0
7 Deduplication Questions to Consider Before Deploying for the Midrange Data Center