

STORAGE
Managing the information that drives the enterprise

ESSENTIAL GUIDE TO
Data Deduplication and Backup

INSIDE
Data dedupe tutorial
Source-side deduplication
How to restore deduped data
Solving common dedupe problems
What you need to know about global dedupe

Whether you're new to data deduplication or want to brush up on the fundamentals, we have all the answers in our essential guide on data dedupe and backup.

http://www.techtarget.com/

Quantum's DXi-Series appliances with deduplication provide higher performance at lower cost than the leading competitor.

Quantum has helped some of the largest organizations in the world integrate deduplication into their backup process. The benefits they report are immediate and significant: faster backup and restore, 90%+ reduction in disk needs, automated DR using remote replication, and reduced administration time, all while lowering overall costs and improving the bottom line.

Our award-winning DXi-Series appliances deliver a smart, time-saving approach to disk backup. They are acknowledged technical leaders. In fact, our DXi6500 was just nominated as a Best Backup Hardware finalist in Storage magazine's Best Product of the Year Awards; it's both faster and up to 45% less expensive than the leading competitor.

Get more bang for your backup today. Faster performance. Easier deployment. Lower cost.

Preserving The World's Most Important Data. Yours.

Contact us to learn more at (866) 809-5230 or visit www.quantum.com/dxi

© 2011 Quantum Corporation. All rights reserved.


EDITORIAL | RICH CASTAGNA

Dedupe debate isn't why, but how

Data deduplication for backup might not be a brand-new technology anymore, but there are plenty of new developments and capabilities, and a lot of decisions to make before deploying dedupe.

IF YOU ASK storage managers what keeps them up at night, the answer will be backup. But over the past few years, data deduplication for backup has emerged, making backup a little less formidable while focusing more attention on data protection than ever before.

The premise of backup data deduplication is simple: save less redundant data and you'll be able to keep more on disk for a longer period of time. That's important because, as any backup admin knows, backing up isn't the hard part; it's restoring lost or damaged data that's tough. By having a week's or even a month's worth of data on nearline storage, those restores are a lot easier and faster compared to restores from tape.

According to our survey research, about 27% of storage shops are already using data dedupe in their backup operations. That number has been climbing steadily since we started tracking it in 2006, when less than 5% were using dedupe. It's likely, too, that by the end of the year there will be even more dedupe deployments to tally, as 30% said they're planning implementations.

As with most other storage technologies, opting for data deduplication in backup is a lot more complex than making a simple go/no-go decision. There are a number of key variations on the technology that can make choosing somewhat difficult, but that variety can also help ensure you get the specific capabilities and features your environment requires.

And as dedupe technology matures, the list of options grows. For instance, source-side dedupe, where your backup app handles deduplication chores, now offers solid alternatives to the dedupe appliances that had dominated the market.

You can also expect that your tried-and-true backup processes may need to be significantly altered. A lot of that alteration will involve the elimination of burdensome, time-consuming tasks, but other tasks may have to be modified

to accommodate the new deduped environment. Restoring data is a good example: deduplicated data must be rehydrated before it can be restored to production.

Data deduplication is probably the biggest thing to happen to backup since disk was inserted into the process, but to reap the greatest benefit from this technology, you'll need to be armed with the facts. This guide addresses all of these issues and more. It's designed to help you make more informed choices and to be aware of any potential pitfalls on the road to backup efficiency.

Rich Castagna is the editorial director of TechTarget's Storage Media Group.

Copyright 2011, TechTarget. No part of this publication may be transmitted or reproduced in any form, or by any means, without permission in writing from the publisher. For permissions or reprint information, please contact Mike Kelly, VP and Group Publisher ([email protected]).


Deduplication in data backup environments tutorial

Learn everything you need to know about data deduplication and backup in this tutorial. By W. Curtis Preston

DATA DEDUPLICATION IS one of the biggest game-changers in data backup and data storage in the past several years, and it is important to have a firm understanding of its basics if you're considering using it in your environment. When the term "deduplication," also referred to as data dedupe or data deduping, is used without any qualifiers (e.g., file-level dedupe), we are typically referring to subfile-level deduplication. This means that individual files are broken down into segments, and those segments are examined for commonality. If two segments are deemed to be the same (even if they are in different files), then one of the segments is deleted and replaced with a pointer to the other segment. Segments that are deemed to be new or unique are, of course, stored as well.

Different files, even if they are in different filesystems or on different servers, may have segments in common for a number of reasons. In backups,


duplicate segments between files might indicate that the same exact file exists in multiple places. Duplicates are also created when performing repeated full backups of the same servers. Finally, duplicate segments are created when performing incremental backups of files. Even if only a few bytes of a file have changed, the entire file is usually backed up by the backup system. If you break that file down into segments, most of the segments between different versions of the same file will be the same, and only the new, unique segments need to be stored.
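To make the segment-and-pointer idea concrete, here is a minimal Python sketch of subfile-level deduplication. It is illustrative only, not any vendor's implementation; the fixed 8 KB segment size and the use of SHA-256 are assumptions for the example. Each unique segment is stored once, and a backup is recorded as an ordered list of pointers (hashes).

import hashlib

SEGMENT_SIZE = 8 * 1024  # fixed 8 KB segments; real products often use variable-length chunking

class DedupeStore:
    # Toy subfile-level dedupe store: unique segments keyed by SHA-256.
    def __init__(self):
        self.segments = {}   # hash -> segment bytes, stored once
        self.catalog = {}    # backup name -> ordered list of segment hashes (the pointers)

    def backup(self, name, data):
        hashes = []
        for i in range(0, len(data), SEGMENT_SIZE):
            segment = data[i:i + SEGMENT_SIZE]
            digest = hashlib.sha256(segment).hexdigest()
            self.segments.setdefault(digest, segment)  # only new, unique segments are written
            hashes.append(digest)
        self.catalog[name] = hashes

    def restore(self, name):
        # Rehydrate by following the pointers back to the stored segments.
        return b"".join(self.segments[h] for h in self.catalog[name])

store = DedupeStore()
original = b"A" * 64 * 1024
store.backup("monday_full", original)
store.backup("tuesday_full", original + b"B" * 1024)  # only the changed tail is new
assert store.restore("monday_full") == original
print("unique segments stored:", len(store.segments))  # 2

Running it shows that the second, slightly changed backup adds only one new segment to the store; everything else is represented by pointers to segments that were already there.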

WHERE YOUR DATA IS DEDUPED: INLINE VS. POST-PROCESSING DEDUPLICATION

The two primary approaches, inline deduplication and post-processing deduplication, are roughly analogous to synchronous and asynchronous replication. Inline deduplication is like synchronous replication in that it does not acknowledge a write until a segment has been determined to be unique or redundant; the original, native data is never written to disk. In an inline system, only new, unique segments are written to disk. Post-process deduplication is roughly analogous to asynchronous replication, as it allows the original data to be written to disk and deduplicated at a later time. "Later" can be seconds, minutes or hours later, depending on which system we are talking about and how it has been configured.

Inline vendors claim to be more efficient and to require less disk. Post-process vendors claim to allow for faster initial writes and faster read performance for more recent data, mainly because it is left stored in its native format. Both approaches have merits and limitations, and one should not select a product based on its position in this argument alone. One should select a product based on its price/performance numbers, which may or may not be affected by the choice to do inline or post-process deduplication.
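The practical difference is where deduplication sits on the write path. The sketch below is a simplification with invented function names, not any product's design: the inline path classifies a segment before the write is acknowledged and never lands native data, while the post-process path lands native data first and dedupes it during a later pass.

import hashlib

def inline_write(store, segment):
    # Inline: classify the segment before acknowledging the write; only new,
    # unique segments reach disk, the native data never does.
    digest = hashlib.sha256(segment).hexdigest()
    store.setdefault(digest, segment)
    return digest

def post_process_write(landing_zone, segment):
    # Post-process: acknowledge immediately and keep the native data in a
    # landing area; deduplication happens later.
    landing_zone.append(segment)

def post_process_pass(landing_zone, store):
    # The later pass (seconds, minutes or hours afterwards, depending on the
    # system and its configuration) dedupes the landed data.
    while landing_zone:
        inline_write(store, landing_zone.pop(0))

The post-process path needs that landing area, which is the extra disk inline vendors point to; the inline path does its hashing work while the backup is still arriving, which is the overhead post-process vendors point to.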

HOW IS DUPLICATE DATA IDENTIFIED?

There are three primary approaches to this question: hash-based, modified hash-based and delta differential. Hash-based vendors take segments of files and run them through a cryptographic hashing algorithm, such as SHA-1, Tiger,


or SHA-256, each of which creates a numeric value (such as 160 bits or 256 bits, depending on the algorithm) that can be compared against the numeric values of every other segment the dedupe system has ever seen. Two segments that have the same hash are considered to be redundant. A modified hash-based approach typically uses a much smaller hash (e.g., a cyclic redundancy check, or CRC, of only 16 bits) to see if two segments might be the same; they are referred to as redundancy candidates. If two segments look like they might be the same, a binary-level comparison verifies that they are indeed the same before one of them is deleted and replaced with a pointer.
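Here is a small sketch of the modified hash-based idea, assuming CRC-32 as the cheap candidate check (the article mentions CRCs as small as 16 bits; the exact width here is an assumption). The weak checksum only nominates redundancy candidates; a byte-for-byte comparison confirms the match before anything is treated as a duplicate.

import zlib

class VerifyingDedupe:
    # Toy "modified hash" dedupe: a small checksum flags redundancy candidates,
    # then a binary-level comparison verifies them before deduplicating.
    def __init__(self):
        self.by_crc = {}  # crc32 value -> list of stored segments with that CRC

    def add(self, segment):
        crc = zlib.crc32(segment)
        for candidate in self.by_crc.get(crc, []):
            if candidate == segment:  # binary-level verification
                return "duplicate"    # would become a pointer in a real system
        self.by_crc.setdefault(crc, []).append(segment)
        return "unique"

store = VerifyingDedupe()
print(store.add(b"segment A"))  # unique
print(store.add(b"segment A"))  # duplicate: CRC match confirmed byte for byte

A full cryptographic hash skips that verification read because the hash itself is trusted as the identity of the segment.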

Delta differential systems attempt to associate larger segments with each other (e.g., two full backups of the same database) and do a block-level comparison of them against each other. The delta differential approach is only useful in backup systems, as it only works when comparing multiple versions of the same data to each other. This does not happen in primary storage; therefore, all primary storage deduplication systems use either the hash-based or the modified hash-based approach to identifying duplicate data.

TARGET DEDUPE VS. SOURCE AND HYBRID APPROACHES

Where should duplicate data be identified? This question only applies to backup systems, and there are three possible answers: target, source and hybrid. A target deduplication system is used as a target for regular (non-deduplicated) backups, and is typically presented to the backup server as a NAS share or virtual tape library (VTL). Once the backups arrive at the target, they are deduplicated (either inline or as a post-process) and written to disk. This is referred to as target dedupe, and its main advantage is that it allows you to keep your existing backup software.

If you're willing to change backup software, you can switch to source deduplication, where duplicate data is identified at the server being backed up, before it is sent across the network. If a given segment or file has already been backed up, it is not sent across the LAN or WAN again; it has been deduped at the source. The biggest advantage to this approach is the savings in bandwidth, making source dedupe the perfect solution for remote and mobile data.

The hybrid approach requires a little more explanation. It is essentially a target deduplication system, as redundant data is not eliminated until it reaches the target; however, it is not as simple as that. Remember that to deduplicate data, the files must first be broken down into segments. In a hash-based approach, a numeric value, or hash, is then calculated on the


segment, and then that value is looked up in the hash table to see if it has been seen before. Typically, all three of these steps are performed in the same place, either at the source or the target. In a hybrid system, the first one or two steps can be done on the client being backed up, and the final step can be done on the backup server. The advantage of this approach (over typical target approaches) is that data may be compressed or encrypted at the client. Compressing or encrypting data before it reaches a typical target deduplication system would significantly impact your dedupe ratio, possibly eliminating it altogether. But this approach allows for both compression and encryption before data is sent across the network.
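A rough sketch of that hybrid split is shown below; the function names and the 8 KB segment size are illustrative, not taken from any particular product. The client segments and hashes the plain data, then compresses each segment before sending it; the backup server performs the final hash lookup, so the compressed payload has no effect on whether a segment is recognized as a duplicate.

import hashlib
import zlib

def client_prepare(data, segment_size=8192):
    # Hybrid client: segment and hash the plain data, then compress each
    # segment before it crosses the network.
    prepared = []
    for i in range(0, len(data), segment_size):
        segment = data[i:i + segment_size]
        digest = hashlib.sha256(segment).hexdigest()
        prepared.append((digest, zlib.compress(segment)))
    return prepared

def server_ingest(index, prepared):
    # Hybrid target: the final lookup happens here; only segments whose hashes
    # are new get stored, regardless of how the payload was transformed.
    new_segments = 0
    for digest, payload in prepared:
        if digest not in index:
            index[digest] = payload
            new_segments += 1
    return new_segments

index = {}
print(server_ingest(index, client_prepare(b"same data" * 4096)))  # first backup stores segments
print(server_ingest(index, client_prepare(b"same data" * 4096)))  # identical backup stores none

Encryption could be applied at the same point as the compression step, because the identity of each segment has already been captured in its hash.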

PRIMARY STORAGE DEDUPLICATION

Deduplication is also used in primary data storage, where duplicate data is not as common, but it does exist. Just as in backups, the same exact file may reside in multiple places, or end users may save multiple versions of the same file as a way of protecting themselves against fat-finger incidents. One type of data that has a lot of commonality between different files is system images for virtualization systems. The C: (or root) drive for one system is almost exactly the same as the C: (or root) drive for another system. A good deduplication system will identify all of those common files and segments and replace them with a single copy.

Whether we're talking about backups or primary storage, the amount of disk saved is highly dependent on the type of data being stored and the amount of duplicate segments found within that data. Typical savings in backup range from 5:1 to 20:1 and average around 10:1, with users who do frequent full backups tending toward the higher ratios. Savings in primary storage are usually expressed in reduction percentages, such as "there was a 50% reduction," which sounds a lot better than a 2:1 deduplication ratio. Savings in primary storage range from 50% to 60% or more for typical data, and as much as 90% or more for things like virtual desktop images.
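The two ways of expressing savings are related by a simple formula: an R:1 ratio is the same as a reduction of 1 - 1/R. A quick helper makes the comparison the article draws explicit.

def reduction_percent(ratio):
    # Convert a dedupe ratio (e.g., 10 means 10:1) to a reduction percentage.
    return (1 - 1 / ratio) * 100

for ratio in (2, 10, 20):
    print(f"{ratio}:1 dedupe ratio = {reduction_percent(ratio):.0f}% reduction")
# 2:1 = 50%, 10:1 = 90%, 20:1 = 95%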

YOUR MILEAGE WILL VARY

There is no "may" about it: your mileage will vary when data deduping. Your dedupe performance and data deduplication ratio will be significantly different


Source-based deduplication explained

Most of the leading data backup apps have source deduplication. Find out whether or not source dedupe can help your organization shorten your backup windows, and learn about the pros and cons of source vs. target deduplication. By Andrew Burton

JULIAN COOPER, senior IT administrator at Integrated Control Corp. (ICC), recently deployed source-based deduplication after an overhaul of the medium-sized business's backup strategy. The company was performing backup to tape but was struggling with slow restores and failed backups. After moving to a disk backup approach, Cooper began looking for ways to reduce backup data and considered both data deduplication and archiving.

Most of the leading data backup applications now include source deduplication, including CA Inc.'s ARCserve Backup, CommVault Systems Inc.'s Simpana, EMC Corp.'s Avamar, IBM Corp.'s Tivoli Storage Manager (TSM), and Symantec Corp.'s Backup Exec and NetBackup. ICC is a Symantec


Backup Exec shop, and they back up to a Dell PowerVault MD1000 direct-attached storage (DAS) array onsite and use the Symantec Online Backup Service for off-site backup. "We were backing up between 350 GB and 375 GB for full backups," Cooper said. "When we started deduping, we saved about 50 GB [on a full backup], so that was a huge benefit."

And, though they were using Backup Exec before they decided to run deduplication, Cooper took his time evaluating how running source deduplication would affect the rest of his environment. The decision to use it wasn't as simple as just turning it on. He said, "I wish it was that simple. I wish [products] always just worked. Then it'd be so easy, like 'oh, new product? This is great.'

"It was a combination of things. What's the cost? What's the learning curve? What's the cost of potentially expanding out my servers? In the end, it made more sense to use what we had more efficiently rather than buying more or bigger systems," he added.

BEST ENVIRONMENTS FOR SOURCE DEDUPE

As the name implies, source dedupe products process deduplication on the application, or client, server before sending data across a network to the backup target. This is a compelling benefit for users looking to alleviate bandwidth constraints. For example, a company might choose source-based dedupe when backing up a remote office to a central data center, reducing the amount of data they have to send across the WAN. This is a major driver for source dedupe today, according to Jeff Boles, senior analyst with the Taneja Group. "If you've only got a couple machines at a remote office, maybe you don't see a need to invest in an expensive Riverbed-type WAN optimization appliance," he said.

Within the data center, reducing data at the source can take a lot of strain off your local network. This can be particularly useful in virtualized environments. "If you look at the data within a virtual machine disk file, there is a lot of redundancy across virtual machines," said Lauren Whitehouse, senior analyst with Enterprise Strategy Group. "If you look at a physical system, you have your operating system running once and then whatever applications and data. On a host system running multiple virtual machines, that operating system exists multiple times."


However, there is a tradeoff, said Whitehouse. The I/O processing to perform the deduplication might put a strain on the physical server being shared by the virtual machines. She went on to say that while you might see resource contention while the backup is running, deduplicated backups take considerably less time to complete. So, it's a matter of weighing one against the other.

Cooper is currently evaluating just that in ICC's environment. He has deployed server virtualization using Microsoft's Hyper-V, but has not virtualized tier-one critical apps yet. "The main thing I want to see is the performance impact of deduping on the virtual system itself," he said. "Anything you add on to the virtual system takes resources, so you have to find the balance between the benefits you see and the strain you are putting on the server." On the flip side, "[source deduplication] reduces the horsepower you need in your storage destination, because the target is not responsible for processing the deduplication," said Boles.

Cooper also said that he uses Symantec's backup reporting tools to further optimize backups. "The reporting makes things so much easier when, say, we have an extra 40 gigs in a month," he said. "What's included in the backup? Is it an anomaly? Is it natural growth? Can we archive some of this? It helps us understand what's going on in our environment."

SOURCE VS. TARGET DEDUPE

Certain organizations may not be able to live with the performance hit on physical servers that comes with source dedupe. "If you have a high performance, [and are] processing [an] intensive environment that doesn't have a lot of downtime, you might be hard pressed to implement source dedupe gracefully," said Boles. That type of environment is generally better served by a target-based system such as Quantum Corp.'s DXi series, IBM's ProtecTIER, NEC Corp.'s Hydrastor series, FalconStor Software Inc.'s File-interface Deduplication System (FDS), or EMC's Data Domain series.

Target deduplication may also be a better fit for an organization running a backup application that does not have built-in deduplication capabilities, or running multiple backup applications. Some organizations may find that source makes sense for some backup jobs and target makes sense for others. Today, some backup software vendors, such as CommVault and Symantec,


are responding to this need in the market, offering products that can perform deduplication at the source or at the backup target. However, these products are just emerging.

Finally, there is a perception that source-based deduplication is cheaper than target-based deduplication. "The fact that it is built into a lot of backup software solutions may contribute to the perception that it is cheaper," said Whitehouse. "But most of those vendors typically charge extra for that feature. And, just because you are not buying a target deduplication system doesn't mean that you don't need storage. I think if you look at the total cost of ownership it might be cheaper, but not significantly so."

Andrew Burton is the site editor of SearchDataBackup.


Restoring deduped data in backup environments

Restoring deduped data in some deduplication systems can be slower than restores from tape. Find out how to analyze your dedupe system's performance and how to speed up your restores. By W. Curtis Preston

RESTORES FROM SOME data dedupe systems can be slower than modern tape drives, and restores from other deduplication systems can be faster than tape could ever dream of. In this column, you will learn what you need to do to make sure your dedupe system is on the right side of those two extremes. To explain why there is what many call a "dedupe tax," we need to go back in time to when there was no data deduplication technology. Before dedupe, we wrote data to tape or disk in contiguous blocks. Writing contiguous blocks


of backup data to disk requires either an empty filesystem or a disk system designed for that purpose, such as a virtual tape library (VTL). The blocks that comprised a given backup were all located in proximity to each other.

Data backup systems also occasionally perform a full backup, a synthetic full backup, or otherwise collocate files necessary for a complete restore (e.g., IBM Tivoli Storage Manager's active data pools). These typical behaviors of backup systems meant that the bulk of the blocks needed to restore a given system would all be placed contiguously on disk or tape, making a complete restore of that system very easy to accomplish. Most would agree that the best thing for fast restores would be to have a recent full backup (or updated active data pool) on disk, and to have any subsequent backups also on disk. This is why it comes as a surprise to many when they read that restores from a deduplication system (which is almost always on disk) could be slower than what they're used to, even slower than from tape.

The reason for this is that no matter how you back up your data to a deduplication system, it is rarely stored contiguously on disk. Since the blocks that comprise the latest backup were actually created over time, those blocks will be stored all over the dedupe system, based on when they were backed up. A restore from deduped data is therefore a very fragmented read from disk. Instead of a single disk seek followed by a large disk read of hundreds or thousands of blocks, you could have hundreds of disk seeks and reads from hundreds of different disk drives.

ANALYZING YOUR DEDUPLICATION SYSTEM'S PERFORMANCE

So how is it that some deduplication systems will have better read performance than other systems? It depends on a few things. The first factor is the manner in which the systems store data each day and the degree to which their data storage method fragments the data. The second factor is the degree to which they do things to mitigate the negative restore effects of their data storage methods. Finally, some dedupe systems may have a single-stream limitation that can impact their restore speed.

The first thing that affects how dedupe systems store data is what they do when they find a match to an older segment of data. Do they leave the old


segment in place and write a pointer to the older segment instead of writing the new segment to disk (reverse referencing)? Or do they write the new segment to disk, delete the older segment and replace it with a pointer to the new segment (forward referencing)? Deduplication systems that continually write pointers for newer, redundant segments may result in more fragmentation for newer data. Systems that always write newer data and delete older data may result in more contiguously stored data for newer backups. Forward referencing is only possible in post-processing deduplication systems. This method of storing data should result in faster restores of more recent data, which is where most restores come from.
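The toy model below (our simplification, not any product's layout engine) shows why the choice matters for the newest backup. Segments are written to consecutive disk positions; with reverse referencing a repeated segment keeps its original, older position, while with forward referencing the newest copy is written contiguously and older backups are repointed to it (the superseded copy's space is reclaimed, so the data is still deduplicated).

def layout_of_latest_backup(backups, mode):
    # backups: list of backups, each an ordered list of segment IDs.
    # Returns the disk positions the latest backup must read.
    location = {}  # segment ID -> position of the copy currently kept on disk
    cursor = 0     # next free contiguous disk position
    for backup in backups:
        for seg in backup:
            if seg not in location or mode == "forward":
                location[seg] = cursor  # forward referencing repoints to the newest write
                cursor += 1
    return [location[seg] for seg in backups[-1]]

fulls = [["a", "b", "c", "d"], ["a", "x", "c", "d"], ["a", "x", "y", "d"]]
print("reverse:", layout_of_latest_backup(fulls, "reverse"))  # [0, 4, 5, 3], scattered
print("forward:", layout_of_latest_backup(fulls, "forward"))  # [8, 9, 10, 11], contiguous

The reverse-referenced latest backup reads from scattered, older positions; the forward-referenced one reads a contiguous run, which is exactly the behavior described above.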

The next thing to consider is whether the system tries to collocate segments that it can collocate. When a system is accepting a stream of backup data, it may have some knowledge that certain data segments are associated with other data segments, and do its best to store them in a way that allows for more contiguous storage of related segments.

Another thing that may impact how data is stored is how the dedupe system does what is commonly referred to as garbage collection. As data is expired, some data segments will no longer be needed and can be discarded. Some vendors delete such segments with no regard to how this increases the fragmentation of the remaining data. Other vendors consider such things during their garbage collection process, which is typically run on a daily basis.

MITIGATING THE DATA DEDUPE TAX

Vendors do a number of things to mitigate the dedupe tax. Some vendors do extra work during their garbage collection process to actually relocate related segments together, especially those segments that do not have anything in common with other backups. Think of this as a fancy defragmentation process. Other vendors keep last night's backup stored on disk in its native format so that if someone asks to restore (or copy) from last night's backup, they can satisfy that need without bringing dedupe into the picture. While this is very desirable from a restore-speed perspective, it is another technology that requires a post-process architecture, which carries with it the concept of a landing zone where backups are stored in their native format before they are deduped. That landing zone will require extra disk that is not required in an inline configuration.


Finally, some vendors also suffer from a single-stream limitation. That is, their architecture does not allow a single stream of data out of their product any faster than n MBps. If you plan on copying data from a deduplication system to tape, its single-stream restore speed capabilities are paramount. This is because a tape copy is essentially a very demanding restore; not only does the device have to reassemble all the appropriate bits back to their native form, it must do so in a very fast single stream of data. There is no point in trying to copy data to an LTO-5 tape drive that wants 240 MBps (assuming 1.5:1 compression) if the fastest your deduplication system can supply is 90 MBps.

Suffice it to say that not all dedupe systems are created equal when it comes to restore speed. Make sure you do your homework.

W. Curtis Preston is an independent backup expert.


Solving common data deduplication system problems

Still not getting the deduplication ratios vendors are promising? Learn about the most common problems with deduplication systems and how to fix them. By W. Curtis Preston

IT'S BEEN SAID that we never really solve any problems in IT; we just move them. Data deduplication is no exception to that rule. While deduplication systems have helped make data backup and recovery much easier, they also come with a number of challenges. The savvy storage or backup administrator will familiarize themselves with these challenges and do whatever they can to work around them.

Your backup system creates duplicate data in three different ways: repeated full backups of the same filesystem or application; repeated incremental backups of the same file or application; and backups of files that happen to reside in multiple places (e.g., the same OS/application on multiple machines). Hash-based deduplication systems (e.g., CommVault Systems Inc., EMC Corp., FalconStor Software, Quantum Corp., Symantec Corp.) will identify and eliminate all three types of duplicate data, but their level of granularity is limited to their chunk size, which is typically 8 KB or larger. Delta-differential-based deduplication systems (e.g., IBM Corp., ExaGrid Systems, Sepaton Inc.) will only identify and eliminate the first two types of duplicate data, but their level of granularity can be as


small as a few bytes. These differences typically result in a dedupe-ratio draw, but can yield significant differences in certain environments, which is why most experts suggest you test multiple products.

Because roughly half of the duplicate data in most backup data comes from multiple full backups, people using IBM Tivoli Storage Manager (TSM) as their backup product will experience lower deduplication ratios than customers using other backup products. This is due to TSM's progressive incremental feature that allows users to never again do a full backup on filesystems being backed up by TSM. However, because TSM users perform full backups on their databases and applications, and because full backups aren't the only place where duplicate data is found, TSM users can still benefit from deduplication systems; their dedupe ratios will simply be smaller.

The second type of duplicate data comes from incremental backups, which contain versions of files or applications that have changed since the most recent full backup. If a file is modified and backed up every day, and the backup system retains backups for 90 days, there will be 90 versions of that file in the backup system. A deduplication system will identify the segments of data that are unique and redundant among those 90 different versions and store only the unique segments. However, there are file types that do not have different versions (such as photos, imaging data and PDF files); every file is unique unto itself and is not a version of a previous iteration of the same file. An incremental backup that contains these types of files contains completely unique data, so there is nothing to deduplicate it against. Since there is a cost associated with deduplicated storage, customers with significant portions of such files should consider not storing them on a deduplication system, as they will gain no benefit and only increase their cost.

DATA DEDUPLICATION SYSTEMS AND ENCRYPTION: WHAT TO WATCH OUT FOR

Data deduplication systems work by finding and eliminating patterns; encryption systems work by eliminating patterns. Do not encrypt backup data before it is sent to the deduplication system, or your deduplication ratio will be 1:1. Compression works a little like encryption in that it also finds and eliminates patterns, but in a very different way. The way most compression systems work


results in a scrambling of the data that has a similar effect as encryption; it can completely remove your deduplication system's ability to deduplicate the data.

The compression challenge often results in a stalemate between database administrators who want their backups to go faster and backup admins who want their backups to get deduped. Since databases are often created with very large capacities and very small actual amounts of data, they tend to compress very well. This is why turning on the backup compression feature often results in database backups that go two to four times faster than they do without compression. The only way to get around this particular challenge is to use a backup software product that has integrated source dedupe and client compression, such as CommVault Simpana, IBM TSM or Symantec NetBackup.
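The 1:1 warning is easy to demonstrate. The sketch below uses a toy XOR keystream purely as a stand-in for an encryption product (an assumption for illustration, not real cryptography): two full backups of identical data dedupe perfectly in the clear, but once each run is encrypted with its own session key they share nothing for the dedupe system to find.

import hashlib
import os

def toy_encrypt(data, key):
    # Stand-in stream cipher, for illustration only: XOR the data with a
    # SHA-256-derived keystream. Real products use real ciphers; the effect
    # on dedupe is the same.
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(data):
        keystream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, keystream))

def unique_segments(blob, size=4096):
    return {hashlib.sha256(blob[i:i + size]).hexdigest() for i in range(0, len(blob), size)}

full_backup = os.urandom(64 * 1024)                # the same data, backed up twice
monday = toy_encrypt(full_backup, os.urandom(32))  # each run encrypts with its own session key
tuesday = toy_encrypt(full_backup, os.urandom(32))

print("plain segments shared:", len(unique_segments(full_backup) & unique_segments(full_backup)))
print("encrypted segments shared:", len(unique_segments(monday) & unique_segments(tuesday)))  # 0

If encryption must happen before the data leaves the client, the integrated source-dedupe products named above, which capture segment identity before the data is transformed (as described in the hybrid approach earlier), are one way to keep the ratio intact.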

MULTIPLEXING AND DEDUPLICATION SYSTEMS

The next dedupe challenge with backups only applies to companies using virtual tape libraries (VTLs) and backup software that supports multiplexing. Multiplexing several different backups to the same tape drive can also scramble the data and completely confound all dedupe. Even products that are able to decipher the different backup streams from a multiplexed image (e.g., FalconStor, Sepaton) tell you not to multiplex backups to their devices because it simply wastes time.

CONSIDER THE DEDUPE TAX

The final backup dedupe challenge has to do with the backup window. The way that some deduplication systems perform the dedupe task actually results in a slowdown of the incoming backup. Most people don't notice this because they are moving from tape to disk, and a dedupe system is still faster. However, users who are already using disk staging may notice a reduction in backup performance and an increase in the amount of time it takes to back up their data. Not all products have this particular characteristic, and the ones that do demonstrate it in varying degrees; only a proof-of-concept test in your environment will let you know for sure.

The restore challenge is much easier to understand; the way most deduplication systems store data results in the most recent backups being written in a fragmented way. Restoring deduplicated backups may therefore take longer than it would have taken if the backup had not been deduplicated. This phenomenon is referred to as the dedupe tax.


When considering the dedupe tax, think about whether or not you're planning to use the dedupe system as the source for tape copies, because it is during large restores and tape copies that the dedupe tax is most prevalent. Suppose, for example, that you plan on using LTO-5 drives that have a native speed of 140 MBps and a native capacity of 1.5 TB. Suppose also that you have examined your full backup tapes and have discovered that you consistently fit 2.25 TB of data on your 1.5 TB tapes, meaning that you're getting a 1.5:1 compression ratio. This means that your 140 MBps tape drive should be running at roughly 210 MBps during copies. Make sure that during your proof of concept you verify that the dedupe system is able to provide the required performance (210 MBps in this example). If it cannot, you may want to consider a different system.
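That arithmetic generalizes to any drive and any observed compression ratio, so it is worth building into your proof-of-concept checklist. This small helper (ours, for illustration) reproduces the numbers above.

def required_stream_rate(native_mbps, native_capacity_tb, data_per_tape_tb):
    # Single-stream rate a dedupe system must sustain to keep a compressing
    # tape drive streaming, given how much data actually fits on a cartridge.
    compression_ratio = data_per_tape_tb / native_capacity_tb
    return native_mbps * compression_ratio

# The LTO-5 example from the text: 140 MBps native, 1.5 TB native capacity,
# 2.25 TB observed per tape -> 1.5:1 compression -> 210 MBps required.
print(required_stream_rate(140, 1.5, 2.25))  # 210.0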

The final challenge with deduplicated restores is that they are still restores, which is why dedupe is not a panacea. A large system that must be restored still requires a bulk copy of data from the dedupe system to the production system. Only a total architectural change of your backup system from traditional backup to something like continuous data protection (CDP) or near-CDP can address this particular challenge, as they offer restore times measured in seconds, not hours.

Data deduplication systems offer the best hope for making significant enhancements to your current backup and recovery system without making wholesale architectural changes. Just be sure that you are aware of the challenges of dedupe before you sign a purchase order.

W. Curtis Preston is an independent backup expert.


Global deduplication implementation varies by product

Global deduplication has been integrated in a number of dedupe products. Discover what "global" really means and the benefits global dedupe can provide. By Eric Slack

DEDUPLICATION INITIALLY GAINED prominence in disk backup appliances almost 10 years ago. A number of improvements have occurred along the way that upgraded its overall performance and effectiveness in backup systems as well as in primary storage. Global deduplication, developed as a method to increase efficiency and enable greater scalability, has been integrated into products in a number of categories.

Briefly, the dedupe process parses an input data stream into (typically) sub-file-sized blocks, runs a hashing algorithm on them (somewhat like a checksum) and creates a unique identifier for each. These hash keys are


then stored in an index or hash table, which is used to compare subsequent data blocks and determine which blocks are duplicates. When a duplicate is encountered, a pointer to the existing block is created instead of storing the block a second time. This way, only unique blocks and hash keys are stored, and redundant blocks are eliminated, or deduplicated, from the data set.

A dedupe system's effectiveness is a function of its ability to find duplicate blocks, which, in turn, is directly related to the size of the pool of blocks it can store and represent in the hash table. In general, more blocks and a larger hash table mean better deduplication. Also, dedupe systems need to scale as storage growth continues, without impacting performance, further driving the need for a larger pool of data blocks to support the dedupe process. Global deduplication is the way many dedupe vendors are addressing this requirement.

But exactly how to make that pool larger depends on the implementation (where the dedupe engine sits) and the architecture of the storage system it's connected to. This has led to dedupe systems sharing hash keys in an effort to expand the number of blocks compared by different dedupe engines. It's also led to an expansion of block pools to support greatly scaled, clustered storage systems, and a method for sharing the correspondingly large hash tables resident on multiple storage modules.

DEFINING GLOBAL

The term "global" doesn't refer to a consistent process or architecture when applied to deduplication. It's most commonly used as a relative term, meant to differentiate a process that makes use of an expanded or shared index (global dedupe) from one that has a single index (local dedupe). When implemented in backup software (most enterprise backup software applications do have some kind of global deduplication functionality), this index or hash table is shared among individual dedupe processing engines as a method for improving dedupe efficiency and reducing data handling. In hardware, it's more typically a method for scaling the dedupe system by sharing a larger pool of common blocks and a larger hash table among multiple controllers or storage modules; EMC Corp. Data Domain takes this approach in global deduplication hardware.


Some manufacturers have called their systems "global" when all they've done is connect independent modules together with replication software, without actually sharing any blocks or an index, so these systems don't actually apply dedupe across the group of independent modules. While they certainly have the right to promote their products as they see fit, care must be taken to understand the functionality they claim makes them global.

GLOBAL DEDUPLICATION IN SOFTWARE

In backup software, source-side dedupe runs on the client, or application, server, where data blocks are hashed and keys are created by the backup client. But each key is compared with a hash table that's stored on the backup server, media agent or dedicated backup storage hardware, not on the client server. Unique blocks and keys are sent to these same devices, and duplicates are referenced by the client. All clients share the same global index and pool of unique blocks, as opposed to earlier client-side "local" dedupe processes that only compared blocks within each client server's backup jobs. In general, source-side dedupe was made significantly more effective when a shared hash table was implemented. As a variant, some applications have the option to run software dedupe on the media agent or backup server instead of the client, a process called target-side dedupe, similar to the way hardware deduplication is implemented.
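The effect of sharing the index is easy to see in a toy comparison (illustrative only; the segment contents and counting are invented for the example). With per-client local indexes, each client dedupes only against its own backups; with one shared global index, segments common across clients, such as a shared OS image, are stored once.

import hashlib

def segments_stored(client_streams, shared_index):
    # Count unique segments written when each client keeps its own index
    # (local dedupe) versus when all clients share one index (global dedupe).
    global_index = set()
    stored = 0
    for client, segments in client_streams.items():
        index = global_index if shared_index else set()
        for seg in segments:
            digest = hashlib.sha256(seg).hexdigest()
            if digest not in index:
                index.add(digest)
                stored += 1
    return stored

# Two clients backing up largely identical OS images plus their own app data.
streams = {"vm1": [b"os-block"] * 3 + [b"app1"], "vm2": [b"os-block"] * 3 + [b"app2"]}
print("local indexes:", segments_stored(streams, shared_index=False))  # 4 segments stored
print("global index :", segments_stored(streams, shared_index=True))   # 3 segments stored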

GLOBAL DEDUPLICATION IN HARDWARE

One of the earliest implementations of dedupe was within a dedicated backup appliance, which connected to the backup server and presented itself as a NAS device or virtual tape library (VTL). It was essentially a large local dedupe system, since the process was performed in a single box. In response to the need to scale, however, some of these target-side dedupe hardware vendors have also come out with their version of global dedupe. These systems essentially combine two separate dedupe processors with an expanded storage capacity and share the hash table between them, enabling them to scale into the low hundreds of terabytes. There are other hardware-based dedupe systems that have leveraged this same shared-controller/shared-hash-table design but expanded it to a multiple-node architecture to support even larger data sets.

Similar to the target-side dedupe appliance, there are backup systems that scale by adding modules or nodes. The dedupe processing is done by a dedicated node or nodes, which share the index across all the nodes storing data blocks. These systems, which are presented as a large NAS or VTL to the backup software, can scale into the petabyte range.

Some clustered storage systems, also called object-based storage, now

    Some clustered storage systems, also called object-based storage, now

    dupe tutorial

    Source-sidededupe

    Restoringeduped data

    oubleshooting

    Globaleduplication

    Sponsor

    resources

    24

  • 8/13/2019 0711 ST eGuide DataDeduplicationBackup v2

    25/26

    STORAGE

    SearchDataBackup.com Essential Guide to Data Deduplication and Backup

offer deduplication but differ from clustered backup appliances. This node-based topology supports extremely large and distributed infrastructures, and its use of data objects instead of files is well-suited for distributed, global deduplication. These systems typically run the hash calculation on objects within each node, compiling a hash index for the node, but share access to it with other nodes. Not specifically designed for backup, they represent one way that dedupe has moved into primary storage.

As a VAR, understanding various vendors' products in a space like deduplication is essential, since VARs are frequently called upon by customers to explain the differences between technologies that use the same identifier, like deduplication. Adding the label "global" to a dedupe system most often indicates that it's more efficient than a similar local system or that it can scale larger without degrading performance. To the extent that scalability is needed, a global data deduplication system would be more desirable than a local dedupe system, provided the cost is appropriate.

Eric Slack, a senior analyst for Storage Switzerland, has more than 20 years of experience in high-technology industries, holding technical management and marketing/sales positions in the computer storage, instrumentation, digital imaging and test equipment fields.


SPONSOR RESOURCES

SAN For Dummies, Chapter 13: Deduplication
Using FalconStor FDS as a Backup Target for Veritas NetBackup
How Data Deduplication Works
Data Deduplication for Dummies
Quantum Targets Disk-Based Backup Price-Performance Leadership with New DXi 2.0
7 Deduplication Questions to Consider Before Deploying for the Midrange Data Center
