

A Tale of Two Erasure Codes in HDFS


Mingyuan Xia*, Mohit Saxena+, Mario Blaum+, and David A. Pease+

*McGill University, +IBM Research Almaden

FAST’15

何军权 2015-04-30


Outline

Introduction & Motivation

Design

Evaluation

Conclusions

Related work


Introduction & Motivation


Big Data Storage

Reliability and Availability

Replication: 3-way replication

Erasure Code: Reed-Solomon(RS), LRC


GFS: 3-way replication, 3x, 2003

FB HDFS: RS, 1.4x, 2011

GFS v2: RS, 1.5x, 2012

Azure: LRC, 1.33x, 2012

FB HDFS: LRC, 1.66x, 2013


Popular Erasure Code Families

Reed-Solomon (RS)

Product Code (PC)

Local Reconstruction Code (LRC)

Other

PC layout:
a0 a1 a2 a3 a4 ha
b0 b1 b2 b3 b4 hb
P0 P1 P2 P3 P4 h

LRC layout:
a0 a1 a2 a3 a4 a5 G1
a6 a7 a8 a9 a10 a11 G2
L0 L1 L2 L3 L4 L5


Erasure Code

Facebook HDFS RS(10,4)

Compute 4 parities per 10 data blocks

All blocks are stored on different storage nodes

Storage Overhead: 1.4x

D1 D2 D3 D4 D5
D6 D7 D8 D9 D10
P1 P2 P3 P4
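The 1.4x storage overhead follows directly from the block counts: 14 stored blocks for every 10 data blocks. A minimal Python sketch of that arithmetic follows. The function name `rs_storage_overhead` is illustrative and not part of HDFS.

```python
from fractions import Fraction

def rs_storage_overhead(data_blocks: int, parity_blocks: int) -> Fraction:
    """Stored blocks per data block for an RS(data_blocks, parity_blocks) stripe."""
    return Fraction(data_blocks + parity_blocks, data_blocks)

# RS(10,4): 14 stored blocks for every 10 data blocks
print(float(rs_storage_overhead(10, 4)))  # 1.4
# RS(6,3): 9 stored blocks for every 6 data blocks
print(float(rs_storage_overhead(6, 3)))   # 1.5
```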


Erasure Code

High Degraded Read Latency

A read to an unavailable block requires multiple disk reads, network transfers, and compute cycles to decode

[Figure: a client read from HDFS raises an exception]


Erasure Code

Long Reconstruction Time

Facebook's cluster:

100K blocks lost per day

50 machine-unavailability events per day

Reconstruction traffic: 180TB per day

[Figure: reconstruction job running on HDFS]


Erasure Code

Degraded Read Latency

Reconstruction Time

Recovery Cost: the total number of blocks required to reconstruct a data block after a failure


Recovery Cost vs. Storage Overhead

Conclusion: storage overhead and reconstruction cost are a tradeoff within a single erasure code.

[Figure: recovery cost vs. storage overhead for FB HDFS RS, GFS v2 RS, Azure LRC, FB HDFS LRC, and GFS 3-way replication]


How to balance?

Storage Overhead vs. Recovery Cost


Data Access Skew

Conclusions

Only a small fraction of data is "hot": P(freq > 10) ~= 1%

Most data is "cold": P(freq <= 10) ~= 99%


Data Access Skew

Hot data

High access frequency

A small fraction of data

Cold data

Low access frequency

A major fraction of data

A small improvement to reads yields a large gain in read performance

A small reduction in stored data saves a large amount of storage space

Hot Data: Decrease the Recovery Cost

Cold Data: High Storage Efficiency


HACFS

System State

Tracks file states

File size, last mTime

Read count and coding state

Adaptive Coding

Tracks system states

Chooses the coding scheme based on read count and mTime

Erasure Coding

Provides four coding interfaces:

Encode/Decode

Upcode/Downcode


Erasure Coding Algorithms

Two different erasure codes

Fast code:

Encodes the frequently accessed blocks to reduce read latency and reconstruction time

Provides overall low recovery cost

Compact code:

Encodes the less frequently accessed blocks to get low storage overhead

Maintains a low and bounded storage overhead


State Transition

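States in HACFS: 3-way replication (recently created files), Fast Code, Compact Code

Transitions:
3-way replication -> Fast Code: write cold and COND
3-way replication -> Compact Code: write cold and COND'
Fast Code -> Compact Code: COND'
Compact Code -> Fast Code: COND

COND : Read Hot and Bounded

COND': Read Cold or Not Bounded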
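The transition rules can be sketched as a small state function. This is a minimal Python sketch based on one reading of the diagram, not the HACFS implementation. The names `Coding` and `next_state` are hypothetical, and how a file is classified as write cold, read hot, or bounded is left to the caller.

```python
from enum import Enum

class Coding(Enum):
    REPLICATED = "3-way replication"
    FAST = "fast code"
    COMPACT = "compact code"

def next_state(state: Coding, write_cold: bool, read_hot: bool, bounded: bool) -> Coding:
    """Return the coding state a file should move to (assumed reading of the diagram)."""
    cond = read_hot and bounded  # COND; COND' is its negation
    if state is Coding.REPLICATED and not write_cold:
        # Recently created files stay replicated until they become write cold.
        return Coding.REPLICATED
    # Encoded files, or replicated files that turned write cold:
    # COND selects the fast code, COND' selects the compact code.
    return Coding.FAST if cond else Coding.COMPACT

print(next_state(Coding.REPLICATED, write_cold=True, read_hot=True, bounded=True).value)
print(next_state(Coding.FAST, write_cold=True, read_hot=True, bounded=False).value)
```

The first call moves a write-cold, read-hot file into the fast code. The second moves a fast-coded file to the compact code because the storage bound no longer holds.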


Fast and Compact Product Codes

Fast Code (Product Code 2x5)
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
Pa0 Pa1 Pa2 Pa3 Pa4 Pha

Storage overhead: 1.8x

Recovery Cost: 2

Compact Code (Product Code 6x5)
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
b0 b1 b2 b3 b4 hb1
b5 b6 b7 b8 b9 hb2
c0 c1 c2 c3 c4 hc1
c5 c6 c7 c8 c9 hc2
P0 P1 P2 P3 P4 Ph

Storage overhead: 1.4x

Recovery Cost: 5

• ha1=RS(a0,a1,a2,a3,a4) (row parity, both codes)

• Pa0=XOR(a0,a5) (column parity, fast code)

• P0=XOR(a0,a5,b0,b5,c0,c5) (column parity, compact code)
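The overhead figures and the fast code's recovery cost of 2 can be checked with a short Python sketch. It covers only the XOR column parities, not the RS row parities. The helper names `pc_storage_overhead` and `xor_blocks` are illustrative, and the two-byte blocks stand in for real HDFS blocks.

```python
from fractions import Fraction
from functools import reduce
from operator import xor

def pc_storage_overhead(rows: int, cols: int) -> Fraction:
    """A rows x cols product code adds one parity column (h) and one parity row (P)."""
    return Fraction((rows + 1) * (cols + 1), rows * cols)

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))

print(float(pc_storage_overhead(2, 5)))  # 1.8
print(float(pc_storage_overhead(6, 5)))  # 1.4

# Fast code, column 0: Pa0 = XOR(a0, a5)
a0, a5 = b"\x0f\xf0", b"\x33\x55"
pa0 = xor_blocks(a0, a5)

# If a0 is lost, it is rebuilt from two blocks: a5 and Pa0.
print(xor_blocks(a5, pa0) == a0)  # True
```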


Fast and Compact LRC

Fast Code (LRC(12,6,2))
a0 a1 a2 a3 a4 a5 G1
a6 a7 a8 a9 a10 a11 G2
L0 L1 L2 L3 L4 L5

Storage overhead: 20/12=1.67x

Recovery Cost: 2

{G1,G2}=RS(a0,a1,..,a11)

Li=XOR(ai, ai+6)

Compact Code (LRC(12,2,2))
a0 a1 a2 a3 a4 a5 G1
a6 a7 a8 a9 a10 a11 G2
L0 L1

Storage overhead: 16/12=1.33x

Recovery Cost: 6

{G1,G2}=RS(a0,a1,..,a11)

Li=RS'(a0, a1, a2, a6, a7, a8)
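A minimal Python sketch of the LRC numbers, assuming LRC(k,l,g) means k data blocks, l local parities, and g global parities. The function names are illustrative. `fast_lrc_repair_set` lists the two blocks read to repair a data block under Li=XOR(ai, ai+6).

```python
from fractions import Fraction
from typing import Tuple

def lrc_storage_overhead(k: int, l: int, g: int) -> Fraction:
    """Stored blocks per data block for LRC(k, l, g)."""
    return Fraction(k + l + g, k)

def fast_lrc_repair_set(i: int) -> Tuple[str, str]:
    """Blocks read to repair data block a_i in LRC(12,6,2), where L_j = XOR(a_j, a_{j+6})."""
    if not 0 <= i < 12:
        raise ValueError("data block index must be in 0..11")
    j = i % 6                       # index of the local parity covering a_i
    partner = j + 6 if i < 6 else j  # the other data block in the same local group
    return (f"a{partner}", f"L{j}")

print(round(float(lrc_storage_overhead(12, 6, 2)), 2))  # 1.67
print(round(float(lrc_storage_overhead(12, 2, 2)), 2))  # 1.33
print(fast_lrc_repair_set(0))  # ('a6', 'L0')
print(fast_lrc_repair_set(8))  # ('a2', 'L2')
```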


Upcoding for Product Codes

Fast Code PC(2x5), three stripes:
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
Pa0 Pa1 Pa2 Pa3 Pa4 Pha

b0 b1 b2 b3 b4 hb1
b5 b6 b7 b8 b9 hb2
Pb0 Pb1 Pb2 Pb3 Pb4 Phb

c0 c1 c2 c3 c4 hc1
c5 c6 c7 c8 c9 hc2
Pc0 Pc1 Pc2 Pc3 Pc4 Phc

Compact Code PC(6x5):
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
b0 b1 b2 b3 b4 hb1
b5 b6 b7 b8 b9 hb2
c0 c1 c2 c3 c4 hc1
c5 c6 c7 c8 c9 hc2
P0 P1 P2 P3 P4 Ph

• Parities h require no reconstruction

• Parities P require no data block transfer

• All parity updates can be done in parallel
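Why the P parities need no data block transfer: each compact column parity is the XOR of six data blocks, so it equals the XOR of the three fast-code parities for that column. A minimal Python sketch for column 0 follows, using random 16-byte blocks as stand-ins for HDFS blocks. The helper `xor_blocks` is illustrative.

```python
import os
from functools import reduce
from operator import xor

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))

# Column 0 of three fast-code stripes
a0, a5, b0, b5, c0, c5 = (os.urandom(16) for _ in range(6))
pa0 = xor_blocks(a0, a5)
pb0 = xor_blocks(b0, b5)
pc0 = xor_blocks(c0, c5)

# Upcode: combine the three existing parities; no data block is read.
p0 = xor_blocks(pa0, pb0, pc0)

# Same result as encoding the compact-code parity from scratch.
print(p0 == xor_blocks(a0, a5, b0, b5, c0, c5))  # True
```

Each column is independent of the others, which is consistent with the parity updates running in parallel.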


Downcoding for Product Codes

Compact Code PC(6x5):
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
b0 b1 b2 b3 b4 hb1
b5 b6 b7 b8 b9 hb2
c0 c1 c2 c3 c4 hc1
c5 c6 c7 c8 c9 hc2
P0 P1 P2 P3 P4 Ph

Fast Code PC(2x5), three stripes:
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
Pa0 Pa1 Pa2 Pa3 Pa4 Pha

b0 b1 b2 b3 b4 hb1
b5 b6 b7 b8 b9 hb2
Pb0 Pb1 Pb2 Pb3 Pb4 Phb

c0 c1 c2 c3 c4 hc1
c5 c6 c7 c8 c9 hc2
Pc0 Pc1 Pc2 Pc3 Pc4 Phc

• Pa0=XOR(a0,a5)

• Pc0=XOR(P0,Pa0,Pb0)
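The two formulas mean that the third stripe's parity is obtained without reading its data blocks. A minimal Python sketch for one column follows. The function `downcode_column` is a hypothetical helper, and random 16-byte blocks stand in for HDFS blocks.

```python
import os
from functools import reduce
from operator import xor

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))

def downcode_column(p, a_pair, b_pair):
    """Split one compact-code column parity into three fast-code parities.

    p is the compact parity (e.g. P0); a_pair and b_pair hold the data blocks
    of the first two stripes in that column, e.g. (a0, a5) and (b0, b5).
    """
    pa = xor_blocks(*a_pair)    # Pa0 = XOR(a0, a5)
    pb = xor_blocks(*b_pair)    # Pb0 = XOR(b0, b5)
    pc = xor_blocks(p, pa, pb)  # Pc0 = XOR(P0, Pa0, Pb0); c0 and c5 are not read
    return pa, pb, pc

a0, a5, b0, b5, c0, c5 = (os.urandom(16) for _ in range(6))
p0 = xor_blocks(a0, a5, b0, b5, c0, c5)
pa0, pb0, pc0 = downcode_column(p0, (a0, a5), (b0, b5))
print(pc0 == xor_blocks(c0, c5))  # True
```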


Evaluation Platform

CPU: Intel Xeon E5645 24 cores, 2.4GHz

Disk: 7.2K RPM, 6*2TB

Memory: 96GB

Network: 1Gbps NIC

Cluster size: 11 nodes

Workload

CC: Cloudera Customer

FB: Facebook


Evaluation Metrics

Degraded read latency: foreground read request latency

Reconstruction time: background recovery for failures

Storage overhead


Degraded Read Latency

Production systems: 16-21 seconds

HACFS: 10-14 seconds

The storage overhead of HACFS-LRC and HACFS-PC is bounded to 1.4x and 1.5x, respectively
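One way to read these bounds is as a cap on how much data can sit in the fast code at once. The Python sketch below assumes the overall overhead is the size-weighted average of the fast and compact overheads and ignores files still under 3-way replication. That model and the function name `max_fast_fraction` are assumptions, not taken from the slides.

```python
from fractions import Fraction

def max_fast_fraction(fast: Fraction, compact: Fraction, bound: Fraction) -> Fraction:
    """Largest fraction f of data in the fast code with f*fast + (1-f)*compact <= bound."""
    if bound >= fast:
        return Fraction(1)
    if bound <= compact:
        return Fraction(0)
    return (bound - compact) / (fast - compact)

# HACFS-PC: fast 1.8x, compact 1.4x, bound 1.5x
print(max_fast_fraction(Fraction(18, 10), Fraction(14, 10), Fraction(15, 10)))  # 1/4

# HACFS-LRC: fast 20/12, compact 16/12, bound 1.4x
print(max_fast_fraction(Fraction(20, 12), Fraction(16, 12), Fraction(14, 10)))  # 1/5
```

Under this assumed model, about a quarter of the data could use the fast product code and about a fifth could use the fast LRC before the bound is reached.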


Reconstruction Time

Scenario: a disk holding 100GB of data fails

HACFS-PC takes about 10-35 minutes less than the production systems

HACFS-LRC is worse than RS(6,3) in GFS v2

To reconstruct global parities, HACFS-LRC needs to read 12 blocks, while GFS v2 reads only 6 blocks


System Comparison

Colossus FS: RS(6,3), 1.5x

HDFS-RAID: RS(10,4), 1.4x

Azure: LRC(12,2,2), 1.33x

HACFS-PC: PC(2x5), 1.8x (fast); PC(6x5), 1.4x (compact)

HACFS-LRC: LRC(12,6,2), 1.67x (fast); LRC(12,2,2), 1.33x (compact)

Recovery cost by lost block type:

lost block type   HACFS-PC           HACFS-LRC            Colossus FS   HDFS-RAID   Azure
data block        fast: 2, comp: 5   fast: 2, comp: 6     6             10          6
global parity     fast: 5, comp: 6   fast: 12, comp: 12   6             10          12


Conclusions

Erasure coding saves a large amount of storage space.

Production systems that use a single erasure code cannot balance the tradeoff between recovery cost and storage overhead well.

By using dynamically adaptive coding, HACFS provides both low recovery cost and low storage overhead.


Related Work

f4 (OSDI'14)

Separates hot and cold data by data age.

XOR-based Erasure Code (FAST’12)

Combines RS with XOR.

Minimum-Storage-Regeneration (MSR)

Minimizes network transfers during reconstruction.

Product-Matrix-Reconstruct-By-Transfer (PM-RBT) (FAST’15)

Optimal in terms of I/O, storage, and network bandwidth.


Thank You!


Acknowledgment

Prof. Xiong

Zigang Zhang

Biao Ma

CAS – ICT – Storage System Group