A Tale of Two Erasure Codes in HDFS
Mingyuan Xia*, Mohit Saxena+ ,
Mario Blaum+, and David A. Pease+
*McGill University, +IBM Research Almaden
FAST’15
Presenter: He Junquan (何军权), 2015-04-30
Outline
Introduction & Motivation
Design
Evaluation
Conclusions
Related work
Introduction & Motivation
Big Data Storage
Reliability and availability via redundancy:
Replication: 3-way replication
Erasure codes: Reed-Solomon (RS), LRC
Storage overhead timeline:
GFS: 3-way replication, 3x (2003)
FB HDFS: RS, 1.4x (2011)
GFS v2: RS, 1.5x (2012)
Azure: LRC, 1.33x (2012)
FB HDFS: LRC, 1.66x (2013)
Popular Erasure Code Families
Product Code (PC)
Local Reconstruction Code (LRC)
Other: Reed-Solomon (RS), etc.
[Figure: example layouts. LRC: data blocks a0..a11 with global parities G1, G2 and local parities L0..L5. PC: rows a, b with row parities ha, hb and column parities P0..P4, h]
Erasure Code
Facebook HDFS RS(10,4)
Compute 4 parities per 10 data blocks
All blocks are stored on different storage nodes
Storage Overhead: 1.4x
[Figure: RS(10,4) stripe, data blocks D1..D10 plus parity blocks P1..P4]
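As a quick sanity check on these numbers, here is a minimal Python sketch (not from the paper) that computes the storage overhead and recovery cost of an RS(k, m) code:

```python
# Minimal sketch: storage overhead and recovery cost of RS(k, m).
# For RS(10, 4): 14 blocks stored per 10 data blocks, and any lost
# block is rebuilt from k surviving blocks of the same stripe.

def rs_overhead(k: int, m: int) -> float:
    return (k + m) / k

def rs_recovery_cost(k: int) -> int:
    return k

print(rs_overhead(10, 4))    # 1.4 -> the 1.4x overhead on the slide
print(rs_recovery_cost(10))  # 10 blocks read per reconstruction
```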
Erasure Code
High Degraded Read Latency
A read to an unavailable block requires multiple disk reads, network transfers, and compute cycles to decode
[Figure: a client read hits an exception and triggers a degraded read across HDFS nodes]
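To make the cost concrete, here is a toy degraded read in Python, using a single XOR parity instead of RS (an assumption for brevity): the client must fetch every surviving block of the stripe and decode, rather than read one block.

```python
# Toy degraded read with one XOR parity (the real system uses RS).
from functools import reduce

def xor(blocks):
    return reduce(lambda x, y: bytes(a ^ b for a, b in zip(x, y)), blocks)

data = [bytes([i]) * 8 for i in range(4)]   # four equal-sized data blocks
parity = xor(data)

lost = 2                                    # block 2 is unavailable
survivors = [b for i, b in enumerate(data) if i != lost] + [parity]
assert xor(survivors) == data[lost]         # 4 reads + decode, instead of 1 read
```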
Erasure Code
Long Reconstruction Time (Facebook's cluster):
100K blocks lost per day
50 machine-unavailability events per day
Reconstruction traffic: 180TB per day
[Figure: a background reconstruction job rebuilds lost blocks across HDFS nodes]
Erasure Code
Recovery cost: the total number of blocks required to reconstruct a data block after failure
Recovery cost drives both degraded read latency and reconstruction time
Recovery Cost vs. Storage Overhead
Conclusion: storage overhead and recovery cost are a tradeoff for any single erasure code.
[Plot: recovery cost vs. storage overhead for FB HDFS RS, GFS v2 RS, Azure LRC, FB HDFS LRC, and GFS 3-way replication]
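The points behind this plot can be tabulated from the numbers quoted earlier in the talk (the FB HDFS LRC recovery cost of 5 is my reading of the Xorbas design, not stated on this slide):

```python
# (recovery cost, storage overhead) pairs behind the tradeoff plot.
systems = {
    "GFS 3-way Repl":    (1, 3.00),   # a lost block is read from a replica
    "GFS v2 RS(6,3)":    (6, 1.50),
    "FB HDFS RS(10,4)":  (10, 1.40),
    "Azure LRC(12,2,2)": (6, 1.33),
    "FB HDFS LRC":       (5, 1.66),   # assumed from the Xorbas LRC design
}
for name, (cost, oh) in sorted(systems.items(), key=lambda kv: kv[1][1]):
    print(f"{name:18s} recovery cost {cost:2d}, overhead {oh:.2f}x")
```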
How to balance storage overhead against recovery cost?
Data Access Skew
Conclusions: only a small fraction of data is "hot": P(freq > 10) ~= 1%
Most data is "cold": P(freq <= 10) ~= 99%
Data Access Skew
Hot data
High access frequency
A small fraction of data
Cold data
Low access frequency
A major fraction of data
A small improvement in reading hot data yields a large overall read-performance gain
Storing cold data more compactly saves a large amount of storage space
Hot Data: Decrease the Recovery Cost
Cold Data: Increase the Storage Efficiency
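A hedged sketch of this skew, using a synthetic Zipf-like workload rather than the paper's traces (the constants are illustrative only):

```python
# Synthetic Zipf-like read workload: a tiny fraction of files turns hot.
import random
from collections import Counter

random.seed(0)
n_files, n_reads = 10_000, 10_000
weights = [1.0 / (i + 1) for i in range(n_files)]   # Zipf popularity
reads = Counter(random.choices(range(n_files), weights=weights, k=n_reads))

hot = sum(1 for f in range(n_files) if reads[f] > 10)
print(f"P(freq > 10)  ~= {hot / n_files:.2%}")              # roughly 1%
print(f"P(freq <= 10) ~= {(n_files - hot) / n_files:.2%}")  # roughly 99%
```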
HACFS
System State
Tracks file state: file size, last mTime, read count, and coding state
Adaptive Coding
Tracks system state and chooses the coding scheme based on read count and mTime
Erasure Coding
Provides four coding interfaces: Encode/Decode and Upcode/Downcode
Erasure Coding Algorithms
Two different erasure codes
Fast code:
Encodes the frequently accessed blocks to reduce read latency and reconstruction time
Provides overall low recovery cost
Compact code:
Encodes the less frequently accessed blocks for low storage overhead
Maintains a low and bounded system-wide storage overhead
State Transition
[State transition diagram] Recently created files are 3-way replicated. When a file turns write-cold, HACFS erasure-codes it, and files keep moving between the two codes as conditions change:
COND (read hot and storage bounded): encode with the Fast Code
COND' (read cold or storage not bounded): encode with the Compact Code
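A minimal sketch of these transitions in Python (the function and state names are mine, not the system's API):

```python
# HACFS state transitions, per the diagram above.
def next_state(state: str, read_hot: bool, write_cold: bool,
               within_bound: bool) -> str:
    if state == "replicated" and not write_cold:
        return "replicated"              # recently created, still being written
    if read_hot and within_bound:        # COND: read hot and bounded
        return "fast"
    return "compact"                     # COND': read cold or not bounded

print(next_state("replicated", read_hot=True, write_cold=True, within_bound=True))  # fast
print(next_state("fast", read_hot=False, write_cold=True, within_bound=True))       # compact
```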
Fast and Compact Product Codes

Fast Code (Product Code 2x5)
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
Pa0 Pa1 Pa2 Pa3 Pa4 Pha
Storage overhead: 1.8x
Recovery cost: 2

Compact Code (Product Code 6x5)
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
b0 b1 b2 b3 b4 hb1
b5 b6 b7 b8 b9 hb2
c0 c1 c2 c3 c4 hc1
c5 c6 c7 c8 c9 hc2
P0 P1 P2 P3 P4 Ph
Storage overhead: 1.4x
Recovery cost: 5

• ha1 = RS(a0,a1,a2,a3,a4)
• Pa0 = XOR(a0,a5)
• P0 = XOR(a0,a5,b0,b5,c0,c5)
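The parity equations above can be checked with a short runnable sketch; RS is replaced by XOR for the row parity so the example stays self-contained (an assumption, the real code uses RS for ha1):

```python
# Product-code parities: row parity (XOR stand-in for RS) and column parity.
from functools import reduce

def xor(*blocks):
    return reduce(lambda x, y: bytes(p ^ q for p, q in zip(x, y)), blocks)

a = [bytes([i]) * 8 for i in range(10)]   # a0..a9
ha1 = xor(*a[0:5])                        # stand-in for RS(a0..a4)
Pa0 = xor(a[0], a[5])                     # Pa0 = XOR(a0, a5), as on the slide

# Recovery cost 2 in the fast code: a0 is rebuilt from its column alone.
assert xor(Pa0, a[5]) == a[0]
```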
Fast and Compact LRC

Fast Code (LRC(12,6,2))
a0 a1 a2 a3 a4 a5 G1
a6 a7 a8 a9 a10 a11 G2
L0 L1 L2 L3 L4 L5
Storage overhead: 20/12 = 1.67x
Recovery cost: 2
{G1,G2} = RS(a0,a1,..,a11)
Li = XOR(ai, ai+6)

Compact Code (LRC(12,2,2))
a0 a1 a2 a3 a4 a5 G1
a6 a7 a8 a9 a10 a11 G2
L0 L1
Storage overhead: 16/12 = 1.33x
Recovery cost: 6
{G1,G2} = RS(a0,a1,..,a11)
Li = RS'(a0, a1, a2, a6, a7, a8)
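The recovery-cost gap between the two LRCs is easy to see in a sketch of the fast code's local parities (Li = XOR(ai, ai+6)); a lost block needs only 2 reads, where a compact-code local group needs 6:

```python
# Fast-code LRC(12,6,2) local parities: Li = XOR(ai, a(i+6)).
def xor(x: bytes, y: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(x, y))

a = [bytes([i]) * 8 for i in range(12)]        # a0..a11
L = [xor(a[i], a[i + 6]) for i in range(6)]    # L0..L5

# a3 lost: read only L3 and a9 -- recovery cost 2.
assert xor(L[3], a[9]) == a[3]
```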
Upcoding for Product Codes
[Diagram: three fast-code stripes PC(2x5) for block groups a, b, and c are merged into one compact-code stripe PC(6x5)]
Fast Code PC(2x5) to Compact Code PC(6x5):
• Parities h carry over unchanged (no recomputation)
• Parities P require no data block transfer: P0 = XOR(Pa0, Pb0, Pc0)
• All parity updates can be done in parallel
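The middle bullet is worth verifying: because XOR is associative, the compact column parity is exactly the XOR of the three fast column parities. A short sketch:

```python
# Upcode check: P0 = XOR(Pa0, Pb0, Pc0), with no data blocks read.
from functools import reduce

def xor(*blocks):
    return reduce(lambda x, y: bytes(p ^ q for p, q in zip(x, y)), blocks)

a0, a5, b0, b5, c0, c5 = (bytes([v]) * 8 for v in range(1, 7))
Pa0, Pb0, Pc0 = xor(a0, a5), xor(b0, b5), xor(c0, c5)

assert xor(Pa0, Pb0, Pc0) == xor(a0, a5, b0, b5, c0, c5)  # parity-only upcode
```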
Downcoding for Product Codes
[Diagram: one compact-code stripe PC(6x5) is split back into three fast-code stripes PC(2x5) for block groups a, b, and c]
Compact Code PC(6x5) to Fast Code PC(2x5):
• Pa0 = XOR(a0,a5), and likewise Pb0 = XOR(b0,b5), recomputed from their data columns
• Pc0 = XOR(P0,Pa0,Pb0), so the c columns need not be read
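And the symmetric check for downcoding: after recomputing Pa0 and Pb0 from their data columns, the third stripe's parity falls out of the old compact parity without reading the c blocks:

```python
# Downcode check: Pc0 = XOR(P0, Pa0, Pb0), so c0 and c5 are never read.
from functools import reduce

def xor(*blocks):
    return reduce(lambda x, y: bytes(p ^ q for p, q in zip(x, y)), blocks)

a0, a5, b0, b5, c0, c5 = (bytes([v]) * 8 for v in range(1, 7))
P0 = xor(a0, a5, b0, b5, c0, c5)        # old compact-code column parity
Pa0, Pb0 = xor(a0, a5), xor(b0, b5)     # recomputed per fast stripe

assert xor(P0, Pa0, Pb0) == xor(c0, c5)  # equals Pc0
```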
Evaluation Platform
CPU: Intel Xeon E5645 24 cores, 2.4GHz
Disk: 7.2K RPM, 6*2TB
Memory: 96GB
Network: 1Gbps NIC
Cluster size: 11 nodes
Workloads: CC (Cloudera Customer) and FB (Facebook)
Evaluation Metrics
Degraded read latency
Foreground read request latency
Reconstruction time
Background recovery for failures
Storage overhead
Degraded Read Latency
With the storage overhead of HACFS-LRC and HACFS-PC bounded to 1.4x and 1.5x:
Production systems: 16-21 seconds
HACFS: 10-14 seconds
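One way to read the bound (my formulation, not necessarily the paper's exact algorithm): the fraction f of data held in the fast code must satisfy fast_oh*f + compact_oh*(1-f) <= bound. A sketch for HACFS-PC:

```python
# Largest fast-code fraction f that keeps the system under a storage bound:
# fast_oh * f + compact_oh * (1 - f) <= bound.
def max_fast_fraction(fast_oh: float, compact_oh: float, bound: float) -> float:
    f = (bound - compact_oh) / (fast_oh - compact_oh)
    return max(0.0, min(1.0, f))

# HACFS-PC: fast PC(2x5) at 1.8x, compact PC(6x5) at 1.4x, bound 1.5x.
print(max_fast_fraction(1.8, 1.4, 1.5))   # 0.25 -> up to 25% of data in fast code
```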
Reconstruction Time
For a failed disk with 100GB of data, HACFS-PC takes about 10-35 minutes less than the production systems.
HACFS-LRC is worse than RS(6,3) in GFS v2: to reconstruct global parities, HACFS-LRC needs to read 12 blocks, while GFS v2 reads only 6.
System Comparison
Colossus FS: RS(6,3), 1.5x
HDFS-RAID: RS(10,4), 1.4x
Azure: LRC(12,2,2), 1.33x
HACFS-PC: PC(2x5), 1.8x / PC(6x5), 1.4x
HACFS-LRC: LRC(12,6,2), 1.67x / LRC(12,2,2), 1.33x

Recovery cost by lost block type:
lost block type | HACFS-PC         | HACFS-LRC          | Colossus FS | HDFS-RAID | Azure
data block      | fast: 2, comp: 5 | fast: 2, comp: 6   | 6           | 10        | 6
global parity   | fast: 5, comp: 6 | fast: 12, comp: 12 | 6           | 10        | 12
Conclusions
Erasure codes save a large amount of storage space compared to replication.
Production systems using a single erasure code cannot balance the tradeoff between recovery cost and storage overhead well.
HACFS, by adapting dynamically between two codes, provides both low recovery cost and low storage overhead.
Related Work
f4 (OSDI'14)
Divides data into hot and cold by age
XOR-based Erasure Codes (FAST'12)
Combine RS with XOR
Minimum-Storage Regeneration (MSR)
Minimizes network transfers during reconstruction
Product-Matrix Reconstruct-By-Transfer (PM-RBT, FAST'15)
Optimal in terms of I/O, storage, and network bandwidth
Thank You!
Acknowledgment
Prof. Xiong
Zigang Zhang
Biao Ma
CAS ICT Storage System Group