A Tale of Two Erasure Codes in HDFS
Mingyuan Xia*, Mohit Saxena+ ,
Mario Blaum+, and David A. Pease+
*McGill University, +IBM Research Almaden
FAST’15
Presenter: He Junquan (何军权), 2015-04-30
Outline
Introduction & Motivation
Design
Evaluation
Conclusions
Related work
Introduction & Motivation
Big Data Storage
Reliability and availability via redundancy:
Replication: 3-way replication
Erasure codes: Reed-Solomon (RS), LRC
Storage overhead timeline:
GFS: 3-way replication, 3x (2003)
FB HDFS: RS, 1.4x (2011)
GFS v2: RS, 1.5x (2012)
Azure: LRC, 1.33x (2012)
FB HDFS: LRC, 1.66x (2013)
Popular Erasure Code Families
Product Code (PC)
Local Reconstruction Code (LRC)
Other: Reed-Solomon (RS), etc.
[Figure: example layouts. LRC: data blocks a0..a11 with global parities G1, G2 and local parities L0..L5. PC: rows a, b with row parities ha, hb and column parities P0..P4, h]
Erasure Code
Facebook HDFS RS(10,4)
Compute 4 parities per 10 data blocks
All blocks are stored on different storage nodes
Storage Overhead: 1.4x
[Figure: RS(10,4) stripe, data blocks D1..D10 plus parity blocks P1..P4]
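As a quick sanity check on these numbers, here is a minimal Python sketch (not from the paper) that computes the storage overhead and recovery cost of an RS(k, m) code:

```python
# Minimal sketch: storage overhead and recovery cost of RS(k, m).
# For RS(10, 4): 14 blocks stored per 10 data blocks, and any lost
# block is rebuilt from k surviving blocks of the same stripe.

def rs_overhead(k: int, m: int) -> float:
    return (k + m) / k

def rs_recovery_cost(k: int) -> int:
    return k

print(rs_overhead(10, 4))    # 1.4 -> the 1.4x overhead on the slide
print(rs_recovery_cost(10))  # 10 blocks read per reconstruction
```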
Erasure Code
High Degraded Read Latency
A read to an unavailable block requires multiple disk reads, network transfers, and compute cycles to decode
[Figure: a client read hits an exception and triggers a degraded read across HDFS nodes]
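To make the cost concrete, here is a toy degraded read in Python, using a single XOR parity instead of RS (an assumption for brevity): the client must fetch every surviving block of the stripe and decode, rather than read one block.

```python
# Toy degraded read with one XOR parity (the real system uses RS).
from functools import reduce

def xor(blocks):
    return reduce(lambda x, y: bytes(a ^ b for a, b in zip(x, y)), blocks)

data = [bytes([i]) * 8 for i in range(4)]   # four equal-sized data blocks
parity = xor(data)

lost = 2                                    # block 2 is unavailable
survivors = [b for i, b in enumerate(data) if i != lost] + [parity]
assert xor(survivors) == data[lost]         # 4 reads + decode, instead of 1 read
```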
Erasure Code
Long Reconstruction Time (Facebook's cluster):
100K blocks lost per day
50 machine-unavailability events per day
Reconstruction traffic: 180TB per day
[Figure: a background reconstruction job rebuilds lost blocks across HDFS nodes]
Erasure Code
Recovery cost: the total number of blocks required to reconstruct a data block after failure
Recovery cost drives both degraded read latency and reconstruction time
Recovery Cost vs. Storage Overhead
Conclusion: storage overhead and recovery cost are a tradeoff for any single erasure code.
[Plot: recovery cost vs. storage overhead for FB HDFS RS, GFS v2 RS, Azure LRC, FB HDFS LRC, and GFS 3-way replication]
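The points behind this plot can be tabulated from the numbers quoted earlier in the talk (the FB HDFS LRC recovery cost of 5 is my reading of the Xorbas design, not stated on this slide):

```python
# (recovery cost, storage overhead) pairs behind the tradeoff plot.
systems = {
    "GFS 3-way Repl":    (1, 3.00),   # a lost block is read from a replica
    "GFS v2 RS(6,3)":    (6, 1.50),
    "FB HDFS RS(10,4)":  (10, 1.40),
    "Azure LRC(12,2,2)": (6, 1.33),
    "FB HDFS LRC":       (5, 1.66),   # assumed from the Xorbas LRC design
}
for name, (cost, oh) in sorted(systems.items(), key=lambda kv: kv[1][1]):
    print(f"{name:18s} recovery cost {cost:2d}, overhead {oh:.2f}x")
```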
How to balance storage overhead against recovery cost?
Data Access Skew
Conclusions: only a small fraction of data is "hot": P(freq > 10) ~= 1%
Most data is "cold": P(freq <= 10) ~= 99%
Data Access Skew
Hot data
High access frequency
A small fraction of data
Cold data
Low access frequency
A major fraction of data
A small improvement in reading hot data yields a large overall read-performance gain
Storing cold data more compactly saves a large amount of storage space
Hot Data: Decrease the Recovery Cost
Cold Data: Increase the Storage Efficiency
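A hedged sketch of this skew, using a synthetic Zipf-like workload rather than the paper's traces (the constants are illustrative only):

```python
# Synthetic Zipf-like read workload: a tiny fraction of files turns hot.
import random
from collections import Counter

random.seed(0)
n_files, n_reads = 10_000, 10_000
weights = [1.0 / (i + 1) for i in range(n_files)]   # Zipf popularity
reads = Counter(random.choices(range(n_files), weights=weights, k=n_reads))

hot = sum(1 for f in range(n_files) if reads[f] > 10)
print(f"P(freq > 10)  ~= {hot / n_files:.2%}")              # roughly 1%
print(f"P(freq <= 10) ~= {(n_files - hot) / n_files:.2%}")  # roughly 99%
```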
HACFS
System State
Tracks file state: file size, last mTime, read count, and coding state
Adaptive Coding
Tracks system state and chooses the coding scheme based on read count and mTime
Erasure Coding
Provides four coding interfaces: Encode/Decode and Upcode/Downcode
Erasure Coding Algorithms
Two different erasure codes
Fast code:
Encodes the frequently accessed blocks to reduce read latency and reconstruction time
Provides overall low recovery cost
Compact code:
Encodes the less frequently accessed blocks for low storage overhead
Maintains a low and bounded system-wide storage overhead
State Transition
[State transition diagram] Recently created files are 3-way replicated. When a file turns write-cold, HACFS erasure-codes it, and files keep moving between the two codes as conditions change:
COND (read hot and storage bounded): encode with the Fast Code
COND' (read cold or storage not bounded): encode with the Compact Code
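A minimal sketch of these transitions in Python (the function and state names are mine, not the system's API):

```python
# HACFS state transitions, per the diagram above.
def next_state(state: str, read_hot: bool, write_cold: bool,
               within_bound: bool) -> str:
    if state == "replicated" and not write_cold:
        return "replicated"              # recently created, still being written
    if read_hot and within_bound:        # COND: read hot and bounded
        return "fast"
    return "compact"                     # COND': read cold or not bounded

print(next_state("replicated", read_hot=True, write_cold=True, within_bound=True))  # fast
print(next_state("fast", read_hot=False, write_cold=True, within_bound=True))       # compact
```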
Fast and Compact Product Codes

Fast Code (Product Code 2x5)
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
Pa0 Pa1 Pa2 Pa3 Pa4 Pha
Storage overhead: 1.8x
Recovery cost: 2

Compact Code (Product Code 6x5)
a0 a1 a2 a3 a4 ha1
a5 a6 a7 a8 a9 ha2
b0 b1 b2 b3 b4 hb1
b5 b6 b7 b8 b9 hb2
c0 c1 c2 c3 c4 hc1
c5 c6 c7 c8 c9 hc2
P0 P1 P2 P3 P4 Ph
Storage overhead: 1.4x
Recovery cost: 5

• ha1 = RS(a0,a1,a2,a3,a4)
• Pa0 = XOR(a0,a5)
• P0 = XOR(a0,a5,b0,b5,c0,c5)
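The parity equations above can be checked with a short runnable sketch; RS is replaced by XOR for the row parity so the example stays self-contained (an assumption, the real code uses RS for ha1):

```python
# Product-code parities: row parity (XOR stand-in for RS) and column parity.
from functools import reduce

def xor(*blocks):
    return reduce(lambda x, y: bytes(p ^ q for p, q in zip(x, y)), blocks)

a = [bytes([i]) * 8 for i in range(10)]   # a0..a9
ha1 = xor(*a[0:5])                        # stand-in for RS(a0..a4)
Pa0 = xor(a[0], a[5])                     # Pa0 = XOR(a0, a5), as on the slide

# Recovery cost 2 in the fast code: a0 is rebuilt from its column alone.
assert xor(Pa0, a[5]) == a[0]
```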
Fast and Compact LRC

Fast Code (LRC(12,6,2))
a0 a1 a2 a3 a4 a5 G1
a6 a7 a8 a9 a10 a11 G2
L0 L1 L2 L3 L4 L5
Storage overhead: 20/12 = 1.67x
Recovery cost: 2
{G1,G2} = RS(a0,a1,..,a11)
Li = XOR(ai, ai+6)

Compact Code (LRC(12,2,2))
a0 a1 a2 a3 a4 a5 G1
a6 a7 a8 a9 a10 a11 G2
L0 L1
Storage overhead: 16/12 = 1.33x
Recovery cost: 6
{G1,G2} = RS(a0,a1,..,a11)
Li = RS'(a0, a1, a2, a6, a7, a8)
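The recovery-cost gap between the two LRCs is easy to see in a sketch of the fast code's local parities (Li = XOR(ai, ai+6)); a lost block needs only 2 reads, where a compact-code local group needs 6:

```python
# Fast-code LRC(12,6,2) local parities: Li = XOR(ai, a(i+6)).
def xor(x: bytes, y: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(x, y))

a = [bytes([i]) * 8 for i in range(12)]        # a0..a11
L = [xor(a[i], a[i + 6]) for i in range(6)]    # L0..L5

# a3 lost: read only L3 and a9 -- recovery cost 2.
assert xor(L[3], a[9]) == a[3]
```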
Upcoding for Product Codes
[Diagram: three fast-code stripes PC(2x5) for block groups a, b, and c are merged into one compact-code stripe PC(6x5)]
Fast Code PC(2x5) to Compact Code PC(6x5):
• Parities h carry over unchanged (no recomputation)
• Parities P require no data block transfer: P0 = XOR(Pa0, Pb0, Pc0)
• All parity updates can be done in parallel
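The middle bullet is worth verifying: because XOR is associative, the compact column parity is exactly the XOR of the three fast column parities. A short sketch:

```python
# Upcode check: P0 = XOR(Pa0, Pb0, Pc0), with no data blocks read.
from functools import reduce

def xor(*blocks):
    return reduce(lambda x, y: bytes(p ^ q for p, q in zip(x, y)), blocks)

a0, a5, b0, b5, c0, c5 = (bytes([v]) * 8 for v in range(1, 7))
Pa0, Pb0, Pc0 = xor(a0, a5), xor(b0, b5), xor(c0, c5)

assert xor(Pa0, Pb0, Pc0) == xor(a0, a5, b0, b5, c0, c5)  # parity-only upcode
```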
Downcoding for Product Codes
[Diagram: one compact-code stripe PC(6x5) is split back into three fast-code stripes PC(2x5) for block groups a, b, and c]
Compact Code PC(6x5) to Fast Code PC(2x5):
• Pa0 = XOR(a0,a5), and likewise Pb0 = XOR(b0,b5), recomputed from their data columns
• Pc0 = XOR(P0,Pa0,Pb0), so the c columns need not be read
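And the symmetric check for downcoding: after recomputing Pa0 and Pb0 from their data columns, the third stripe's parity falls out of the old compact parity without reading the c blocks:

```python
# Downcode check: Pc0 = XOR(P0, Pa0, Pb0), so c0 and c5 are never read.
from functools import reduce

def xor(*blocks):
    return reduce(lambda x, y: bytes(p ^ q for p, q in zip(x, y)), blocks)

a0, a5, b0, b5, c0, c5 = (bytes([v]) * 8 for v in range(1, 7))
P0 = xor(a0, a5, b0, b5, c0, c5)        # old compact-code column parity
Pa0, Pb0 = xor(a0, a5), xor(b0, b5)     # recomputed per fast stripe

assert xor(P0, Pa0, Pb0) == xor(c0, c5)  # equals Pc0
```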
Evaluation Platform
CPU: Intel Xeon E5645 24 cores, 2.4GHz
Disk: 7.2K RPM, 6*2TB
Memory: 96GB
Network: 1Gbps NIC
Cluster size: 11 nodes
Workloads: CC (Cloudera Customer) and FB (Facebook)
Evaluation Metrics
Degraded read latency
Foreground read request latency
Reconstruction time
Background recovery for failures
Storage overhead
Degraded Read Latency
With the storage overhead of HACFS-LRC and HACFS-PC bounded to 1.4x and 1.5x:
Production systems: 16-21 seconds
HACFS: 10-14 seconds
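One way to read the bound (my formulation, not necessarily the paper's exact algorithm): the fraction f of data held in the fast code must satisfy fast_oh*f + compact_oh*(1-f) <= bound. A sketch for HACFS-PC:

```python
# Largest fast-code fraction f that keeps the system under a storage bound:
# fast_oh * f + compact_oh * (1 - f) <= bound.
def max_fast_fraction(fast_oh: float, compact_oh: float, bound: float) -> float:
    f = (bound - compact_oh) / (fast_oh - compact_oh)
    return max(0.0, min(1.0, f))

# HACFS-PC: fast PC(2x5) at 1.8x, compact PC(6x5) at 1.4x, bound 1.5x.
print(max_fast_fraction(1.8, 1.4, 1.5))   # 0.25 -> up to 25% of data in fast code
```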
Reconstruction Time
For a failed disk with 100GB of data, HACFS-PC takes about 10-35 minutes less than the production systems.
HACFS-LRC is worse than RS(6,3) in GFS v2: to reconstruct global parities, HACFS-LRC needs to read 12 blocks, while GFS v2 reads only 6.
System Comparison
Colossus FS: RS(6,3), 1.5x
HDFS-RAID: RS(10,4), 1.4x
Azure: LRC(12,2,2), 1.33x
HACFS-PC: PC(2x5), 1.8x / PC(6x5), 1.4x
HACFS-LRC: LRC(12,6,2), 1.67x / LRC(12,2,2), 1.33x

Recovery cost by lost block type:
lost block type | HACFS-PC         | HACFS-LRC          | Colossus FS | HDFS-RAID | Azure
data block      | fast: 2, comp: 5 | fast: 2, comp: 6   | 6           | 10        | 6
global parity   | fast: 5, comp: 6 | fast: 12, comp: 12 | 6           | 10        | 12
Conclusions
Erasure codes save a large amount of storage space compared to replication.
Production systems using a single erasure code cannot balance the tradeoff between recovery cost and storage overhead well.
HACFS, by adapting dynamically between two codes, provides both low recovery cost and low storage overhead.
Related Work
f4 (OSDI'14)
Divides data into hot and cold by age
XOR-based Erasure Codes (FAST'12)
Combine RS with XOR
Minimum-Storage Regeneration (MSR)
Minimizes network transfers during reconstruction
Product-Matrix Reconstruct-By-Transfer (PM-RBT, FAST'15)
Optimal in terms of I/O, storage, and network bandwidth
Thank You!
Acknowledgment
Prof. Xiong
Zigang Zhang
Biao Ma
CAS ICT Storage System Group