
Page 1

Acceleration for Big Data, Hadoop and

Memcached

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

A Presentation at HPC Advisory Council Workshop, Lugano 2012


Page 2

• MPI is a dominant programming model for HPC Systems

• Introduced some of the MPI Features and their Usage

• Introduced MVAPICH2 stack

• Illustrated many performance optimizations and tuning techniques for

MVAPICH2

• Provided an overview of MPI-3 Features

• Introduced challenges in designing MPI for Exascale systems

• Presented approaches being taken by MVAPICH2 for Exascale systems

2

Recap of the Last Two Days' Presentations

HPC Advisory Council, Lugano Switzerland '12

Page 3

HPC Advisory Council, Lugano Switzerland '12 3

High-Performance Networks in the Top500

Percentage share of InfiniBand is steadily increasing

Page 4

• The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems

• Message Passing Interface (MPI)

• Parallel File Systems

• Almost 11.5 years of research and development since InfiniBand was introduced in October 2000

• Other Programming Models are emerging to take

advantage of High-Performance Networks

– UPC

– SHMEM

4

Use of High-Performance Networks for Scientific Computing

HPC Advisory Council, Lugano Switzerland '12

Page 5

HPC Advisory Council, Lugano Switzerland '12 5

One-way Latency: MPI over IB

[Figure: One-way MPI latency over IB. Left panel: Small Message Latency, latency (us) vs. message size (bytes), with small-message latencies of 1.66, 1.56, 1.64, 1.82 and 0.81 us across the five configurations. Right panel: Large Message Latency. Legend: MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR. Platforms: DDR and QDR on 2.4 GHz quad-core (Westmere) Intel with PCIe Gen2 and an IB switch; FDR on 2.6 GHz octa-core (Sandy Bridge) Intel with PCIe Gen3, without an IB switch.]

Page 6

HPC Advisory Council, Lugano Switzerland '12 6

Bandwidth: MPI over IB

[Figure: MPI bandwidth over IB. Left panel: Unidirectional Bandwidth, bandwidth (MBytes/sec) vs. message size (bytes), with peak values of 3280, 3385, 1917, 1706 and 6333 MBytes/sec across the five configurations. Right panel: Bidirectional Bandwidth, with peak values of 3341, 3704, 4407, 11043 and 6521 MBytes/sec. Legend: MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR. Platforms: DDR and QDR on 2.4 GHz quad-core (Westmere) Intel with PCIe Gen2 and an IB switch; FDR on 2.6 GHz octa-core (Sandy Bridge) Intel with PCIe Gen3, without an IB switch.]

Page 7

• 209 IB Clusters (41.8%) in the November‘11 Top500 list

(http://www.top500.org)

• Installations in the Top 30 (13 systems):

HPC Advisory Council, Lugano Switzerland '12

Large-scale InfiniBand Installations

120,640 cores (Nebulae) in China (4th)

73,278 cores (Tsubame-2.0) in Japan (5th)

111,104 cores (Pleiades) at NASA Ames (7th)

138,368 cores (Tera-100) in France (9th)

122,400 cores (RoadRunner) at LANL (10th)

137,200 cores (Sunway Blue Light) in China (14th)

46,208 cores (Zin) at LLNL (15th)

33,072 cores (Lomonosov) in Russia (18th)

29,440 cores (Mole-8.5) in China (21st)

42,440 cores (Red Sky) at Sandia (24th)

62,976 cores (Ranger) at TACC (25th)

20,480 cores (Bull Benchmarks) in France (27th)

20,480 cores (Helios) in Japan (28th)

More are getting installed!

7

Page 8

• Focuses on big data and data analytics

• Multiple environments and middleware are gaining

momentum

– Hadoop (HDFS, HBase and MapReduce)

– Memcached

HPC Advisory Council, Lugano Switzerland '12 8

Enterprise/Commercial Computing

Page 9

Can High-Performance Interconnects Benefit Enterprise Computing?

• Most of the current enterprise systems use 1GE

• Concerns for performance and scalability

• Usage of High-Performance Networks is beginning to draw interest

– Oracle, IBM, Google are working along these directions

• What are the challenges?

• Where do the bottlenecks lie?

• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?

9 HPC Advisory Council, Lugano Switzerland '12

Page 10

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies

– Memcached

– HBase

– HDFS

• Conclusion and Q&A

Presentation Outline

10 HPC Advisory Council, Lugano Switzerland '12

Page 11

Memcached Architecture

• Integral part of Web 2.0 architecture

• Distributed Caching Layer

– Allows spare memory from multiple nodes to be aggregated

– General purpose

• Typically used to cache database queries, results of API calls

• Scalable model, but typical usage is very network intensive (a minimal client sketch follows below)

[Figure: Memcached deployment: Internet-facing web frontend servers (Memcached clients) connect over high-performance networks to Memcached servers and to database servers; each server node provides CPUs, main memory, SSD and HDD.]

HPC Advisory Council, Lugano Switzerland '12
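To make the caching pattern above concrete, the following is a minimal cache-aside sketch, written here in Java against the spymemcached client purely for illustration (the experiments later in this talk use the C libmemcached client). The server host, key and loadFromDatabase helper are hypothetical.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CacheAsideExample {
    public static void main(String[] args) throws Exception {
        // Connect to a single Memcached server (hypothetical host and default port).
        MemcachedClient cache =
            new MemcachedClient(new InetSocketAddress("memcached-server-1", 11211));

        String key = "user:42:profile";

        // Cache-aside read: try the distributed caching layer first.
        Object value = cache.get(key);
        if (value == null) {
            // Miss: fall back to the database, then populate the cache.
            value = loadFromDatabase(key);   // hypothetical helper, stands in for a DB query
            cache.set(key, 3600, value);     // expire after one hour
        }

        System.out.println("value = " + value);
        cache.shutdown();
    }

    private static Object loadFromDatabase(String key) {
        return "row-for-" + key;             // placeholder for a real database lookup
    }
}
```

Every such get/set is a small network round trip, which is why the aggregate workload becomes network intensive at scale.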

Page 12

Hadoop Architecture

• Underlying Hadoop Distributed File System (HDFS)

• Fault-tolerance by replicating data blocks

• NameNode: stores information on data blocks

• DataNodes: store blocks and host MapReduce computation

• JobTracker: tracks jobs and detects failures

• The model scales, but intermediate phases involve a high amount of communication (a minimal HDFS write sketch follows below)

12 HPC Advisory Council, Lugano Switzerland '12
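To illustrate the write path sketched above (the client asks the NameNode where to place blocks, then streams data to DataNodes, which replicate each block), here is a minimal, hedged example using the standard Hadoop FileSystem API; the output path and cluster configuration are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster's fs.default.name / fs.defaultFS from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Create a file: the NameNode chooses target DataNodes, the client streams
        // blocks to them, and the DataNodes replicate each block for fault tolerance.
        Path path = new Path("/tmp/hdfs-write-example.dat");   // hypothetical path
        FSDataOutputStream out = fs.create(path, true /* overwrite */);
        byte[] buffer = new byte[64 * 1024];
        for (int i = 0; i < 1024; i++) {   // roughly 64 MB in total
            out.write(buffer);
        }
        out.close();   // flushes and finalizes the replication pipeline

        System.out.println("Wrote " + fs.getFileStatus(path).getLen() + " bytes");
        fs.close();
    }
}
```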

Page 13

Network-Level Interaction Between Clients and Data Nodes in HDFS

[Figure: HDFS clients communicate with HDFS DataNodes (each backed by HDD/SSD) over high-performance networks.]

HPC Advisory Council, Lugano Switzerland '12

Page 14

Overview of HBase Architecture

• An open-source database project built on the Hadoop framework for hosting very large tables

• Major components: HBaseMaster, HRegionServer and HBaseClient

• HBase and HDFS are deployed in the same cluster for better data locality (a minimal client sketch follows below)

14 HPC Advisory Council, Lugano Switzerland '12
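As a concrete illustration of how an HBaseClient interacts with an HRegionServer, here is a minimal put/get sketch against the classic HBase 0.90-era client API (the version used in the experiments below); the table name, column family and qualifier are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");   // hypothetical table with family "cf"

        // Put: a 20-byte key and a 1 KB value (mirroring the sizes used in the experiments).
        byte[] row = Bytes.toBytes("user0000000000000001");
        Put put = new Put(row);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("field0"), new byte[1024]);
        table.put(put);

        // Get: fetch the value back from the serving region server.
        Result result = table.get(new Get(row));
        byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field0"));
        System.out.println("value length = " + value.length);

        table.close();
    }
}
```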

Page 15

Network-Level Interaction Between HBase Clients, Region Servers and Data Nodes

[Figure: HBase clients communicate with HRegionServers over high-performance networks, and the HRegionServers in turn communicate with HDFS DataNodes (HDD/SSD) over high-performance networks.]

HPC Advisory Council, Lugano Switzerland '12

Page 16

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies

– Memcached

– HBase

– HDFS

• Conclusion and Q&A

Presentation Outline

16 HPC Advisory Council, Lugano Switzerland '12

Page 17

Designing Communication and I/O Libraries for Enterprise Systems: Challenges

HPC Advisory Council, Lugano Switzerland '12 17

[Figure: Layered view of the design challenges. Applications run over datacenter middleware (HDFS, HBase, MapReduce, Memcached) and programming models (sockets), which sit on a communication and I/O library covering point-to-point communication, threading models and synchronization, QoS, fault tolerance, and I/O and file systems. Underneath are commodity computing system architectures (single, dual, quad, ..), multi-/many-core architectures and accelerators, networking technologies (InfiniBand, 1/10/40 GigE, RNICs and intelligent NICs), and storage technologies (HDD or SSD).]

Page 18

Common Protocols using OpenFabrics

18

[Figure: Protocol options between an application and the network. Sockets-based paths: kernel-space TCP/IP over an Ethernet driver and adapter (1/10/40 GigE), TCP/IP with hardware offload on 10/40 GigE TOE adapters, IPoIB over InfiniBand adapters and switches, and user-space SDP with RDMA over InfiniBand. Verbs/RDMA-based paths: native IB verbs over InfiniBand adapters and switches, RoCE over Ethernet switches, and iWARP over iWARP adapters and Ethernet switches.]

HPC Advisory Council, Lugano Switzerland '12

Page 19

Can New Data Analysis and Management Systems be Designed with High-Performance Networks and Protocols?

19

[Figure: Current design: Application → sockets → 1/10 GigE network. Enhanced designs: Application → accelerated sockets (verbs / hardware offload) → 10 GigE or InfiniBand.]

• Sockets were not designed for high performance

– Stream semantics are often a mismatch for upper layers (Memcached, HBase, Hadoop)

– Zero-copy is not available for non-blocking sockets

[Figure: Our approach: Application → OSU design → verbs interface → 10 GigE or InfiniBand.]

HPC Advisory Council, Lugano Switzerland '12

Page 20

Interplay between Storage and Interconnect/Protocols

• Most of the current generation enterprise systems use the

traditional hard disks

• Since hard disks are slow, high-performance communication protocols may have limited impact

• SSDs and other storage technologies are emerging

• Does it change the landscape?

20 HPC Advisory Council, Lugano Switzerland '12

Page 21

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies

– Memcached

– HBase

– HDFS

• Conclusion and Q&A

Presentation Outline

21 HPC Advisory Council, Lugano Switzerland '12

Page 22

Memcached Design Using Verbs

22

• Server and client perform a negotiation protocol

– Master thread assigns clients to appropriate worker thread

• Once a client is assigned a verbs worker thread, it communicates directly with that thread and is “bound” to it; each verbs worker thread can support multiple clients

• All other Memcached data structures are shared among RDMA and sockets worker threads

• Memcached applications need not be modified; the verbs interface is used if available

• The Memcached server can serve both socket and verbs clients simultaneously (a sketch of the dispatch pattern follows below)

[Figure: Memcached server design: a master thread accepts sockets clients and RDMA clients and assigns them to sockets worker threads or verbs worker threads; all worker threads operate on shared data (memory slabs and items).]

HPC Advisory Council, Lugano Switzerland '12
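The following is a minimal sketch of the master/worker dispatch pattern described above, written in Java only for illustration (Memcached itself is C, and this is not the OSU implementation): a master thread accepts connections and hands each client to a pool of worker threads. All names are hypothetical.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MasterWorkerDispatch {
    public static void main(String[] args) throws IOException {
        // Fixed pool of worker threads; the master hands each accepted client to a worker.
        // (The real design binds a client to a sockets or verbs worker thread that
        // multiplexes many clients and operates on shared slabs/items.)
        ExecutorService workers = Executors.newFixedThreadPool(4);

        try (ServerSocket listener = new ServerSocket(11211)) {
            while (true) {
                // Master thread: accept a new client and assign it to a worker thread.
                Socket client = listener.accept();
                workers.submit(() -> serveClient(client));
            }
        }
    }

    // Worker-side handling, kept trivial for the sketch.
    private static void serveClient(Socket client) {
        try (Socket c = client) {
            c.getOutputStream().write("SERVER_READY\r\n".getBytes());
        } catch (IOException ignored) {
        }
    }
}
```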

Page 23

Experimental Setup

• Hardware

– Intel Clovertown

• Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs,

6 GB main memory, 250 GB hard disk

• Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)

– Intel Westmere

• Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs,

12 GB main memory, 160 GB hard disk

• Network: 1GigE, IPoIB, and IB (QDR)

• Software

– Memcached Server: 1.4.9

– Memcached Client: (libmemcached) 0.52

– In all experiments, ‘memtable’ is contained in memory (no disk

access involved)

23 HPC Advisory Council, Lugano Switzerland '12

Page 24

24

• Memcached Get latency

– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us

– 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us

• Almost factor of four improvement over 10GE (TOE) for 2K bytes on

the DDR cluster

[Figure: Memcached Get latency for small messages, time (us) vs. message size (1 byte to 2K), on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR). Curves: SDP, IPoIB, OSU-RC-IB, 1GigE, 10GigE, OSU-UD-IB.]

Memcached Get Latency (Small Message)

HPC Advisory Council, Lugano Switzerland '12

Page 25

HPC Advisory Council, Lugano Switzerland '12 25

Memcached Get Latency (Large Message)

• Memcached Get latency

– 8K bytes RC/UD – DDR: 18.9/19.1 us; QDR: 11.8/12.2 us

– 512K bytes RC/UD -- DDR: 369/403 us; QDR: 173/203 us

• Almost factor of two improvement over 10GE (TOE) for 512K bytes on

the DDR cluster

[Figure: Memcached Get latency for large messages, time (us) vs. message size (2K to 512K), on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR). Curves: SDP, IPoIB, OSU-RC-IB, 1GigE, 10GigE, OSU-UD-IB.]

Page 26

HPC Advisory Council, Lugano Switzerland '12 26

Memcached Get TPS (4byte)

• Memcached Get transactions per second for 4 bytes

– On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients

• Significant improvement with native IB QDR compared to SDP and IPoIB

[Figure: Memcached Get throughput, thousands of transactions per second (TPS) vs. number of clients; left panel: 1 to 1K clients, right panel: 4 and 8 clients. Curves: SDP, IPoIB, OSU-RC-IB, 1GigE, OSU-UD-IB.]

Page 27

HPC Advisory Council, Lugano Switzerland '12 27

Memcached - Memory Scalability

• Steady Memory Footprint for UD Design

– ~ 200MB

• RC memory footprint increases with the number of clients

– ~500MB for 4K clients

[Figure: Memcached server memory footprint (MB) vs. number of clients (1 to 4K). Curves: SDP, IPoIB, OSU-RC-IB, 1GigE, OSU-UD-IB, OSU-Hybrid-IB.]

Page 28

HPC Advisory Council, Lugano Switzerland '12 28

Application Level Evaluation – Olio Benchmark

• Olio Benchmark

– RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients

• 4X better than IPoIB for 8 clients

• Hybrid design achieves comparable performance to that of pure RC design

[Figure: Olio benchmark time (ms) vs. number of clients; left panel: 64 to 1024 clients, right panel: 1 to 8 clients. Curves: SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, OSU-Hybrid-IB.]

Page 29

HPC Advisory Council, Lugano Switzerland '12 29

Application Level Evaluation – Real Application Workloads

• Real Application Workload – RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients

• 12X better than IPoIB for 8 clients

• Hybrid design achieves performance comparable to the pure RC design

[Figure: Real application workload time (ms) vs. number of clients; left panel: 1 to 8 clients, right panel: 64 to 1024 clients. Curves: SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, OSU-Hybrid-IB.]

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP ’11

J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, CCGrid ’12

Page 30

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies

– Memcached

– HBase

– HDFS

• Conclusion and Q&A

Presentation Outline

30 HPC Advisory Council, Lugano Switzerland '12

Page 31

HBase Design Using Verbs

31

[Figure: Current design: HBase → sockets → 1/10 GigE network. OSU design: HBase → OSU module → JNI interface → InfiniBand (verbs).]

HPC Advisory Council, Lugano Switzerland '12
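To show roughly what an "OSU module over a JNI interface" looks like from the Java side, here is a hedged sketch of a JNI binding that a Java middleware layer could use to reach a native verbs/RDMA transport. Every name here (library, class and methods) is hypothetical; the actual OSU module's API is not shown in these slides.

```java
// Hypothetical JNI binding from Java middleware to a native verbs/RDMA transport.
public final class RdmaTransport {
    static {
        // Loads libosurdma.so, a hypothetical native library implemented over IB verbs.
        System.loadLibrary("osurdma");
    }

    // Open an RDMA-capable endpoint to a remote host; returns an opaque native handle.
    public native long connect(String host, int port);

    // Send and receive byte buffers over the native transport.
    public native int send(long endpoint, byte[] buf, int offset, int length);
    public native int recv(long endpoint, byte[] buf, int offset, int length);

    // Tear down the endpoint and release native resources.
    public native void close(long endpoint);
}
```

Only the transport underneath the middleware is swapped; the layers above it stay unchanged, mirroring the figure.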

Page 32

Experimental Setup

• Hardware

– Intel Clovertown

• Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs,

6 GB main memory, 250 GB hard disk

• Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)

– Intel Westmere

• Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs,

12 GB main memory, 160 GB hard disk

• Network: 1GigE, IPoIB, and IB (QDR)

– 3 Nodes used

• Node1 [NameNode & HBase Master]

• Node2 [DataNode & HBase RegionServer]

• Node3 [Client]

• Software

– Hadoop 0.20.0, HBase 0.90.3 and Sun Java SDK 1.7.

– In all experiments, ‘memtable’ is contained in memory (no disk access

involved)

32 HPC Advisory Council, Lugano Switzerland '12

Page 33

Details on Experiments

• Key/Value size

– Key size: 20 Bytes

– Value size: 1KB/4KB

• Get operation

– One Key/Value pair is inserted, so that Key/Value pair will stay in

memory

– The Get operation is repeated 80,000 times

– The first 40,000 iterations are skipped as warm-up

• Put operation

– Memstore_Flush_Size is set to be 256 MB

– No memory flush operation involved

– The Put operation is repeated 40,000 times

– The first 10,000 iterations are skipped as warm-up (a timing-loop sketch follows below)

33 HPC Advisory Council, Lugano Switzerland '12
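The measurement methodology above (fixed key size, repeated operations, warm-up iterations discarded) can be sketched as the following Java micro-benchmark loop against the HBase client API. The table name and row key are hypothetical, and this is only an illustrative harness, not the one used for the results that follow.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseGetLatencyBench {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "benchtable");        // hypothetical table
        byte[] row = Bytes.toBytes("key-0000000000000001");   // ~20-byte key

        final int total = 80000;    // total Get operations, as on the slide
        final int warmup = 40000;   // iterations discarded as warm-up

        long elapsedNs = 0;
        for (int i = 0; i < total; i++) {
            long start = System.nanoTime();
            table.get(new Get(row));                // Key/Value pair stays in memory
            long end = System.nanoTime();
            if (i >= warmup) {
                elapsedNs += (end - start);         // only measured iterations count
            }
        }

        double avgUs = elapsedNs / 1000.0 / (total - warmup);
        System.out.printf("Average Get latency: %.1f us%n", avgUs);
        table.close();
    }
}
```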

Page 34

Get Operation (IB:DDR)

34

[Figure: HBase Get on the IB DDR cluster: latency, time (us) vs. message size (1K, 4K), and throughput, operations/sec vs. message size. Curves: 1GE, IPoIB, 10GE, OSU Design.]

• HBase Get Operation

– 1K bytes – 65 us (15K TPS)

– 4K bytes -- 88 us (11K TPS)

• Almost factor of two improvement over 10GE (TOE)

HPC Advisory Council, Lugano Switzerland '12

Page 35

Get Operation (IB:QDR)

35


• HBase Get Operation

– 1K bytes – 47 us (22K TPS)

– 4K bytes -- 64 us (16K TPS)

• Almost factor of four improvement over IPoIB for 1KB

[Figure: HBase Get on the IB QDR cluster: latency, time (us) vs. message size (1K, 4K), and throughput, operations/sec vs. message size. Curves: 1GE, IPoIB, OSU Design.]

HPC Advisory Council, Lugano Switzerland '12

Page 36

Put Operation (IB:DDR)

36

[Figure: HBase Put on the IB DDR cluster: latency, time (us) vs. message size (1K, 4K), and throughput, operations/sec vs. message size. Curves: 1GE, IPoIB, 10GE, OSU Design.]

• HBase Put Operation

– 1K bytes – 114 us (8.7K TPS)

– 4K bytes -- 179 us (5.6K TPS)

• 34% improvement over 10GE (TOE) for 1KB

HPC Advisory Council, Lugano Switzerland '12

Page 37

Put Operation (IB:QDR)

37

[Figure: HBase Put on the IB QDR cluster: latency, time (us) vs. message size (1K, 4K), and throughput, operations/sec vs. message size. Curves: 1GE, IPoIB, OSU Design.]

• HBase Put Operation

– 1K bytes – 78 us (13K TPS)

– 4K bytes -- 122 us (8K TPS)

• A factor of two improvement over IPoIB for 1KB

HPC Advisory Council, Lugano Switzerland '12

Page 38

HPC Advisory Council, Lugano Switzerland '12 38

HBase Put/Get – Detailed Analysis

• HBase 1KB Put: communication time of 8.9 us, a factor of 6X improvement over 10GE in communication time

• HBase 1KB Get: communication time of 8.9 us, a factor of 6X improvement over 10GE in communication time

[Figure: Time breakdown (us) of HBase 1KB Put and 1KB Get over 1GigE, IPoIB, 10GigE and OSU-IB, split into client serialization, client processing, server serialization, server processing, communication preparation and communication.]

W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS’12

Page 39

HPC Advisory Council, Lugano Switzerland '12 39

HBase Single Server-Multi-Client Results

• HBase Get latency

– 4 clients: 104.5 us; 16 clients: 296.1 us

• HBase Get throughput

– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec

• 27% improvement in throughput for 16 clients over 10GE

[Figure: HBase single-server, multi-client Get: latency, time (us) vs. number of clients (1 to 16), and throughput, ops/sec vs. number of clients. Curves: IPoIB, OSU-IB, 1GigE, 10GigE.]

Page 40

HPC Advisory Council, Lugano Switzerland '12 40

HBase – YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Serving Benchmark)

– 64 clients: 2.0 ms; 128 clients: 3.5 ms

– 42% improvement over IPoIB for 128 clients

• HBase Put latency

– 64 clients: 1.9 ms; 128 clients: 3.5 ms

– 40% improvement over IPoIB for 128 clients

[Figure: YCSB read latency and write latency, time (us) vs. number of clients (8 to 128). Curves: IPoIB, OSU-IB, 1GigE, 10GigE.]

J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12

Page 41

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies

– Memcached

– HBase

– HDFS

• Conclusion and Q&A

Presentation Outline

41 HPC Advisory Council, Lugano Switzerland '12

Page 42

• Two Kinds of Designs and Studies we have Done

– Studying the impact of HDD vs. SSD for HDFS

• Unmodified Hadoop for experiments

– Preliminary design of HDFS over Verbs

• Hadoop Experiments

– Intel Clovertown 2.33GHz, 6GB RAM, InfiniBand DDR, Chelsio T320

– Intel X-25E 64GB SSD and 250GB HDD

– Hadoop version 0.20.2, Sun/Oracle Java 1.6.0

– Dedicated NameNode and JobTracker

– Number of DataNodes used: 2, 4, and 8

42

HPC Advisory Council, Lugano Switzerland '12

Studies and Experimental Setup

Page 43

Hadoop: DFS IO Write Performance

43

• DFS IO, included in Hadoop, measures sequential-access throughput

• Two map tasks each write to a file of increasing size (1-10 GB)

• Significant improvement with IPoIB, SDP and 10GigE

• With SSD, the performance improvement is almost seven- to eight-fold

• SSD benefits are not seen without a high-performance interconnect

Four Data Nodes Using HDD and SSD

[Figure: Average write throughput (MB/sec) vs. file size (1-10 GB). Curves: 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]

HPC Advisory Council, Lugano Switzerland '12

Page 44

Hadoop: RandomWriter Performance

• Each map generates 1GB of random binary data and writes to HDFS

• SSD improves execution time by 50% with 1GigE for two DataNodes

• For four DataNodes, benefits are observed only with HPC interconnect

• IPoIB, SDP and 10GigE can improve performance by 59% on four Data Nodes

44

[Figure: RandomWriter execution time (sec) on 2 and 4 DataNodes. Bars: 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]

HPC Advisory Council, Lugano Switzerland '12

Page 45

Hadoop Sort Benchmark

45

• Sort: baseline benchmark for Hadoop

• Sort phase: I/O bound; Reduce phase: communication bound

• SSD improves performance by 28% using 1GigE with two DataNodes

• Benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE

[Figure: Sort execution time (sec) on 2 and 4 DataNodes. Bars: 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]

HPC Advisory Council, Lugano Switzerland '12

S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda “Can High-Performance Interconnects Benefit Hadoop Distributed File System?”, MASVDC ‘10 in conjunction with MICRO 2010, Atlanta, GA.

Page 46

HDFS Design Using Verbs

46

[Figure: Current design: HDFS → sockets → 1/10 GigE network. OSU design: HDFS → OSU module → JNI interface → InfiniBand (verbs).]

HPC Advisory Council, Lugano Switzerland '12

Page 47

HPC Advisory Council, Lugano Switzerland '12 47

RDMA-based Design for Native HDFS – Preliminary Results

• HDFS file write experiment using four DataNodes on the IB DDR cluster

• HDFS file write time

– 2 GB: 14 s; 5 GB: 86 s

– For the 5 GB file size: 20% improvement over IPoIB and 14% improvement over 10GigE

[Figure: HDFS file write time vs. file size (1-5 GB). Curves: 1 GigE, IPoIB, 10 GigE, OSU-Design.]

Page 48

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies

– Memcached

– HBase

– HDFS

• Conclusion and Q&A

Presentation Outline

48 HPC Advisory Council, Lugano Switzerland '12

Page 49

• InfiniBand with its RDMA feature is gaining momentum in HPC systems, offering the best performance and growing usage

• It is possible to use the RDMA feature in enterprise environments

for accelerating big data processing

• Presented some initial designs and performance numbers

• Many open research challenges remain to be solved so that

middleware for enterprise environments can take advantage of

– modern high-performance networks

– multi-core technologies

– emerging storage technologies

49

Concluding Remarks

HPC Advisory Council, Lugano Switzerland '12

Page 50

Designing Communication and I/O Libraries for Enterprise Systems: Solved a Few Initial Challenges

HPC Advisory Council, Lugano Switzerland '12 50

[Figure: Layered view, as on the earlier challenges slide. Applications run over datacenter middleware (HDFS, HBase, MapReduce, Memcached) and programming models (sockets), which sit on a communication and I/O library covering point-to-point communication, threading models and synchronization, QoS, fault tolerance, and I/O and file systems. Underneath are commodity computing system architectures (single, dual, quad, ..), multi-/many-core architectures and accelerators, networking technologies (InfiniBand, 1/10/40 GigE, RNICs and intelligent NICs), and storage technologies (HDD or SSD).]