Acceleration for Big Data, Hadoop and
Memcached
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
A Presentation at HPC Advisory Council Workshop, Lugano 2012
Recap of Last Two Days' Presentations
• MPI is a dominant programming model for HPC systems
• Introduced some of the MPI features and their usage
• Introduced the MVAPICH2 stack
• Illustrated many performance optimizations and tuning techniques for MVAPICH2
• Provided an overview of MPI-3 features
• Introduced challenges in designing MPI for Exascale systems
• Presented approaches being taken by MVAPICH2 for Exascale systems
High-Performance Networks in the Top500
[Chart: interconnect family share of Top500 systems over recent lists.]
• The percentage share of InfiniBand is steadily increasing
Use of High-Performance Networks for Scientific Computing
• The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems
• Message Passing Interface (MPI)
• Parallel File Systems
• Almost 11.5 years of research and development since InfiniBand was introduced in October 2000
• Other programming models are emerging to take advantage of high-performance networks
– UPC
– SHMEM
One-way Latency: MPI over IB
[Figure: one-way MPI latency vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR and MVAPICH-ConnectX3-PCIe3-FDR. Small-message panel: latencies of 1.66, 1.56, 1.64, 1.82 and 0.81 us across the five configurations; large-message panel: latency (us) vs. message size.]
DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch; FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, without IB switch
Bandwidth: MPI over IB
[Figure: MPI bandwidth vs. message size (bytes) for the same five configurations. Unidirectional panel: peak bandwidths of 3280, 3385, 1917, 1706 and 6333 MBytes/sec; bidirectional panel: peak bandwidths of 3341, 3704, 4407, 11043 and 6521 MBytes/sec.]
DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, with IB switch; FDR: 2.6 GHz octa-core (Sandy Bridge) Intel, PCIe Gen3, without IB switch
Large-scale InfiniBand Installations
• 209 IB clusters (41.8%) in the November '11 Top500 list (http://www.top500.org)
• Installations in the Top 30 (13 systems):
– 120,640 cores (Nebulae) in China (4th)
– 73,278 cores (Tsubame-2.0) in Japan (5th)
– 111,104 cores (Pleiades) at NASA Ames (7th)
– 138,368 cores (Tera-100) in France (9th)
– 122,400 cores (RoadRunner) at LANL (10th)
– 137,200 cores (Sunway Blue Light) in China (14th)
– 46,208 cores (Zin) at LLNL (15th)
– 33,072 cores (Lomonosov) in Russia (18th)
– 29,440 cores (Mole-8.5) in China (21st)
– 42,440 cores (Red Sky) at Sandia (24th)
– 62,976 cores (Ranger) at TACC (25th)
– 20,480 cores (Bull Benchmarks) in France (27th)
– 20,480 cores (Helios) in Japan (28th)
• More are getting installed!
Enterprise/Commercial Computing
• Focuses on big data and data analytics
• Multiple environments and middleware are gaining momentum
– Hadoop (HDFS, HBase and MapReduce)
– Memcached
Can High-Performance Interconnects Benefit Enterprise Computing?
• Most current enterprise systems use 1GigE
• Concerns about performance and scalability
• Usage of high-performance networks is beginning to draw interest
– Oracle, IBM and Google are working along these directions
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?
Presentation Outline
• Overview of Hadoop, Memcached and HBase
• Challenges in Accelerating Enterprise Middleware
• Designs and Case Studies
– Memcached
– HBase
– HDFS
• Conclusion and Q&A
Memcached Architecture
• Integral part of Web 2.0 architecture
• Distributed Caching Layer
– Allows aggregation of spare memory from multiple nodes
– General purpose
• Typically used to cache database queries and results of API calls (see the client sketch below)
• Scalable model, but typical usage is very network-intensive
[Figure: Memcached architecture: Internet traffic reaches web frontend servers (Memcached clients), which communicate over high-performance networks with Memcached servers (CPUs, main memory, SSD/HDD) and with database servers.]
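As a concrete illustration of the cache-aside usage described above, here is a minimal client sketch using the libmemcached C API (the same client library, version 0.52, that appears in the experiments later in this talk). The server hostname, key and value are illustrative, not from the original deployment.

```c
#include <libmemcached/memcached.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Connect to one Memcached server; real deployments aggregate
       spare memory from many such servers. */
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "memc-server1", 11211); /* hypothetical host */

    const char *key = "user:42:profile";
    size_t value_len;
    uint32_t flags;
    memcached_return_t rc;

    /* Try the cache first. */
    char *value = memcached_get(memc, key, strlen(key),
                                &value_len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS) {
        printf("cache hit: %.*s\n", (int)value_len, value);
        free(value);
    } else {
        /* Cache miss: fetch from the database (stubbed here)
           and populate the cache with a 5-minute expiry. */
        const char *db_value = "{\"name\":\"...\"}";
        memcached_set(memc, key, strlen(key),
                      db_value, strlen(db_value),
                      (time_t)300, (uint32_t)0);
    }

    memcached_free(memc);
    return 0;
}
```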
Hadoop Architecture
• Underlying Hadoop Distributed File System (HDFS)
• Fault tolerance by replicating data blocks
• NameNode: stores metadata about data blocks
• DataNodes: store blocks and host MapReduce computation
• JobTracker: tracks jobs and detects failures
• Model scales, but there is a high amount of communication during intermediate phases (a write-path sketch follows)
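To make the write path concrete, here is a minimal sketch using libhdfs, the C bindings shipped with Hadoop: the client asks the NameNode to create the file, and the data then flows to the DataNodes with the requested replication. The NameNode address, path and replication factor below are illustrative assumptions.

```c
#include <hdfs.h>    /* libhdfs, shipped with Hadoop */
#include <fcntl.h>
#include <string.h>

int main(void)
{
    /* Connect to the NameNode (host and port are illustrative). */
    hdfsFS fs = hdfsConnect("namenode-host", 9000);
    if (!fs) return 1;

    /* Create a file with replication factor 3 -- the block
       replication that provides the fault tolerance noted above. */
    hdfsFile out = hdfsOpenFile(fs, "/tmp/example.dat",
                                O_WRONLY | O_CREAT,
                                0 /* default buffer size */,
                                3 /* replication */,
                                0 /* default block size */);
    if (!out) { hdfsDisconnect(fs); return 1; }

    const char *buf = "hello hdfs";
    hdfsWrite(fs, out, buf, (tSize)strlen(buf));
    hdfsFlush(fs, out);
    hdfsCloseFile(fs, out);
    hdfsDisconnect(fs);
    return 0;
}
```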
Network-Level Interaction Between Clients and Data Nodes in HDFS
[Figure: HDFS clients interacting with HDFS DataNodes (HDD/SSD) over high-performance networks.]
Overview of HBase Architecture
• An open-source database project based on the Hadoop framework for hosting very large tables
• Major components: HBaseMaster, HRegionServer and HBaseClient
• HBase and HDFS are deployed on the same cluster to get better data locality
Network-Level Interaction Between HBase Clients, Region Servers and Data Nodes
[Figure: HBase clients interacting with HRegion servers, which in turn interact with DataNodes (HDD/SSD), all over high-performance networks.]
Presentation Outline
• Overview of Hadoop, Memcached and HBase
• Challenges in Accelerating Enterprise Middleware
• Designs and Case Studies
– Memcached
– HBase
– HDFS
• Conclusion and Q&A
Designing Communication and I/O Libraries for Enterprise Systems: Challenges
[Stack diagram, top to bottom:
– Applications
– Datacenter Middleware (HDFS, HBase, MapReduce, Memcached)
– Programming Models (Sockets)
– Communication and I/O Library: point-to-point communication, threading models and synchronization, QoS, fault tolerance, I/O and file systems
– Commodity Computing System Architectures (single, dual, quad, ..), multi-/many-core architectures and accelerators
– Networking Technologies (InfiniBand, 1/10/40 GigE, RNICs & Intelligent NICs)
– Storage Technologies (HDD or SSD)]
Common Protocols using OpenFabrics
[Diagram: protocol options across the sockets and verbs application interfaces:
– 1/10/40 GigE: sockets, kernel-space TCP/IP, Ethernet adapter, Ethernet switch
– 10/40 GigE-TOE: sockets, TCP/IP with hardware offload, Ethernet adapter, Ethernet switch
– IPoIB: sockets, kernel-space IPoIB, InfiniBand adapter, InfiniBand switch
– SDP: sockets, user-space RDMA, InfiniBand adapter, InfiniBand switch
– iWARP: user-space RDMA, iWARP adapter, Ethernet switch
– RoCE: user-space RDMA, RoCE adapter, Ethernet switch
– IB Verbs: verbs, user-space RDMA, InfiniBand adapter, InfiniBand switch]
Can New Data Analysis and Management Systems be Designed with High-Performance Networks and Protocols?
[Diagram: three design points:
– Current design: Application → Sockets → 1/10 GigE network
– Enhanced designs: Application → Accelerated Sockets (verbs / hardware offload) → 10 GigE or InfiniBand
– Our approach: Application → OSU Design (verbs interface) → 10 GigE or InfiniBand]
• Sockets were not designed for high performance
– Stream semantics often mismatch the upper layers (Memcached, HBase, Hadoop)
– Zero-copy is not available for non-blocking sockets
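For contrast with the sockets path, here is a minimal sketch of a verbs-level transfer: the buffer is registered once with ibv_reg_mr() and then moved with a one-sided, zero-copy RDMA write. Connection setup and the out-of-band exchange of the remote address and rkey are assumed to have happened elsewhere; this illustrates the interface, not the OSU design itself.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a zero-copy RDMA write of an already-registered buffer.
 * Assumes qp is connected and (remote_addr, rkey) were exchanged
 * out of band during connection setup. Returns 0 on success. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,          /* from ibv_reg_mr() */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided, zero-copy */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```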
Interplay between Storage and Interconnect/Protocols
• Most current-generation enterprise systems use traditional hard disks
• Since hard disks are slow relative to the network, high-performance communication protocols may have little impact
• SSDs and other storage technologies are emerging
• Does this change the landscape?
Presentation Outline
• Overview of Hadoop, Memcached and HBase
• Challenges in Accelerating Enterprise Middleware
• Designs and Case Studies
– Memcached
– HBase
– HDFS
• Conclusion and Q&A
Memcached Design Using Verbs
• Server and client perform a negotiation protocol
– The master thread assigns clients to an appropriate worker thread
• Once a client is assigned to a verbs worker thread, it is “bound” to that thread and communicates with it directly; each verbs worker thread can support multiple clients (a connection sketch follows the diagram below)
• All other Memcached data structures are shared between the RDMA and sockets worker threads
• Memcached applications need not be modified; the verbs interface is used if available
• A Memcached server can serve both sockets and verbs clients simultaneously
[Diagram: a sockets client and an RDMA client first contact the master thread (step 1), which hands each off to a sockets or verbs worker thread for direct communication (step 2); all worker threads share the Memcached data structures (memory slabs, items).]
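The client side of such a negotiation can be sketched with the RDMA connection manager (librdmacm): the client connects to the listening master thread, and the accepted connection then belongs to the worker thread it is bound to. This is a generic illustration under those assumptions, not the actual OSU protocol; the host and port are hypothetical.

```c
#include <rdma/rdma_cma.h>
#include <stdio.h>

/* Resolve the server, create a reliable-connection (RC) QP and
 * connect; returns the connected id, or NULL on failure. */
struct rdma_cm_id *connect_to_server(const char *host, const char *port)
{
    struct rdma_addrinfo hints = { .ai_port_space = RDMA_PS_TCP }, *res;
    if (rdma_getaddrinfo(host, port, &hints, &res))
        return NULL;

    struct ibv_qp_init_attr attr = {
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
        .sq_sig_all = 1,
    };

    struct rdma_cm_id *id = NULL;
    /* rdma_create_ep resolves the address and creates the QP. */
    if (rdma_create_ep(&id, res, NULL, &attr)) {
        rdma_freeaddrinfo(res);
        return NULL;
    }
    rdma_freeaddrinfo(res);

    if (rdma_connect(id, NULL)) {   /* handshake with the listening master */
        rdma_destroy_ep(id);
        return NULL;
    }
    return id;  /* now bound to the worker that accepted the connection */
}
```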
Experimental Setup
• Hardware
– Intel Clovertown
• Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs,
6 GB main memory, 250 GB hard disk
• Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)
– Intel Westmere
• Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs,
12 GB main memory, 160 GB hard disk
• Network: 1GigE, IPoIB, and IB (QDR)
• Software
– Memcached Server: 1.4.9
– Memcached Client: (libmemcached) 0.52
– In all experiments, data is fully contained in memory (no disk access involved)
Memcached Get Latency (Small Message)
[Figure: Memcached Get latency (us) vs. message size (1 byte to 2K) for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, 1GigE and 10GigE. Left panel: Intel Clovertown cluster (IB: DDR); right panel: Intel Westmere cluster (IB: QDR).]
• Memcached Get latency
– 4 bytes, RC/UD: DDR 6.82/7.55 us; QDR 4.28/4.86 us
– 2K bytes, RC/UD: DDR 12.31/12.78 us; QDR 8.19/8.46 us
• Almost a factor-of-four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster
Memcached Get Latency (Large Message)
[Figure: Memcached Get latency (us) vs. message size (2K to 512K) for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, 1GigE and 10GigE. Left panel: Intel Clovertown cluster (IB: DDR); right panel: Intel Westmere cluster (IB: QDR).]
• Memcached Get latency
– 8K bytes, RC/UD: DDR 18.9/19.1 us; QDR 11.8/12.2 us
– 512K bytes, RC/UD: DDR 369/403 us; QDR 173/203 us
• Almost a factor-of-two improvement over 10GigE (TOE) for 512K bytes on the DDR cluster
Memcached Get TPS (4-byte)
[Figure: thousands of transactions per second (TPS) vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and 1GigE; left panel 1 to 1K clients, right panel 4 and 8 clients.]
• Memcached Get transactions per second for 4 bytes
– On IB QDR: 1.4M TPS (RC) and 1.3M TPS (UD) for 8 clients
• Significant improvement with native IB QDR compared to SDP and IPoIB
Memcached - Memory Scalability
[Figure: memory footprint (MB) vs. number of clients (1 to 4K) for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB, 1GigE and OSU-Hybrid-IB.]
• Steady memory footprint for the UD design: ~200 MB
• The RC memory footprint increases with the number of clients: ~500 MB for 4K clients
Application Level Evaluation – Olio Benchmark
[Figure: execution time (ms) vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB; panels for 1-8 clients and 64-1024 clients.]
• Olio benchmark
– RC: 1.6 sec; UD: 1.9 sec; Hybrid: 1.7 sec for 1024 clients
• 4X better than IPoIB for 8 clients
• The hybrid design achieves performance comparable to the pure RC design
Application Level Evaluation – Real Application Workloads
[Figure: execution time (ms) vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB; panels for 1-8 clients and 64-1024 clients.]
• Real application workload
– RC: 302 ms; UD: 318 ms; Hybrid: 314 ms for 1024 clients
• 12X better than IPoIB for 8 clients
• The hybrid design achieves performance comparable to the pure RC design

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP '11
J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula and D. K. Panda, Scalable Memcached Design for InfiniBand Clusters using Hybrid Transports, CCGrid '12
Presentation Outline
• Overview of Hadoop, Memcached and HBase
• Challenges in Accelerating Enterprise Middleware
• Designs and Case Studies
– Memcached
– HBase
– HDFS
• Conclusion and Q&A
HBase Design Using Verbs
[Diagram: Current design: HBase over sockets over a 1/10 GigE network. OSU design: HBase over an OSU module with a JNI interface to InfiniBand (verbs).]
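A JNI bridge of this kind pairs Java-side native method declarations with a C implementation that drives the verbs layer. The package, class, method and function names below are hypothetical, purely to illustrate the structure; the actual OSU module differs.

```c
#include <jni.h>
#include <stdint.h>

/* C side of a hypothetical JNI bridge. The Java side would declare:
 *
 *   package edu.example.rdma;                 // hypothetical package
 *   class RdmaTransport {
 *       native long connect(String host, int port);
 *       native int  send(long conn, byte[] data, int len);
 *   }
 *
 * and load this library with System.loadLibrary("rdmatransport").
 */

JNIEXPORT jlong JNICALL
Java_edu_example_rdma_RdmaTransport_connect(JNIEnv *env, jobject self,
                                            jstring host, jint port)
{
    const char *chost = (*env)->GetStringUTFChars(env, host, NULL);
    /* ... establish a verbs connection (e.g., via rdma_cm) ... */
    (*env)->ReleaseStringUTFChars(env, host, chost);
    return (jlong)(intptr_t)NULL;  /* would return a connection handle */
}

JNIEXPORT jint JNICALL
Java_edu_example_rdma_RdmaTransport_send(JNIEnv *env, jobject self,
                                         jlong conn, jbyteArray data, jint len)
{
    jbyte *buf = (*env)->GetByteArrayElements(env, data, NULL);
    /* ... post the buffer on the verbs connection identified by conn ... */
    (*env)->ReleaseByteArrayElements(env, data, buf, JNI_ABORT);
    return 0;
}
```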
Experimental Setup
• Hardware
– Intel Clovertown
• Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs,
6 GB main memory, 250 GB hard disk
• Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)
– Intel Westmere
• Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs,
12 GB main memory, 160 GB hard disk
• Network: 1GigE, IPoIB, and IB (QDR)
– 3 Nodes used
• Node1 [NameNode & HBase Master]
• Node2 [DataNode & HBase RegionServer]
• Node3 [Client]
• Software
– Hadoop 0.20.0, HBase 0.90.3 and Sun Java SDK 1.7.
– In all experiments, ‘memtable’ is contained in memory (no disk access
involved)
Details on Experiments
• Key/value size
– Key size: 20 bytes
– Value size: 1KB/4KB
• Get operation
– One key/value pair is inserted, so that the key/value pair stays in memory
– The Get operation is repeated 80,000 times
– The first 40,000 iterations are skipped as warm-up
• Put operation
– Memstore_Flush_Size is set to 256 MB
– No memory flush operation is involved
– The Put operation is repeated 40,000 times
– The first 10,000 iterations are skipped as warm-up
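The measurement methodology above (warm-up iterations followed by timed iterations) can be sketched as follows; kv_get stands in for one HBase client Get over the wire and is hypothetical.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for one HBase Get over the network. */
extern int kv_get(const char *key, char *value, int value_len);

int main(void)
{
    enum { WARMUP = 40000, TOTAL = 80000 };
    char value[4096];

    /* Warm-up: populate caches, JIT and connection state; not timed.
       The 20-byte key matches the key size used in the experiments. */
    for (int i = 0; i < WARMUP; i++)
        kv_get("row-0000000000000042", value, sizeof(value));

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = WARMUP; i < TOTAL; i++)
        kv_get("row-0000000000000042", value, sizeof(value));
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6
              + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("avg Get latency: %.2f us\n", us / (TOTAL - WARMUP));
    return 0;
}
```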
Get Operation (IB:DDR)
[Figure: latency (us) and throughput (operations/sec) for 1K and 4K message sizes, comparing 1GigE, IPoIB, 10GigE and the OSU design.]
• HBase Get operation
– 1K bytes: 65 us (15K TPS)
– 4K bytes: 88 us (11K TPS)
• Almost a factor-of-two improvement over 10GigE (TOE)
Get Operation (IB:QDR)
[Figure: latency (us) and throughput (operations/sec) for 1K and 4K message sizes, comparing 1GigE, IPoIB and the OSU design.]
• HBase Get operation
– 1K bytes: 47 us (22K TPS)
– 4K bytes: 64 us (16K TPS)
• Almost a factor-of-four improvement over IPoIB for 1KB
Put Operation (IB:DDR)
[Figure: latency (us) and throughput (operations/sec) for 1K and 4K message sizes, comparing 1GigE, IPoIB, 10GigE and the OSU design.]
• HBase Put operation
– 1K bytes: 114 us (8.7K TPS)
– 4K bytes: 179 us (5.6K TPS)
• 34% improvement over 10GigE (TOE) for 1KB
Put Operation (IB:QDR)
[Figure: latency (us) and throughput (operations/sec) for 1K and 4K message sizes, comparing 1GigE, IPoIB and the OSU design.]
• HBase Put operation
– 1K bytes: 78 us (13K TPS)
– 4K bytes: 122 us (8K TPS)
• A factor-of-two improvement over IPoIB for 1KB
HBase Put/Get – Detailed Analysis
[Figure: time breakdown (us) of HBase 1KB Put and 1KB Get across 1GigE, IPoIB, 10GigE and OSU-IB, split into communication, communication preparation, server processing, server serialization, client processing and client serialization.]
• HBase 1KB Put: communication time of 8.9 us, a 6X improvement in communication time over 10GigE
• HBase 1KB Get: communication time of 8.9 us, a 6X improvement in communication time over 10GigE

W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS '12
HBase Single Server-Multi-Client Results
[Figure: latency (us) and throughput (ops/sec) vs. number of clients (1 to 16) for IPoIB, OSU-IB, 1GigE and 10GigE.]
• HBase Get latency
– 4 clients: 104.5 us; 16 clients: 296.1 us
• HBase Get throughput
– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
• 27% improvement in throughput for 16 clients over 10GigE
HBase – YCSB Read-Write Workload
[Figure: read latency and write latency (us) vs. number of clients (8 to 128) for IPoIB, OSU-IB, 1GigE and 10GigE.]
• HBase Get latency (Yahoo! Cloud Serving Benchmark)
– 64 clients: 2.0 ms; 128 clients: 3.5 ms
– 42% improvement over IPoIB for 128 clients
• HBase Put latency
– 64 clients: 1.9 ms; 128 clients: 3.5 ms
– 40% improvement over IPoIB for 128 clients

J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS '12
Presentation Outline
• Overview of Hadoop, Memcached and HBase
• Challenges in Accelerating Enterprise Middleware
• Designs and Case Studies
– Memcached
– HBase
– HDFS
• Conclusion and Q&A
Studies and Experimental Setup
• Two kinds of designs and studies we have done
– Studying the impact of HDD vs. SSD for HDFS
• Unmodified Hadoop for the experiments
– Preliminary design of HDFS over Verbs
• Hadoop experiments
– Intel Clovertown 2.33 GHz, 6 GB RAM, InfiniBand DDR, Chelsio T320
– Intel X-25E 64 GB SSD and 250 GB HDD
– Hadoop version 0.20.2, Sun/Oracle Java 1.6.0
– Dedicated NameNode and JobTracker
– Number of DataNodes used: 2, 4 and 8
Hadoop: DFS IO Write Performance
[Figure: average write throughput (MB/sec) vs. file size (1-10 GB) on four DataNodes, with HDD and SSD, for 1GigE, IPoIB, SDP and 10GigE-TOE.]
• DFS IO, included in Hadoop, measures sequential access throughput
• We have two map tasks, each writing to a file of increasing size (1-10 GB)
• Significant improvement with IPoIB, SDP and 10GigE
• With SSD, the performance improvement is almost seven- or eight-fold!
• SSD benefits are not seen without a high-performance interconnect
Hadoop: RandomWriter Performance
• Each map generates 1 GB of random binary data and writes it to HDFS
• SSD improves execution time by 50% with 1GigE for two DataNodes
• For four DataNodes, benefits are observed only with HPC interconnects
• IPoIB, SDP and 10GigE can improve performance by 59% on four DataNodes
[Figure: execution time (sec) on 2 and 4 DataNodes, with HDD and SSD, for 1GigE, IPoIB, SDP and 10GigE-TOE.]
Hadoop Sort Benchmark
[Figure: execution time (sec) on 2 and 4 DataNodes, with HDD and SSD, for 1GigE, IPoIB, SDP and 10GigE-TOE.]
• Sort: baseline benchmark for Hadoop
• Sort phase: I/O-bound; reduce phase: communication-bound
• SSD improves performance by 28% using 1GigE with two DataNodes
• Benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE

S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda, Can High-Performance Interconnects Benefit Hadoop Distributed File System?, MASVDC '10 (in conjunction with MICRO 2010), Atlanta, GA
HDFS Design Using Verbs
[Diagram: Current design: HDFS over sockets over a 1/10 GigE network. OSU design: HDFS over an OSU module with a JNI interface to InfiniBand (verbs).]
RDMA-based Design for Native HDFS – Preliminary Results
[Figure: HDFS file write time (sec) vs. file size (1-5 GB) for 1GigE, IPoIB, 10GigE and the OSU design.]
• HDFS file write experiment using four DataNodes on the IB-DDR cluster
• HDFS file write time
– 2 GB: 14 s; 5 GB: 86 s
– For the 5 GB file size: 20% improvement over IPoIB and 14% improvement over 10GigE
Presentation Outline
• Overview of Hadoop, Memcached and HBase
• Challenges in Accelerating Enterprise Middleware
• Designs and Case Studies
– Memcached
– HBase
– HDFS
• Conclusion and Q&A
Concluding Remarks
• InfiniBand with its RDMA feature is gaining momentum in HPC systems, with best performance and growing usage
• It is possible to use the RDMA feature in enterprise environments for accelerating big-data processing
• Presented some initial designs and performance numbers
• Many open research challenges remain to be solved so that middleware for enterprise environments can take advantage of
– modern high-performance networks
– multi-core technologies
– emerging storage technologies
Designing Communication and I/O Libraries for Enterprise Systems: Solved a Few Initial Challenges
[Stack diagram: the same layered architecture shown earlier: applications; datacenter middleware (HDFS, HBase, MapReduce, Memcached); programming models (sockets); communication and I/O library (point-to-point communication, threading models and synchronization, QoS, fault tolerance, I/O and file systems); commodity computing system architectures (single, dual, quad, ..), multi-/many-core architectures and accelerators; networking technologies (InfiniBand, 1/10/40 GigE, RNICs & intelligent NICs); storage technologies (HDD or SSD).]
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu