A Comparative Performance Evaluation of Apache Flink
TRANSCRIPT
Dongwon Kim, POSTECH
2
About Me
• Postdoctoral researcher @ POSTECH
• Research interests
  • Design and implementation of distributed systems
  • Performance optimization of big data processing engines
• Doctoral thesis
  • MR2: Fault Tolerant MapReduce with the Push Model
• Personal blog
  • http://eastcirclek.blogspot.kr
• Why I'm here
3
Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
• What else for better performance?
• Conclusion
4
TeraSort
• Hadoop MapReduce program for the annual terabyte sort competition
• TeraSort is essentially distributed sort (DS)
[Diagram: typical DS phases — read → local sort → shuffling → local sort → write. Records a1–a4 and b1–b4 start scattered across the disks of Node 1 and Node 2 and end up range-partitioned and sorted, so the concatenated outputs form a total order. A toy rendering follows below.]
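To make the phases concrete, here is a toy, single-process rendering of the diagram above (an illustration only, not engine code): each "node" sorts its local records, records are shuffled into key ranges by first letter, and each range is sorted again, so the concatenated partitions are totally ordered.

```scala
object DistributedSortSketch {
  def main(args: Array[String]): Unit = {
    // Local inputs on two "nodes" (record names as in the diagram)
    val node1 = Vector("a4", "b3", "a1", "b2")
    val node2 = Vector("a2", "b1", "a3", "b4")

    // read + local sort on each node
    val locallySorted = Seq(node1, node2).map(_.sorted)

    // shuffling: range-partition by first letter ("a…" -> node 1, "b…" -> node 2)
    val partitions = locallySorted.flatten.groupBy(_.head).toSeq.sortBy(_._1)

    // local sort + write per partition; the concatenation is a total order
    partitions.foreach { case (range, records) =>
      println(s"partition '$range': ${records.sorted.mkString(", ")}")
    }
  }
}
```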
5
TeraSort for MapReduce
• Included in Hadoop distributions, with TeraGen & TeraValidate
• Identity map & reduce functions
• Range partitioner built on sampling (see the sketch below)
  • Guarantees a total order & prevents partition skew
  • Sampling computes the boundary points within a few seconds
[Diagram: a Map task (read → map → sort) and a Reduce task (shuffling → sort → reduce → write) laid over the DS phases (read → local sort → shuffling → local sort → write); boundary points split the record range into Partition 1 … Partition r.]
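The idea behind the sampling-based partitioner can be sketched as follows (a minimal illustration; the real TeraSort samples the input up front, writes the boundary points to a file, and looks keys up in a trie rather than a linear scan):

```scala
object RangePartitionerSketch {
  // Pick numPartitions-1 boundary points from a sorted sample of keys.
  def boundaries(sample: Seq[String], numPartitions: Int): Seq[String] = {
    val sorted = sample.sorted
    (1 until numPartitions).map(i => sorted((i * sorted.size) / numPartitions))
  }

  // Partition index = number of boundary points <= key.
  def partition(key: String, bounds: Seq[String]): Int =
    bounds.count(_ <= key)

  def main(args: Array[String]): Unit = {
    val keys = Seq("pear", "apple", "kiwi", "fig", "plum", "date", "lime")
    val bounds = boundaries(keys, numPartitions = 3) // sample = all keys here
    val parts = keys.groupBy(partition(_, bounds))
    // Every key in partition i sorts before every key in partition i+1,
    // so sorting each partition locally yields a total order.
    println(parts.toSeq.sortBy(_._1).mkString("\n"))
  }
}
```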
6
TeraSort for Tez
• Tez can execute TeraSort for MapReduce w/o any modification
  • mapreduce.framework.name = yarn-tez
• Tez DAG plan of TeraSort for MapReduce:
[Diagram: input data → initialmap vertex (Map task: read → map → sort) → finalreduce vertex (Reduce task: shuffling → sort → reduce → write) → output data]
7
TeraSort for Spark & Flink
• My source code on GitHub:
  • https://github.com/eastcirclek/terasort
• Sampling-based range partitioner from TeraSort for MapReduce
• Visit my personal blog for a detailed explanation:
  • http://eastcirclek.blogspot.kr
8
TeraSort for Spark
• Code: two RDDs (a sketch follows below)
• Stage 0 — Shuffle-Map Task (for newAPIHadoopFile): read → sort
  • Creates a new RDD to read from HDFS (# partitions = # blocks)
• Stage 1 — Result Task (for repartitionAndSortWithinPartitions): shuffling → sort → write
  • Repartitions the parent RDD based on the user-specified partitioner
  • Writes output to HDFS
DS phases: read → local sort → shuffling → local sort → write
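A minimal Spark sketch of these two stages (assumptions: TeraInputFormat/TeraOutputFormat come from the Hadoop examples jar, and Spark's built-in sampling RangePartitioner stands in for the partitioner ported from TeraSort for MapReduce; the speaker's actual code is in the GitHub repo above):

```scala
import org.apache.hadoop.examples.terasort.{TeraInputFormat, TeraOutputFormat}
import org.apache.hadoop.io.Text
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

object SparkTeraSort {
  def main(args: Array[String]): Unit = {
    val Array(in, out, parts) = args
    val sc = new SparkContext(new SparkConf().setAppName("TeraSort"))

    // String order stands in for raw byte order, to keep the sketch short.
    implicit val keyOrder: Ordering[Text] = Ordering.by((t: Text) => t.toString)

    // Stage 0 (Shuffle-Map): one partition per HDFS block; clone each Text
    // because Hadoop record readers reuse the same objects.
    val records = sc
      .newAPIHadoopFile[Text, Text, TeraInputFormat](in)
      .map { case (k, v) => (new Text(k), new Text(v)) }

    // Stage 1 (Result): repartition by sampled key ranges, sort within each
    // partition during the shuffle, and write straight back to HDFS.
    records
      .repartitionAndSortWithinPartitions(
        new RangePartitioner(parts.toInt, records))
      .saveAsNewAPIHadoopFile[TeraOutputFormat](out)
  }
}
```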
9
TeraSort for Flink
• Code: a pipeline consisting of four operators (a sketch follows below)
• DataSource: create a dataset to read tuples from HDFS
• Partition: partition tuples
• SortPartition: sort tuples of each partition
• DataSink: write output to HDFS
• No map-side sorting, due to pipelined execution
DS phases: read → shuffling → local sort → write
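A minimal Flink sketch of the four-operator pipeline (assumptions: text-line records with a 10-character key prefix, and a toy first-byte partitioner in place of the real sampling-based one; see the GitHub repo above for the actual implementation):

```scala
import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala._

object FlinkTeraSort {
  // Toy range partitioner on the key's first byte; the real job derives
  // boundary points by sampling, as in TeraSort for MapReduce.
  class FirstByteRange extends Partitioner[String] {
    override def partition(key: String, numPartitions: Int): Int = {
      val b = if (key.isEmpty) 0 else key.charAt(0).toInt & 0xff
      (b * numPartitions) / 256
    }
  }

  def main(args: Array[String]): Unit = {
    val Array(input, output) = args
    val env = ExecutionEnvironment.getExecutionEnvironment

    env.readTextFile(input)                        // 1 DataSource
      .map(line => (line.take(10), line.drop(10))) // (key, value) tuples
      .partitionCustom(new FirstByteRange, 0)      // 2 Partition (pipelined)
      .sortPartition(0, Order.ASCENDING)           // 3 SortPartition
      .writeAsCsv(output)                          // 4 DataSink
    env.execute("TeraSort")
  }
}
```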
10
Importance of TeraSort
• Suitable for measuring the pure performance of big data engines
  • No data transformation (like map, filter) with user-defined logic
  • Only the basic facilities of each engine are used
• "Winning the sort benchmark" is a great means of PR
11
Outline
• TeraSort for various engines
• Experimental setup
  • Machine specification
  • Node configuration
• Results & analysis
• What else for better performance?
• Conclusion
12
Machine specification (42 identical machines)
DELL PowerEdge R610
• CPU: two Intel Xeon X5650 processors (12 cores total)
• Memory: 24GB total
• Disk: 6 disks × 500GB/disk
• Network: 10 Gigabit Ethernet

             My machines            Spark team's machines
Processor    Intel Xeon X5650       Intel Xeon E5-2670
             (Q1 2010)              (Q1 2012)
Cores        6 × 2 processors       8 × 4 processors
Memory       24GB                   244GB
Disks        6 HDDs                 8 SSDs

Results can be different on newer machines.
13
Node configuration (24GB on each node)
• Total 2GB for daemons on every node: NodeManager (1GB) + DataNode (1GB)
• At most 12 simultaneous tasks per node

MapReduce-2.7.1 / Tez-0.7.0:
• NodeManager (1GB) hosting the Shuffle Service; DataNode (1GB)
• Map tasks (1GB each) and Reduce tasks (1GB each) fill the remaining memory (labeled 12GB for MapReduce and 13GB for Tez in the diagram)

Spark-1.5.1:
• NodeManager (1GB); DataNode (1GB)
• One Executor (12GB) per node: internal memory layout, various managers, and a thread pool serving task slots 1–12
• Driver (1GB)

Flink-0.9.1:
• NodeManager (1GB); DataNode (1GB)
• One TaskManager (12GB) per node: internal memory layout, various managers, and task threads serving task slots 1–12
• JobManager (1GB)
14
Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
  • Flink is faster than other engines due to its pipelined execution
• What else for better performance?
• Conclusion
15
How to read a swimlane graph & throughput graphs
[Swimlane graph (y: tasks, x: time since job start in seconds): each line is the duration of one task; different patterns mark different stages. Here, 6 waves of 1st-stage tasks are followed by 1 wave of 2nd-stage tasks, and the two stages hardly overlap.]
[Cluster network throughput (In/Out): no network traffic during the 1st stage.]
[Cluster disk throughput (disk read/disk write).]
16
Result of sorting 80GB/node (3.2TB)

Engine                       Time (seconds)
MapReduce in Hadoop-2.7.1    2157
Tez-0.7.0                    1887
Spark-1.5.1                  2171
Flink-0.9.1                  1480

• Flink is the fastest due to its pipelined execution
• Tez and Spark do not overlap 1st and 2nd stages
• MapReduce is slow despite overlapping stages
[Swimlane graphs: 1st/2nd-stage waves for MapReduce, Tez, and Spark; Flink's lanes show its four operators (1 DataSource, 2 Partition, 3 SortPartition, 4 DataSink).]
* Map output compression turned on for Spark and Tez
17
Tez and Spark do not overlap 1st and 2nd stages
[Tez & Spark throughput graphs (network In/Out, disk read/write): (1) the 2nd stage starts, (2) the output of the 1st stage is sent, (3) disk write to HDFS occurs only after shuffling is done; the disks sit idle in between.]
[Flink throughput graphs (operators: 1 DataSource, 2 Partition, 3 SortPartition, 4 DataSink): (1) network traffic occurs from the start, (2) write to HDFS occurs right after shuffling is done.]
18
Tez does not overlap 1st and 2nd stages
• Tez has parameters to control the degree of overlap (see the snippet below)
  • tez.shuffle-vertex-manager.min-src-fraction : 0.2
  • tez.shuffle-vertex-manager.max-src-fraction : 0.4
• However, the 2nd stage is scheduled early but launched late
[Swimlane graph: 2nd-stage tasks appear as "scheduled" well before they are actually "launched".]
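For reference, one way (an assumption about usage; the fractions can equally be set in tez-site.xml) to set these two properties programmatically on a Hadoop Configuration:

```scala
import org.apache.hadoop.conf.Configuration

object TezOverlapConf {
  // Ask Tez to schedule 2nd-stage tasks once 20% of 1st-stage tasks have
  // finished, and to have all of them scheduled by 40%.
  def apply(): Configuration = {
    val conf = new Configuration()
    conf.setFloat("tez.shuffle-vertex-manager.min-src-fraction", 0.2f)
    conf.setFloat("tez.shuffle-vertex-manager.max-src-fraction", 0.4f)
    conf
  }
}
```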
19
Spark does not overlap 1st and 2nd stages
• Spark cannot execute multiple stages simultaneously
• Also mentioned in the following VLDB paper (2015):
  "Spark doesn't support the overlap between shuffle write and read stages. … Spark may want to support this overlap in the future to improve performance."
• Experimental results of that paper:
  • Spark is faster than MapReduce for WordCount, K-means, PageRank
  • MapReduce is faster than Spark for Sort
20
MapReduce is slow despite overlapping stages
• mapreduce.job.reduce.slowstart.completedMaps : [0.0, 1.0] (see the snippet below)
  • 0.05 (overlapping, default): 2157 sec
  • 0.95 (no overlapping): 2385 sec
  • Overlapping brings only a 10% improvement
• Wang's attempt to overlap Spark stages
  • Wang proposes overlapping stages to achieve better utilization
• Why do Spark & MapReduce improve by just 10%?
[Swimlane graphs for the two slowstart settings, showing the 1st and 2nd stages.]
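A small sketch (an assumption about usage, not from the talk) of toggling the overlap for a MapReduce job through this property:

```scala
import org.apache.hadoop.conf.Configuration

object SlowstartConf {
  // overlap = true  -> reducers start after 5% of maps finish (the default)
  // overlap = false -> reducers start only after 95% of maps finish
  def apply(overlap: Boolean): Configuration = {
    val conf = new Configuration()
    conf.setFloat("mapreduce.job.reduce.slowstart.completedMaps",
                  if (overlap) 0.05f else 0.95f)
    conf
  }
}
```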
21
Data transfer between tasks of different stages
Traditional pull model (used in MapReduce, Spark, Tez):
• The producer task (1) writes its output file (partitions P1 … Pn) to disk; a consumer task (2) requests its partition from the shuffle server, which (3) sends it
• Extra disk access & simultaneous disk access
• Shuffling affects the performance of producers
• This is why overlapping leads to only a 10% improvement
Pipelined data transfer (used in Flink):
• Data transfer from memory to memory
• Flink causes fewer disk accesses during shuffling
(A toy illustration of the two models follows below.)
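A toy illustration of the two transfer models (illustration only, not engine code; an in-memory buffer stands in for the producer's output file):

```scala
import java.util.concurrent.ArrayBlockingQueue

object TransferModels {
  // Pull model: the producer materializes all of its output first; only then
  // can a consumer request and read it back.
  def pullModel(records: Seq[String]): Seq[String] = {
    val outputFile = scala.collection.mutable.ArrayBuffer.empty[String]
    records.foreach(outputFile += _) // (1) write output to "disk"
    outputFile.toSeq                 // (2) request + (3) send, after the fact
  }

  // Pipelined transfer: small blocks flow through a bounded in-memory queue
  // while the producer is still running.
  def pipelined(records: Seq[String]): Seq[String] = {
    val queue = new ArrayBlockingQueue[String](4) // small in-flight buffer
    val producer = new Thread(() => records.foreach(r => queue.put(r)))
    producer.start()
    val received = records.indices.map(_ => queue.take())
    producer.join()
    received
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 10).map(i => s"record-$i")
    assert(pullModel(data) == data && pipelined(data) == data)
    println("same records either way; only the path (disk vs memory) differs")
  }
}
```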
22
Flink causes fewer disk accesses during shuffling

                        MapReduce   Flink   Diff.
Total disk write (TB)   9.9         6.5     3.4
Total disk read (TB)    8.1         6.9     1.2

• The difference comes from shuffling
• Shuffled data are sometimes read from the page cache
[Cluster disk throughput graphs for MapReduce and Flink; the total amount of disk read/write equals the area of the blue/green regions.]
23
Result of TeraSort with various data sizes

Node data size (GB)   Flink   Spark   MapReduce   Tez
10                    157     387     259         277
20                    350     652     555         729
40                    741     1135    1085        1709
80                    1480    2171    2157        1887
160                   3127    4927    4796        3950

(Times in seconds; the 80GB/node row is what we've seen so far.)
[Log-scale chart of the times above.]
* Map output compression turned on for Spark and Tez
24
Result of HashJoin
• 10 slave nodes
• org.apache.tez.examples.JoinDataGen
  • Small dataset: 256MB
  • Large dataset: 240GB (24GB/node)
• Result: Flink is ~2x faster than Tez and ~4x faster than Spark (a join sketch follows below)
  • Tez-0.7.0: 770 sec; Spark-1.5.1: 1538 sec; Flink-0.9.1: 378 sec
• Visit my blog for details
* No map output compression for Spark and Tez, unlike in TeraSort
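For reference, a minimal sketch of one way to express a hash join of a small and a large dataset in the Flink DataSet API (paths and tuple layout are assumptions for illustration, not the benchmark's actual job): broadcast the small side, build a hash table from it on every node, and stream the large side past it.

```scala
import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint
import org.apache.flink.api.scala._

object FlinkHashJoin {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // (key, payload) pairs; the paths are placeholders.
    val small = env.readCsvFile[(String, String)]("hdfs:///join/small")
    val large = env.readCsvFile[(String, String)]("hdfs:///join/large")

    // Broadcast the small dataset and hash-join the large one against it.
    large.join(small, JoinHint.BROADCAST_HASH_SECOND)
      .where(0).equalTo(0)
      .map { case ((k, lv), (_, sv)) => (k, lv, sv) } // flatten for output
      .writeAsCsv("hdfs:///join/output")

    env.execute("HashJoin")
  }
}
```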
25
Result of HashJoin with swimlane & throughput graphs
[Swimlane graphs (operators: 1 DataSource, 2 DataSource, 3 Join, 4 DataSink) with cluster network and disk throughput for each engine. Tez and Spark show idle periods, while Flink overlaps its 2nd and 3rd operators. Total disk read/write labels: 0.24 TB, 0.41 TB, 0.60 TB, 0.84 TB, 0.68 TB, 0.74 TB.]
26
Flink's shortcomings
• No support for map output compression
  • Small data blocks are pipelined between operators
• Only job-level fault tolerance
  • Shuffle data are not materialized
• Low disk throughput during the post-shuffling phase
27
Low disk throughput during the post-shuffling phase
• Possible reason: sorting records from small files
  • Concurrent disk access to small files → too many disk seeks → low disk throughput
• Other engines merge records from larger files than Flink does
• "Eager pipelining moves some of the sorting work from the mapper to the reducer"
  • From "MapReduce Online" (NSDI 2010)
[Post-shuffling disk throughput graphs for Flink, Tez, and MapReduce.]
28
Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
• What else for better performance?
• Conclusion
29
MR2 – another MapReduce engine
• PhD thesis: MR2: Fault Tolerant MapReduce with the Push Model
  • Developed over 3 years
• Provides the user interface of Hadoop MapReduce
  • No DAG support, no in-memory computation, no iterative computation
• Characteristics
  • Push model + fault tolerance
  • Techniques to boost HDD throughput: prefetching for mappers, preloading for reducers
30
MR2 pipeline
• 7 types of components with memory buffers
  1. Mappers & reducers: apply user-defined functions
  2. Prefetcher & preloader: eliminate concurrent disk access
  3. Sender & receiver & merger: implement MR2's push model
  • Various buffers: pass data between components without disk I/O
• Minimum disk access: 2 disk reads (R1, R2) & 2 disk writes (W1, W2)
  • +1 disk write (W3) for fault tolerance
31
Prefetcher & Mappers
• The prefetcher loads data for multiple mappers
• Mappers do not read input from disks (see the sketch below)
[Diagrams with 2 mappers per node. Hadoop MapReduce: Mapper1 and Mapper2 read their own blocks (Blk1, Blk2), so disk throughput and CPU utilization alternate over time. MR2: the prefetcher reads Blk1–Blk4 in sub-block units (Blk11, Blk12, …) ahead of the mappers, keeping both disk throughput and CPU utilization high.]
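The prefetching idea amounts to double buffering: while a mapper processes block N, block N+1 is already being read. A toy sketch of that pattern (illustrative only; MR2's actual prefetcher is not public):

```scala
import java.util.concurrent.{Callable, Executors, Future}

object PrefetchSketch {
  def readBlock(i: Int): Array[Byte] = {
    Thread.sleep(50)             // stand-in for a disk read
    Array.fill(1024)(i.toByte)
  }

  def process(block: Array[Byte]): Unit =
    Thread.sleep(50)             // stand-in for map-side CPU work

  def main(args: Array[String]): Unit = {
    val pool = Executors.newSingleThreadExecutor()
    def prefetch(i: Int): Future[Array[Byte]] =
      pool.submit(new Callable[Array[Byte]] {
        def call(): Array[Byte] = readBlock(i)
      })

    val numBlocks = 4
    var next = prefetch(0)
    for (i <- 0 until numBlocks) {
      val current = next.get()                      // usually already loaded
      if (i + 1 < numBlocks) next = prefetch(i + 1) // read ahead
      process(current)             // CPU work overlaps the next disk read
    }
    pool.shutdown()
  }
}
```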
32
Push model in MR2
• Node-to-node network connections for pushing data
  • To reduce the number of network connections
• Data transfer from memory buffers (similar to Flink's pipelined execution)
  • Mappers store spills in a send buffer
  • Spills are pushed to the reducer side by the sender
  • MR2 does local sorting before pushing data (similar to Spark)
• Fault tolerance (can be turned on/off)
  • The input ranges of each spill are known to the master, so spills can be reproduced
  • Spills are stored on disk for fast recovery (extra disk write)
33
Receiver & merger & preloader & reducer
• The merger produces a file from different partitions' data
  • It sorts each partition's data and then interleaves them
• The preloader preloads each group into the reduce buffer
  • Reducers do not read data directly from disks
• MR2 thus eliminates concurrent disk reads by reducers, thanks to the preloader
[Diagram: the receiver's managed memory holds partitions P1–P4; the merger interleaves them into groups; the preloader loads each group with 1 disk access for 4 partitions.]
34
Result of sorting 80GB/node (3.2TB) with MR2

Engine                       Time (sec)   MR2 speedup
MapReduce in Hadoop-2.7.1    2157         2.42
Tez-0.7.0                    1887         2.12
Spark-1.5.1                  2171         2.44
Flink-0.9.1                  1480         1.66
MR2                          890          -
35
Disk & network throughput
1. DataSource / Mapping: the prefetcher is effective; MR2 shows higher disk throughput
2. Partition / Shuffling: records to shuffle are generated faster in MR2
3. DataSink / Reducing: the preloader is effective; almost 2x throughput
[Cluster disk throughput (read/write) and network throughput (In/Out) graphs for Flink and MR2, annotated with phases 1–3.]
36
PUMA (PUrdue MApreduce benchmarks suite)
• Experimental results using 10 nodes
[Charts of PUMA benchmark results.]
37
Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
• What else for better performance?
• Conclusion
38
Conclusion
• Flink uses pipelined execution for both batch and streaming processing
• It is even better than the other batch processing engines for TeraSort & HashJoin
• Shortcomings due to pipelined execution:
  • No fine-grained fault tolerance
  • No map output compression
  • Low disk throughput during the post-shuffling phase
39
Thank you!
Any questions?