A Comparative Performance Evaluation of Apache Flink
TRANSCRIPT
Dongwon Kim, POSTECH
2
About Me
• Postdoctoral researcher @ POSTECH
• Research interests
  • Design and implementation of distributed systems
  • Performance optimization of big data processing engines
• Doctoral thesis
  • MR2: Fault Tolerant MapReduce with the Push Model
• Personal blog
  • http://eastcirclek.blogspot.kr
• Why I'm here
3
Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
• What else for better performance?
• Conclusion
4
TeraSort
• Hadoop MapReduce program for the annual terabyte sort competition
• TeraSort is essentially distributed sort (DS)
[Diagram: typical DS phases — read → local sort → shuffling → local sort → write. Records a1–a4 and b1–b4 start scattered across the disks of Node 1 and Node 2 and end up range-partitioned and sorted, so the concatenated outputs form a total order. A toy rendering follows below.]
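To make the phases concrete, here is a toy, single-process rendering of the diagram above (an illustration only, not engine code): each "node" sorts its local records, records are shuffled into key ranges by first letter, and each range is sorted again, so the concatenated partitions are totally ordered.

```scala
object DistributedSortSketch {
  def main(args: Array[String]): Unit = {
    // Local inputs on two "nodes" (record names as in the diagram)
    val node1 = Vector("a4", "b3", "a1", "b2")
    val node2 = Vector("a2", "b1", "a3", "b4")

    // read + local sort on each node
    val locallySorted = Seq(node1, node2).map(_.sorted)

    // shuffling: range-partition by first letter ("a…" -> node 1, "b…" -> node 2)
    val partitions = locallySorted.flatten.groupBy(_.head).toSeq.sortBy(_._1)

    // local sort + write per partition; the concatenation is a total order
    partitions.foreach { case (range, records) =>
      println(s"partition '$range': ${records.sorted.mkString(", ")}")
    }
  }
}
```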
5
TeraSort for MapReduce
• Included in Hadoop distributions, with TeraGen & TeraValidate
• Identity map & reduce functions
• Range partitioner built on sampling (see the sketch below)
  • Guarantees a total order & prevents partition skew
  • Sampling computes the boundary points within a few seconds
[Diagram: a Map task (read → map → sort) and a Reduce task (shuffling → sort → reduce → write) laid over the DS phases (read → local sort → shuffling → local sort → write); boundary points split the record range into Partition 1 … Partition r.]
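The idea behind the sampling-based partitioner can be sketched as follows (a minimal illustration; the real TeraSort samples the input up front, writes the boundary points to a file, and looks keys up in a trie rather than a linear scan):

```scala
object RangePartitionerSketch {
  // Pick numPartitions-1 boundary points from a sorted sample of keys.
  def boundaries(sample: Seq[String], numPartitions: Int): Seq[String] = {
    val sorted = sample.sorted
    (1 until numPartitions).map(i => sorted((i * sorted.size) / numPartitions))
  }

  // Partition index = number of boundary points <= key.
  def partition(key: String, bounds: Seq[String]): Int =
    bounds.count(_ <= key)

  def main(args: Array[String]): Unit = {
    val keys = Seq("pear", "apple", "kiwi", "fig", "plum", "date", "lime")
    val bounds = boundaries(keys, numPartitions = 3) // sample = all keys here
    val parts = keys.groupBy(partition(_, bounds))
    // Every key in partition i sorts before every key in partition i+1,
    // so sorting each partition locally yields a total order.
    println(parts.toSeq.sortBy(_._1).mkString("\n"))
  }
}
```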
6
TeraSort for Tez
• Tez can execute TeraSort for MapReduce w/o any modification
  • mapreduce.framework.name = yarn-tez
• Tez DAG plan of TeraSort for MapReduce:
[Diagram: input data → initialmap vertex (Map task: read → map → sort) → finalreduce vertex (Reduce task: shuffling → sort → reduce → write) → output data]
7
TeraSort for Spark & Flink
• My source code on GitHub:
  • https://github.com/eastcirclek/terasort
• Sampling-based range partitioner from TeraSort for MapReduce
• Visit my personal blog for a detailed explanation:
  • http://eastcirclek.blogspot.kr
8
TeraSort for Spark
• Code: two RDDs (a sketch follows below)
• Stage 0 — Shuffle-Map Task (for newAPIHadoopFile): read → sort
  • Creates a new RDD to read from HDFS (# partitions = # blocks)
• Stage 1 — Result Task (for repartitionAndSortWithinPartitions): shuffling → sort → write
  • Repartitions the parent RDD based on the user-specified partitioner
  • Writes output to HDFS
DS phases: read → local sort → shuffling → local sort → write
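A minimal Spark sketch of these two stages (assumptions: TeraInputFormat/TeraOutputFormat come from the Hadoop examples jar, and Spark's built-in sampling RangePartitioner stands in for the partitioner ported from TeraSort for MapReduce; the speaker's actual code is in the GitHub repo above):

```scala
import org.apache.hadoop.examples.terasort.{TeraInputFormat, TeraOutputFormat}
import org.apache.hadoop.io.Text
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

object SparkTeraSort {
  def main(args: Array[String]): Unit = {
    val Array(in, out, parts) = args
    val sc = new SparkContext(new SparkConf().setAppName("TeraSort"))

    // String order stands in for raw byte order, to keep the sketch short.
    implicit val keyOrder: Ordering[Text] = Ordering.by((t: Text) => t.toString)

    // Stage 0 (Shuffle-Map): one partition per HDFS block; clone each Text
    // because Hadoop record readers reuse the same objects.
    val records = sc
      .newAPIHadoopFile[Text, Text, TeraInputFormat](in)
      .map { case (k, v) => (new Text(k), new Text(v)) }

    // Stage 1 (Result): repartition by sampled key ranges, sort within each
    // partition during the shuffle, and write straight back to HDFS.
    records
      .repartitionAndSortWithinPartitions(
        new RangePartitioner(parts.toInt, records))
      .saveAsNewAPIHadoopFile[TeraOutputFormat](out)
  }
}
```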
9
TeraSort for Flink
• Code: a pipeline consisting of four operators (a sketch follows below)
• DataSource: create a dataset to read tuples from HDFS
• Partition: partition tuples
• SortPartition: sort tuples of each partition
• DataSink: write output to HDFS
• No map-side sorting, due to pipelined execution
DS phases: read → shuffling → local sort → write
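A minimal Flink sketch of the four-operator pipeline (assumptions: text-line records with a 10-character key prefix, and a toy first-byte partitioner in place of the real sampling-based one; see the GitHub repo above for the actual implementation):

```scala
import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala._

object FlinkTeraSort {
  // Toy range partitioner on the key's first byte; the real job derives
  // boundary points by sampling, as in TeraSort for MapReduce.
  class FirstByteRange extends Partitioner[String] {
    override def partition(key: String, numPartitions: Int): Int = {
      val b = if (key.isEmpty) 0 else key.charAt(0).toInt & 0xff
      (b * numPartitions) / 256
    }
  }

  def main(args: Array[String]): Unit = {
    val Array(input, output) = args
    val env = ExecutionEnvironment.getExecutionEnvironment

    env.readTextFile(input)                        // 1 DataSource
      .map(line => (line.take(10), line.drop(10))) // (key, value) tuples
      .partitionCustom(new FirstByteRange, 0)      // 2 Partition (pipelined)
      .sortPartition(0, Order.ASCENDING)           // 3 SortPartition
      .writeAsCsv(output)                          // 4 DataSink
    env.execute("TeraSort")
  }
}
```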
10
Importance of TeraSort
• Suitable for measuring the pure performance of big data engines
  • No data transformation (like map, filter) with user-defined logic
  • Only the basic facilities of each engine are used
• "Winning the sort benchmark" is a great means of PR
11
Outline
• TeraSort for various engines
• Experimental setup
  • Machine specification
  • Node configuration
• Results & analysis
• What else for better performance?
• Conclusion
12
Machine specification (42 identical machines)
DELL PowerEdge R610
• CPU: two Intel Xeon X5650 processors (12 cores total)
• Memory: 24GB total
• Disk: 6 disks × 500GB/disk
• Network: 10 Gigabit Ethernet

             My machines            Spark team's machines
Processor    Intel Xeon X5650       Intel Xeon E5-2670
             (Q1 2010)              (Q1 2012)
Cores        6 × 2 processors       8 × 4 processors
Memory       24GB                   244GB
Disks        6 HDDs                 8 SSDs

Results can be different on newer machines.
13
Node configuration (24GB on each node)
• Total 2GB for daemons on every node: NodeManager (1GB) + DataNode (1GB)
• At most 12 simultaneous tasks per node

MapReduce-2.7.1 / Tez-0.7.0:
• NodeManager (1GB) hosting the Shuffle Service; DataNode (1GB)
• Map tasks (1GB each) and Reduce tasks (1GB each) fill the remaining memory (labeled 12GB for MapReduce and 13GB for Tez in the diagram)

Spark-1.5.1:
• NodeManager (1GB); DataNode (1GB)
• One Executor (12GB) per node: internal memory layout, various managers, and a thread pool serving task slots 1–12
• Driver (1GB)

Flink-0.9.1:
• NodeManager (1GB); DataNode (1GB)
• One TaskManager (12GB) per node: internal memory layout, various managers, and task threads serving task slots 1–12
• JobManager (1GB)
14
Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
  • Flink is faster than other engines due to its pipelined execution
• What else for better performance?
• Conclusion
15
How to read a swimlane graph & throughput graphs
[Swimlane graph (y: tasks, x: time since job start in seconds): each line is the duration of one task; different patterns mark different stages. Here, 6 waves of 1st-stage tasks are followed by 1 wave of 2nd-stage tasks, and the two stages hardly overlap.]
[Cluster network throughput (In/Out): no network traffic during the 1st stage.]
[Cluster disk throughput (disk read/disk write).]
16
Result of sorting 80GB/node (3.2TB)

Engine                       Time (seconds)
MapReduce in Hadoop-2.7.1    2157
Tez-0.7.0                    1887
Spark-1.5.1                  2171
Flink-0.9.1                  1480

• Flink is the fastest due to its pipelined execution
• Tez and Spark do not overlap 1st and 2nd stages
• MapReduce is slow despite overlapping stages
[Swimlane graphs: 1st/2nd-stage waves for MapReduce, Tez, and Spark; Flink's lanes show its four operators (1 DataSource, 2 Partition, 3 SortPartition, 4 DataSink).]
* Map output compression turned on for Spark and Tez
17
Tez and Spark do not overlap 1st and 2nd stages
[Tez & Spark throughput graphs (network In/Out, disk read/write): (1) the 2nd stage starts, (2) the output of the 1st stage is sent, (3) disk write to HDFS occurs only after shuffling is done; the disks sit idle in between.]
[Flink throughput graphs (operators: 1 DataSource, 2 Partition, 3 SortPartition, 4 DataSink): (1) network traffic occurs from the start, (2) write to HDFS occurs right after shuffling is done.]
18
Tez does not overlap 1st and 2nd stages
• Tez has parameters to control the degree of overlap (see the snippet below)
  • tez.shuffle-vertex-manager.min-src-fraction : 0.2
  • tez.shuffle-vertex-manager.max-src-fraction : 0.4
• However, the 2nd stage is scheduled early but launched late
[Swimlane graph: 2nd-stage tasks appear as "scheduled" well before they are actually "launched".]
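For reference, one way (an assumption about usage; the fractions can equally be set in tez-site.xml) to set these two properties programmatically on a Hadoop Configuration:

```scala
import org.apache.hadoop.conf.Configuration

object TezOverlapConf {
  // Ask Tez to schedule 2nd-stage tasks once 20% of 1st-stage tasks have
  // finished, and to have all of them scheduled by 40%.
  def apply(): Configuration = {
    val conf = new Configuration()
    conf.setFloat("tez.shuffle-vertex-manager.min-src-fraction", 0.2f)
    conf.setFloat("tez.shuffle-vertex-manager.max-src-fraction", 0.4f)
    conf
  }
}
```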
19
Spark does not overlap 1st and 2nd stages
• Spark cannot execute multiple stages simultaneously
• Also mentioned in the following VLDB paper (2015):
  "Spark doesn't support the overlap between shuffle write and read stages. … Spark may want to support this overlap in the future to improve performance."
• Experimental results of that paper:
  • Spark is faster than MapReduce for WordCount, K-means, PageRank
  • MapReduce is faster than Spark for Sort
20
MapReduce is slow despite overlapping stages
• mapreduce.job.reduce.slowstart.completedMaps : [0.0, 1.0] (see the snippet below)
  • 0.05 (overlapping, default): 2157 sec
  • 0.95 (no overlapping): 2385 sec
  • Overlapping brings only a 10% improvement
• Wang's attempt to overlap Spark stages
  • Wang proposes overlapping stages to achieve better utilization
• Why do Spark & MapReduce improve by just 10%?
[Swimlane graphs for the two slowstart settings, showing the 1st and 2nd stages.]
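A small sketch (an assumption about usage, not from the talk) of toggling the overlap for a MapReduce job through this property:

```scala
import org.apache.hadoop.conf.Configuration

object SlowstartConf {
  // overlap = true  -> reducers start after 5% of maps finish (the default)
  // overlap = false -> reducers start only after 95% of maps finish
  def apply(overlap: Boolean): Configuration = {
    val conf = new Configuration()
    conf.setFloat("mapreduce.job.reduce.slowstart.completedMaps",
                  if (overlap) 0.05f else 0.95f)
    conf
  }
}
```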
21
Data transfer between tasks of different stages
Traditional pull model (used in MapReduce, Spark, Tez):
• The producer task (1) writes its output file (partitions P1 … Pn) to disk; a consumer task (2) requests its partition from the shuffle server, which (3) sends it
• Extra disk access & simultaneous disk access
• Shuffling affects the performance of producers
• This is why overlapping leads to only a 10% improvement
Pipelined data transfer (used in Flink):
• Data transfer from memory to memory
• Flink causes fewer disk accesses during shuffling
(A toy illustration of the two models follows below.)
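A toy illustration of the two transfer models (illustration only, not engine code; an in-memory buffer stands in for the producer's output file):

```scala
import java.util.concurrent.ArrayBlockingQueue

object TransferModels {
  // Pull model: the producer materializes all of its output first; only then
  // can a consumer request and read it back.
  def pullModel(records: Seq[String]): Seq[String] = {
    val outputFile = scala.collection.mutable.ArrayBuffer.empty[String]
    records.foreach(outputFile += _) // (1) write output to "disk"
    outputFile.toSeq                 // (2) request + (3) send, after the fact
  }

  // Pipelined transfer: small blocks flow through a bounded in-memory queue
  // while the producer is still running.
  def pipelined(records: Seq[String]): Seq[String] = {
    val queue = new ArrayBlockingQueue[String](4) // small in-flight buffer
    val producer = new Thread(() => records.foreach(r => queue.put(r)))
    producer.start()
    val received = records.indices.map(_ => queue.take())
    producer.join()
    received
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 10).map(i => s"record-$i")
    assert(pullModel(data) == data && pipelined(data) == data)
    println("same records either way; only the path (disk vs memory) differs")
  }
}
```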
22
Flink causes fewer disk accesses during shuffling

                        MapReduce   Flink   Diff.
Total disk write (TB)   9.9         6.5     3.4
Total disk read (TB)    8.1         6.9     1.2

• The difference comes from shuffling
• Shuffled data are sometimes read from the page cache
[Cluster disk throughput graphs for MapReduce and Flink; the total amount of disk read/write equals the area of the blue/green regions.]
23
Result of TeraSort with various data sizes

Node data size (GB)   Flink   Spark   MapReduce   Tez
10                    157     387     259         277
20                    350     652     555         729
40                    741     1135    1085        1709
80                    1480    2171    2157        1887
160                   3127    4927    4796        3950

(Times in seconds; the 80GB/node row is what we've seen so far.)
[Log-scale chart of the times above.]
* Map output compression turned on for Spark and Tez
24
Result of HashJoin
• 10 slave nodes
• org.apache.tez.examples.JoinDataGen
  • Small dataset: 256MB
  • Large dataset: 240GB (24GB/node)
• Result: Flink is ~2x faster than Tez and ~4x faster than Spark (a join sketch follows below)
  • Tez-0.7.0: 770 sec; Spark-1.5.1: 1538 sec; Flink-0.9.1: 378 sec
• Visit my blog for details
* No map output compression for Spark and Tez, unlike in TeraSort
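For reference, a minimal sketch of one way to express a hash join of a small and a large dataset in the Flink DataSet API (paths and tuple layout are assumptions for illustration, not the benchmark's actual job): broadcast the small side, build a hash table from it on every node, and stream the large side past it.

```scala
import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint
import org.apache.flink.api.scala._

object FlinkHashJoin {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // (key, payload) pairs; the paths are placeholders.
    val small = env.readCsvFile[(String, String)]("hdfs:///join/small")
    val large = env.readCsvFile[(String, String)]("hdfs:///join/large")

    // Broadcast the small dataset and hash-join the large one against it.
    large.join(small, JoinHint.BROADCAST_HASH_SECOND)
      .where(0).equalTo(0)
      .map { case ((k, lv), (_, sv)) => (k, lv, sv) } // flatten for output
      .writeAsCsv("hdfs:///join/output")

    env.execute("HashJoin")
  }
}
```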
25
Result of HashJoin with swimlane & throughput graphs
[Swimlane graphs (operators: 1 DataSource, 2 DataSource, 3 Join, 4 DataSink) with cluster network and disk throughput for each engine. Tez and Spark show idle periods, while Flink overlaps its 2nd and 3rd operators. Total disk read/write labels: 0.24 TB, 0.41 TB, 0.60 TB, 0.84 TB, 0.68 TB, 0.74 TB.]
26
Flink's shortcomings
• No support for map output compression
  • Small data blocks are pipelined between operators
• Only job-level fault tolerance
  • Shuffle data are not materialized
• Low disk throughput during the post-shuffling phase
27
Low disk throughput during the post-shuffling phase
• Possible reason: sorting records from small files
  • Concurrent disk access to small files → too many disk seeks → low disk throughput
• Other engines merge records from larger files than Flink does
• "Eager pipelining moves some of the sorting work from the mapper to the reducer"
  • From "MapReduce Online" (NSDI 2010)
[Post-shuffling disk throughput graphs for Flink, Tez, and MapReduce.]
28
Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
• What else for better performance?
• Conclusion
29
MR2 – another MapReduce engine
• PhD thesis: MR2: Fault Tolerant MapReduce with the Push Model
  • Developed over 3 years
• Provides the user interface of Hadoop MapReduce
  • No DAG support, no in-memory computation, no iterative computation
• Characteristics
  • Push model + fault tolerance
  • Techniques to boost HDD throughput: prefetching for mappers, preloading for reducers
30
MR2 pipeline
• 7 types of components with memory buffers
  1. Mappers & reducers: apply user-defined functions
  2. Prefetcher & preloader: eliminate concurrent disk access
  3. Sender & receiver & merger: implement MR2's push model
  • Various buffers: pass data between components without disk I/O
• Minimum disk access: 2 disk reads (R1, R2) & 2 disk writes (W1, W2)
  • +1 disk write (W3) for fault tolerance
31
Prefetcher & Mappers
• The prefetcher loads data for multiple mappers
• Mappers do not read input from disks (see the sketch below)
[Diagrams with 2 mappers per node. Hadoop MapReduce: Mapper1 and Mapper2 read their own blocks (Blk1, Blk2), so disk throughput and CPU utilization alternate over time. MR2: the prefetcher reads Blk1–Blk4 in sub-block units (Blk11, Blk12, …) ahead of the mappers, keeping both disk throughput and CPU utilization high.]
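The prefetching idea amounts to double buffering: while a mapper processes block N, block N+1 is already being read. A toy sketch of that pattern (illustrative only; MR2's actual prefetcher is not public):

```scala
import java.util.concurrent.{Callable, Executors, Future}

object PrefetchSketch {
  def readBlock(i: Int): Array[Byte] = {
    Thread.sleep(50)             // stand-in for a disk read
    Array.fill(1024)(i.toByte)
  }

  def process(block: Array[Byte]): Unit =
    Thread.sleep(50)             // stand-in for map-side CPU work

  def main(args: Array[String]): Unit = {
    val pool = Executors.newSingleThreadExecutor()
    def prefetch(i: Int): Future[Array[Byte]] =
      pool.submit(new Callable[Array[Byte]] {
        def call(): Array[Byte] = readBlock(i)
      })

    val numBlocks = 4
    var next = prefetch(0)
    for (i <- 0 until numBlocks) {
      val current = next.get()                      // usually already loaded
      if (i + 1 < numBlocks) next = prefetch(i + 1) // read ahead
      process(current)             // CPU work overlaps the next disk read
    }
    pool.shutdown()
  }
}
```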
32
Push model in MR2
• Node-to-node network connections for pushing data
  • To reduce the number of network connections
• Data transfer from memory buffers (similar to Flink's pipelined execution)
  • Mappers store spills in a send buffer
  • Spills are pushed to the reducer side by the sender
  • MR2 does local sorting before pushing data (similar to Spark)
• Fault tolerance (can be turned on/off)
  • The input ranges of each spill are known to the master, so spills can be reproduced
  • Spills are stored on disk for fast recovery (extra disk write)
33
Receiver & merger & preloader & reducer
• The merger produces a file from different partitions' data
  • It sorts each partition's data and then interleaves them
• The preloader preloads each group into the reduce buffer
  • Reducers do not read data directly from disks
• MR2 thus eliminates concurrent disk reads by reducers, thanks to the preloader
[Diagram: the receiver's managed memory holds partitions P1–P4; the merger interleaves them into groups; the preloader loads each group with 1 disk access for 4 partitions.]
34
Result of sorting 80GB/node (3.2TB) with MR2

Engine                       Time (sec)   MR2 speedup
MapReduce in Hadoop-2.7.1    2157         2.42
Tez-0.7.0                    1887         2.12
Spark-1.5.1                  2171         2.44
Flink-0.9.1                  1480         1.66
MR2                          890          -
35
Disk & network throughput
1. DataSource / Mapping: the prefetcher is effective; MR2 shows higher disk throughput
2. Partition / Shuffling: records to shuffle are generated faster in MR2
3. DataSink / Reducing: the preloader is effective; almost 2x throughput
[Cluster disk throughput (read/write) and network throughput (In/Out) graphs for Flink and MR2, annotated with phases 1–3.]
36
PUMA (PUrdue MApreduce benchmarks suite)
• Experimental results using 10 nodes
[Charts of PUMA benchmark results.]
37
Outline
• TeraSort for various engines
• Experimental setup
• Results & analysis
• What else for better performance?
• Conclusion
38
Conclusion
• Flink uses pipelined execution for both batch and streaming processing
• It is even better than the other batch processing engines for TeraSort & HashJoin
• Shortcomings due to pipelined execution:
  • No fine-grained fault tolerance
  • No map output compression
  • Low disk throughput during the post-shuffling phase
39
Thank you!
Any questions?