
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica

NSDI '12: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation

Papers We Love Amsterdam, August 13, 2015

@gabriele_modena

(c) Presentation by Gabriele Modena, 2015

About me

• CS, ML
• Data science & predictive modelling
• with a sprinkle of systems work
• Hadoop & co. for data wrangling & crunching numbers
• … and Spark

"We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools."

How

• Review (concepts from) key related work
• RDD + Spark
• Some critiques

Related work

• MapReduce
• Dryad
• Hadoop Distributed FileSystem (HDFS)
• Mesos

What's an iterative algorithm anyway?

data = <input data>
w = <target vector>
for i in num_iterations:
    for item in data:
        update(w)

• Multiple input scans
• At each iteration, do something
• Update a shared data structure

(A sketch of this pattern in plain code follows below.)
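As an illustration only (not from the slides), the same pattern written out in plain Scala: the full dataset is re-scanned on every iteration and a shared vector w is updated in place. The data, the number of iterations, and the toy additive update rule are all made-up placeholders.

// Illustrative sketch of the iterative pattern above (plain Scala, no framework).
object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val data: Seq[Array[Double]] = Seq(Array(1.0, 2.0), Array(3.0, 4.0)) // stand-in input data
    var w: Array[Double] = Array(0.0, 0.0)                               // shared target vector
    val numIterations = 10

    for (_ <- 0 until numIterations) {   // multiple scans over the same input
      for (item <- data) {               // at each iteration, do something per record
        // "update(w)": placeholder update rule, here a toy additive step
        w = w.zip(item).map { case (wi, xi) => wi + 0.01 * xi }
      }
    }
    println(w.mkString("[", ", ", "]"))
  }
}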

HDFS

• GFS paper (2003)
• Distributed storage (with replication)
• Block ops
• NameNode keeps track of which blocks make up a file and where those blocks are stored

[Diagram: a NameNode coordinating three DataNodes]

(A client-side read sketch follows below.)
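As a hedged illustration (not part of the slides), this is roughly how a client reads a file through the Hadoop FileSystem API: the NameNode resolves the path to block locations, and the DataNodes stream the actual bytes. The NameNode address and file path here are made up for the example.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object HdfsReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical NameNode address; the NameNode only serves metadata (block locations).
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf)
    // Hypothetical path; the file's bytes are streamed from the DataNodes holding its blocks.
    val in = fs.open(new Path("/data/input.txt"))
    try IOUtils.copyBytes(in, System.out, 4096, false)
    finally in.close()
  }
}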



MapReduce

• Google paper (2004)
• Apache Hadoop (~2007)
• Divide and conquer functional model
• Goes hand-in-hand with HDFS
• Structure data as (key, value)

1. Map(): filter and project, emit (k, v) pairs
2. Reduce(): aggregate and summarise, group by key and count

[Diagram: map tasks read blocks from HDFS, reduce tasks write results back to HDFS]

Input:  This is a test. Yes it is a test. …
Map:    (This, 1), (is, 1), (a, 1), (test., 1), (Yes, 1), (it, 1), (is, 1)
Reduce: (This, 1), (is, 2), (a, 2), (test, 2), (Yes, 1), (it, 1)

(A word-count sketch in code follows below.)
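To make the Map()/Reduce() steps concrete, here is a minimal sketch of word count written with plain Scala collections rather than the Hadoop API; the object name, the functions, and the tiny driver are illustrative only.

object WordCountSketch {
  // Map(): project each input record into (key, value) pairs.
  def map(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reduce(): group the pairs by key and aggregate the counts.
  def reduce(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val input = Seq("This is a test.", "Yes it is a test.")
    println(reduce(input.flatMap(map)))  // e.g. Map(test. -> 2, is -> 2, a -> 2, This -> 1, ...)
  }
}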

[Image from Apache Tez, http://tez.apache.org]

Critiques to MR and HDFS

• Great when records (and jobs) are independent

• In reality expect data to be shuffled across the network

• Latency measured in minutes

• Performance hit for iterative methods

• Composability monsters

• Meant for batch workflows


Dryad

• Microsoft paper (2007)
• Inspired Apache Tez
• Generalisation of MapReduce via I/O pipelining
• Applications are (directed acyclic) graphs of tasks

Dryad

// Build a two-vertex DAG (word count) and connect the vertices with an edge
DAG dag = new DAG("WordCount");

dag.addVertex(tokenizerVertex)
   .addVertex(summerVertex)
   .addEdge(new Edge(tokenizerVertex, summerVertex,
                     edgeConf.createDefaultEdgeProperty()));

MapReduce and Dryad

SELECT a.country, COUNT(b.place_id)
FROM place a
JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;

[Image from Apache Tez, http://tez.apache.org (modified)]

Critiques to Dryad

• No explicit abstraction for data sharing
• Must express data reps as DAG
• Partial solution: DryadLINQ
• No notion of a distributed filesystem
• How to handle large inputs?
• Local writes / remote reads?

Resilient Distributed Datasets

Read-only, partitioned collection of records
=> a distributed immutable array

Accessed via coarse-grained transformations
=> apply a function (a Scala closure) to all elements of the array

[Diagram: an RDD as a partitioned collection of objects]

(See the sketch below.)
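A minimal, illustrative sketch of what "coarse-grained" means in practice, assuming an existing SparkContext sc (e.g. from spark-shell): the same closure is applied to every element of the distributed, read-only collection.

val rdd = sc.parallelize(1 to 12, 3)  // read-only collection split into 3 partitions
val doubled = rdd.map(_ * 2)          // one closure applied to every element, partition by partition
doubled.collect()                     // Array(2, 4, 6, ..., 24)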


Spark

Runtime and API

• Transformations - lazily create RDDs:
  wc = dataset.flatMap(tokenize).reduceByKey(add)

• Actions - execute the computation:
  wc.collect()

(A fuller word-count sketch follows below.)
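As an illustration, the slide's snippet spelled out as a complete, hedged Spark (Scala) example; tokenize and add from the slide are written out inline, the extra map-to-pairs step is needed for reduceByKey, and the SparkContext setup and local master are assumptions for the example.

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // Local master only so the sketch runs standalone; a real deployment would set this differently.
    val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))

    val dataset = sc.textFile("hdfs://...")   // transformation: RDD of lines (path elided as on the slide)
    val wc = dataset
      .flatMap(line => line.split("\\s+"))    // "tokenize"
      .map(word => (word, 1))
      .reduceByKey(_ + _)                     // "add": sum the counts per word

    wc.collect().foreach(println)             // action: triggers the actual computation
    sc.stop()
  }
}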


Applications

• Driver code defines RDDs and invokes actions
• Submit to long-lived workers, that store partitions in memory
• Scala closures are serialised as Java objects and passed across the network over HTTP
• Variables bound to the closure are saved in the serialised object
• Closures are deserialised on each worker and applied to the RDD (partition)
• Mesos takes care of resource management

[Diagram: the Driver ships tasks to Workers; each Worker holds input-data partitions in RAM and sends results back]

(See the closure-capture sketch below.)
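To illustrate the closure-serialisation point, a minimal sketch assuming an existing RDD[String] named lines: the driver-side variable is captured by the closure, serialised along with it, and shipped to the workers.

val keyword = "ERROR"                            // driver-side variable bound by the closure below
val matching = lines.filter(_.contains(keyword)) // the closure (and `keyword`) are serialised to workers
println(matching.count())                        // action: tasks run on workers, the count returns to the driver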

Data persistence

1. in memory, as deserialised Java objects
2. in memory, as serialised data
3. on disk

RDD Checkpointing

Memory management via an LRU eviction policy

.persist() an RDD for future reuse

(See the storage-level sketch below.)
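As a hedged illustration of how the three options map onto Spark's storage levels, assuming an existing RDD named errors; an RDD's storage level can only be set once, so the alternatives are shown commented out.

import org.apache.spark.storage.StorageLevel

errors.persist(StorageLevel.MEMORY_ONLY)        // 1. deserialised Java objects in RAM (the default)
// errors.persist(StorageLevel.MEMORY_ONLY_SER) // 2. serialised data in RAM (more compact, more CPU)
// errors.persist(StorageLevel.DISK_ONLY)       // 3. on disk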


Lineage

Fault recovery: if a partition is lost, derive it back from the lineage.

lines = spark.textFile("hdfs://...")

errors = lines.filter(_.startsWith("ERROR"))

errors.persist()

errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()

[Lineage graph: lines -> filter(_.startsWith("ERROR")) -> errors -> filter(_.contains("HDFS")) -> hdfs errors -> map(_.split('\t')(3)) -> time fields]

Representation

Challenge: track lineage across transformations.

1. Partitions
2. Data locality for partition p
3. List of dependencies
4. Iterator function to compute a dataset based on its parents
5. Metadata for the partitioner scheme

(A sketch of this interface follows below.)
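A hedged Scala sketch of that five-part interface, loosely following the table in the paper; all type and method names here are illustrative placeholders, not the real Spark API.

// Placeholder types standing in for Spark's internal abstractions.
trait Partition
trait Dependency
trait Partitioner

// Illustrative sketch of the RDD representation described above.
trait RDDSketch[T] {
  def partitions: Seq[Partition]                     // 1. the set of partitions
  def preferredLocations(p: Partition): Seq[String]  // 2. data locality hints for partition p
  def dependencies: Seq[Dependency]                  // 3. dependencies on parent RDDs (lineage)
  def compute(p: Partition): Iterator[T]             // 4. compute a partition from its parents
  def partitioner: Option[Partitioner]               // 5. metadata about the partitioning scheme
}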

Narrow dependencies

Pipelined execution on one cluster node.

Examples: map, filter, union.

Wide dependencies

Require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.

Examples: groupByKey, join with inputs not co-partitioned.

(See the sketch below.)
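An illustrative contrast between the two kinds of dependencies, assuming a hypothetical pair RDD named pairs of type RDD[(String, Int)]:

// Narrow dependencies: each output partition depends on one parent partition,
// so these steps are pipelined on the same node without a shuffle.
val scaled   = pairs.map { case (k, v) => (k, v * 2) }
val filtered = scaled.filter { case (_, v) => v > 10 }

// Wide dependency: every output partition may need data from all parent
// partitions, so groupByKey shuffles records across the cluster.
val grouped = filtered.groupByKey()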

Scheduling

Tasks are allocated based on data locality (delay scheduling).

1. An action is triggered => compute the RDD
2. Based on lineage, build a graph of stages to execute
3. Each stage contains as many pipelined transformations with narrow dependencies as possible
4. Launch tasks to compute missing partitions from each stage until it has computed the target RDD
5. If a task fails => re-run it on another node, as long as its stage's parents are still available

Job execution

B = A.groupBy
D = C.map
F = D.union(E)
G = B.join(F)

G.collect()

[DAG: A -> groupBy -> B (Stage 1); C -> map -> D, then D union E -> F (Stage 2); B join F -> G (Stage 3)]

(A Spark version of this DAG is sketched below.)
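For illustration only, the same DAG expressed with concrete Spark calls, assuming a SparkContext sc; the toy datasets stand in for A, C and E, and the wide operations (groupByKey, join) mark the stage boundaries the scheduler uses.

// Toy inputs standing in for A, C, E in the figure.
val A = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val C = sc.parallelize(Seq(("a", 10), ("c", 30)))
val E = sc.parallelize(Seq(("b", 20)))

val B = A.groupByKey()                        // wide dependency: closes Stage 1
val D = C.map { case (k, v) => (k, v + 1) }   // narrow: pipelined within Stage 2
val F = D.union(E)                            // narrow: still Stage 2
val G = B.join(F)                             // wide dependency: Stage 3 computes the join
G.collect()                                   // action: triggers scheduling of all stages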


Evaluation


Some critiques (to the paper)

• How general is this approach?
• We are still doing MapReduce
• Concerns wrt iterative algorithms still stand
• CPU-bound workloads?
• Linear algebra?
• How much tuning is required?
• How does the partitioner work?
• What is the cost of reconstructing an RDD from lineage?
• Performance when data does not fit in memory
• E.g. a join between two very large non-co-partitioned RDDs

References (Theory)

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. Zaharia et al., Proceedings of NSDI '12. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Spark: cluster computing with working sets. Zaharia et al., Proceedings of HotCloud '10. http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

The Google File System. Ghemawat, Gobioff, Leung. 19th ACM Symposium on Operating Systems Principles, 2003. http://research.google.com/archive/gfs.html

MapReduce: Simplified Data Processing on Large Clusters. Dean, Ghemawat. OSDI '04: Sixth Symposium on Operating System Design and Implementation. http://research.google.com/archive/mapreduce.html

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Isard, Budiu, Yu, Birrell, Fetterly. European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007. http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf

Mesos: a platform for fine-grained resource sharing in the data center. Hindman et al., Proceedings of NSDI '11. https://www.cs.berkeley.edu/~alig/papers/mesos.pdf

References (Practice)

• An overview of the pyspark API through pictures: https://github.com/jkthompson/pyspark-pictures
• Barry Brumitt's presentation on MapReduce design patterns (UW CSE490): http://courses.cs.washington.edu/courses/cse490h/08au/lectures/MapReduceDesignPatterns-UW2.pdf
• The Dryad Project: http://research.microsoft.com/en-us/projects/dryad/
• Apache Spark: http://spark.apache.org
• Apache Hadoop: https://hadoop.apache.org
• Apache Tez: https://tez.apache.org
• Apache Mesos: http://mesos.apache.org
