Apache Spark RDD 101

Page 1: Apache Spark RDD 101

Spark Illustrated and Illuminated

Tony Duarte

Spark and Hadoop Training

[email protected]

www.sparkInternals.com

(650) 223-3397

Page 2: Apache Spark RDD 101

newRDD = myRDD.map(myfunc)
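
For concreteness, the same call with an illustrative myfunc, as it might be typed into the spark-shell (sc is the SparkContext the shell provides):

val myRDD  = sc.parallelize(Seq(1, 2, 3, 4))
def myfunc(x: Int): Int = x * 2
val newRDD = myRDD.map(myfunc)   // returns a new RDD; no job runs yet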

Page 3: Apache Spark RDD 101

What is an RDD?

[Diagram: myRDD : RDD holding references to four Partition objects, each backed by a MemoryPartition containing an Array of the data]

Some RDD Characteristics

• Holds references to Partition objects

• Each Partition object references a subset of your data

• Partitions are assigned to nodes on your cluster

• Each partition/split will be in RAM (by default)

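As a quick sketch of these characteristics in the spark-shell (the variable name is illustrative; sc is the shell's SparkContext):

val myRDD = sc.parallelize(1 to 1000, 4)        // ask for 4 partitions
println(myRDD.partitions.length)                // 4 Partition objects
myRDD.partitions.foreach(p => println(p.index)) // partition indexes 0 to 3
myRDD.cache()  // ask Spark to keep each partition in RAM on its node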

Page 4: Apache Spark RDD 101

What happens? newRDD = myRDD.map(myfunc)

[Diagram: calling map() on myRDD : RDD constructs new mappedRDD(myRDD, myfunc). The resulting newRDD : mappedRDD holds a dependency on myRDD, and its compute() stores the operation map(myfunc) so that myfunc() can be applied later.]
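
A minimal, self-contained sketch (simplified, not Spark's actual classes) of what the diagram describes: map() just builds a new object that records its parent and the function.

trait MiniRDD[T] {
  // compute() produces one partition's data; it only runs when a job executes
  def compute(split: Int): Iterator[T]
  def map[U](f: T => U): MiniRDD[U] = new MiniMappedRDD(this, f)
}

// Like the mappedRDD above: it stores a reference to its parent
// (the dependency) and the function to apply (the stored operation)
class MiniMappedRDD[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
  def compute(split: Int): Iterator[U] = parent.compute(split).map(f)
}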

Page 5: Apache Spark RDD 101

After Executing: newRDD = myRDD.map(myfunc)

[Diagram: newRDD : mappedRDD holding four Partitions, each backed by a MemoryPartition; newRDD records its dependency on myRDD : RDD and stores the operation map(myfunc)]

This architecture enables:

• You can chain operations on RDDs and Spark will keep generating new RDDs

• Job scheduling can be lazy, since a whole dependency chain of operations can be submitted at once (see the sketch below)

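The laziness this buys can be seen in the spark-shell; the names below are illustrative, and sc is the shell's SparkContext. Nothing executes until the action at the end submits the whole chain:

val myRDD   = sc.parallelize(1 to 100)
val newRDD  = myRDD.map(_ * 2)        // builds a new RDD; no work happens
val chained = newRDD.filter(_ > 50)   // still no work, just another new RDD
val result  = chained.collect()       // action: submits the dependency chain as a job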

Page 6: Apache Spark RDD 101

Spark Illustrated and Illuminated

Tony Duarte

Spark and Hadoop Training

[email protected]

www.sparkInternals.com

(650) 223-3397