Apache Spark RDD 101

Post on 03-Aug-2015


Spark Illustrated and Illuminated

Tony Duarte

Spark and Hadoop Training

tony@sparkInternals.com

www.sparkInternals.com

(650) 223-3397

Copyright 2014 Tony Duarte

newRDD = myRDD.map(myfunc)


What is an RDD?

[Diagram: myRDD (an RDD) referencing several Partition objects, each held in memory on a cluster node and backed by an Array of the underlying data.]

Some RDD Characteristics

• An RDD holds references to Partition objects

• Each Partition object references a subset of your data

• Partitions are assigned to nodes on your cluster

• Each partition/split will be in RAM (by default)

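The structure above can be sketched in a few lines of Python. This is an illustrative model of the slide's picture, not real Spark source: an RDD object holding references to Partition objects, each of which references a subset of the data in memory.

```python
# A minimal sketch (assumed names, not Spark's implementation) of an RDD
# holding references to Partition objects, each covering a subset of the data.

class Partition:
    def __init__(self, index, data):
        self.index = index  # which slice of the dataset this partition is
        self.data = data    # the subset of records this partition references


class RDD:
    def __init__(self, partitions):
        self.partitions = partitions  # references to Partition objects

    @classmethod
    def from_list(cls, values, num_partitions):
        # Split the input array into roughly equal partitions.
        size = -(-len(values) // num_partitions)  # ceiling division
        parts = [Partition(i, values[i * size:(i + 1) * size])
                 for i in range(num_partitions)]
        return cls(parts)

    def collect(self):
        # Gather the data back from every partition.
        return [x for p in self.partitions for x in p.data]


myRDD = RDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=3)
print([p.data for p in myRDD.partitions])  # → [[1, 2], [3, 4], [5, 6]]
```

In real Spark, each partition would additionally carry placement information so the scheduler can assign it to a cluster node, with the data kept in RAM by default.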


What happens? newRDD = myRDD.map(myfunc)

[Diagram: calling map() on myRDD constructs new MappedRDD(myRDD, myfunc). The resulting newRDD (a MappedRDD) records a dependency on myRDD, and its compute() stores the operation map(myfunc); myfunc() is not applied yet.]
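The construction step above can be sketched as follows. Class names mirror the slide's diagram, not Spark's actual source: map() does no work up front, it just builds a MappedRDD that remembers its parent (the dependency) and the function to apply later.

```python
# A hedged sketch of the slide's diagram: map() only records lineage.

class RDD:
    def __init__(self, data):
        self._data = data

    def map(self, f):
        # Returns immediately -- nothing is computed here.
        return MappedRDD(self, f)

    def compute(self):
        return list(self._data)


class MappedRDD(RDD):
    def __init__(self, parent, f):
        self.parent = parent  # dependency on the parent RDD
        self.f = f            # stored operation: map(f)

    def compute(self):
        # Only now is the stored function applied to the parent's data.
        return [self.f(x) for x in self.parent.compute()]


myRDD = RDD([1, 2, 3])
newRDD = myRDD.map(lambda x: x * 10)  # instant: just records the lineage
print(newRDD.compute())               # → [10, 20, 30]
```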


After Executing: newRDD = myRDD.map(myfunc)


[Diagram: after the call, newRDD (a MappedRDD) exists alongside myRDD; myRDD still references its in-memory Partition objects, while newRDD records a dependency on myRDD and the stored operation map(myfunc).]

This architecture enables:

• You can chain operations on RDDs, and Spark will keep generating new RDDs

• Job scheduling can be lazy, since a dependency chain of operations can be submitted

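The two bullets above can be illustrated with a small sketch (assumed names, not Spark's API): each transformation returns a new RDD-like object that records its parent, so chaining operations builds a dependency chain that is only evaluated when a result is actually requested.

```python
# A minimal illustration of chained, lazily evaluated RDD-style objects.

class LazyRDD:
    def __init__(self, data=None, parent=None, op=None):
        self._data = data
        self.parent = parent  # link in the dependency chain
        self.op = op          # the deferred operation

    def map(self, f):
        # Each transformation just generates a new RDD-like node.
        return LazyRDD(parent=self, op=f)

    def lineage(self):
        # Walk the dependency chain back to the source.
        node, chain = self, []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

    def collect(self):
        # The whole chain is evaluated only when results are demanded.
        if self.parent is None:
            return list(self._data)
        return [self.op(x) for x in self.parent.collect()]


rdd = LazyRDD(data=[1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2)
print(len(rdd.lineage()))  # → 3 (two mapped nodes plus the source)
print(rdd.collect())       # → [4, 6, 8]
```

Because nothing runs until collect(), a scheduler handed this dependency chain is free to reorder, pipeline, or recover individual steps, which is the payoff of the lazy design the slide describes.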

