Apache Spark RDD 101

Post on 03-Aug-2015


Spark Illustrated and Illuminated

Tony Duarte

Spark and Hadoop Training

tony@sparkInternals.com

www.sparkInternals.com

(650) 223-3397

Copyright 2014 Tony Duarte

newRDD = myRDD.map(myfunc)


What is an RDD?

[Diagram: myRDD (an RDD) referencing several Partition objects, each held in memory on a cluster node and backed by an Array of the underlying data.]

Some RDD Characteristics

• An RDD holds references to Partition objects

• Each Partition object references a subset of your data

• Partitions are assigned to nodes on your cluster

• Each partition/split will be in RAM (by default)

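The structure above can be sketched in a few lines of Python. This is an illustrative model of the slide's picture, not real Spark source: an RDD object holding references to Partition objects, each of which references a subset of the data in memory.

```python
# A minimal sketch (assumed names, not Spark's implementation) of an RDD
# holding references to Partition objects, each covering a subset of the data.

class Partition:
    def __init__(self, index, data):
        self.index = index  # which slice of the dataset this partition is
        self.data = data    # the subset of records this partition references


class RDD:
    def __init__(self, partitions):
        self.partitions = partitions  # references to Partition objects

    @classmethod
    def from_list(cls, values, num_partitions):
        # Split the input array into roughly equal partitions.
        size = -(-len(values) // num_partitions)  # ceiling division
        parts = [Partition(i, values[i * size:(i + 1) * size])
                 for i in range(num_partitions)]
        return cls(parts)

    def collect(self):
        # Gather the data back from every partition.
        return [x for p in self.partitions for x in p.data]


myRDD = RDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=3)
print([p.data for p in myRDD.partitions])  # → [[1, 2], [3, 4], [5, 6]]
```

In real Spark, each partition would additionally carry placement information so the scheduler can assign it to a cluster node, with the data kept in RAM by default.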


What happens? newRDD = myRDD.map(myfunc)

[Diagram: calling map() on myRDD constructs new MappedRDD(myRDD, myfunc). The resulting newRDD (a MappedRDD) records a dependency on myRDD, and its compute() stores the operation map(myfunc); myfunc() is not applied yet.]
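The construction step above can be sketched as follows. Class names mirror the slide's diagram, not Spark's actual source: map() does no work up front, it just builds a MappedRDD that remembers its parent (the dependency) and the function to apply later.

```python
# A hedged sketch of the slide's diagram: map() only records lineage.

class RDD:
    def __init__(self, data):
        self._data = data

    def map(self, f):
        # Returns immediately -- nothing is computed here.
        return MappedRDD(self, f)

    def compute(self):
        return list(self._data)


class MappedRDD(RDD):
    def __init__(self, parent, f):
        self.parent = parent  # dependency on the parent RDD
        self.f = f            # stored operation: map(f)

    def compute(self):
        # Only now is the stored function applied to the parent's data.
        return [self.f(x) for x in self.parent.compute()]


myRDD = RDD([1, 2, 3])
newRDD = myRDD.map(lambda x: x * 10)  # instant: just records the lineage
print(newRDD.compute())               # → [10, 20, 30]
```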


After Executing: newRDD = myRDD.map(myfunc)


[Diagram: after the call, newRDD (a MappedRDD) exists alongside myRDD; myRDD still references its in-memory Partition objects, while newRDD records a dependency on myRDD and the stored operation map(myfunc).]

This architecture enables:

• You can chain operations on RDDs, and Spark will keep generating new RDDs

• Job scheduling can be lazy, since a dependency chain of operations can be submitted

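The two bullets above can be illustrated with a small sketch (assumed names, not Spark's API): each transformation returns a new RDD-like object that records its parent, so chaining operations builds a dependency chain that is only evaluated when a result is actually requested.

```python
# A minimal illustration of chained, lazily evaluated RDD-style objects.

class LazyRDD:
    def __init__(self, data=None, parent=None, op=None):
        self._data = data
        self.parent = parent  # link in the dependency chain
        self.op = op          # the deferred operation

    def map(self, f):
        # Each transformation just generates a new RDD-like node.
        return LazyRDD(parent=self, op=f)

    def lineage(self):
        # Walk the dependency chain back to the source.
        node, chain = self, []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

    def collect(self):
        # The whole chain is evaluated only when results are demanded.
        if self.parent is None:
            return list(self._data)
        return [self.op(x) for x in self.parent.collect()]


rdd = LazyRDD(data=[1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2)
print(len(rdd.lineage()))  # → 3 (two mapped nodes plus the source)
print(rdd.collect())       # → [4, 6, 8]
```

Because nothing runs until collect(), a scheduler handed this dependency chain is free to reorder, pipeline, or recover individual steps, which is the payoff of the lazy design the slide describes.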

