A Deeper Understanding of Spark’s Internals
Aaron Davidson
07/01/2014

TRANSCRIPT

Page 1: A Deeper Understanding of Spark’s Internals
Aaron Davidson, 07/01/2014

Page 2: This Talk

• Goal: Understand how Spark runs, with a focus on performance
• Major core components:
  – Execution Model
  – The Shuffle
  – Caching


Pages 4–11: Why understand internals?

Goal: Find the number of distinct names per "first letter"

sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues(names => names.toSet.size)
  .collect()

These slides build up the same example step by step, tracing the data through each operation:

Input: Andy, Pat, Ahir
After map(): (A, Andy), (P, Pat), (A, Ahir)
After groupByKey(): (A, [Ahir, Andy]), (P, [Pat])
Inside mapValues(), after toSet: (A, Set(Ahir, Andy)), (P, Set(Pat))
After mapValues(names => names.toSet.size): (A, 2), (P, 1)
After collect(): res0 = [(A, 2), (P, 1)]

Page 12: Spark Execution Model

1. Create a DAG of RDDs to represent the computation
2. Create a logical execution plan for the DAG
3. Schedule and execute individual tasks
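
As a quick way to inspect the DAG that Spark builds before it plans stages, you can print an RDD's lineage with toDebugString. This is a minimal sketch, assuming a local SparkContext and the same hdfs:/names input as the slides:

import org.apache.spark.{SparkConf, SparkContext}

// Build the example RDD chain and print its lineage (the DAG of RDDs).
val conf = new SparkConf().setAppName("dag-inspect").setMaster("local[*]")
val sc = new SparkContext(conf)

val counts = sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues(names => names.toSet.size)

// toDebugString lists each RDD in the lineage; indentation marks shuffle boundaries.
println(counts.toDebugString)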

Page 13: Step 1: Create RDDs

sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues(names => names.toSet.size)
  .collect()

Page 14: Step 1: Create RDDs

HadoopRDD → map() → groupBy() → mapValues() → collect()

Pages 15–16: Step 2: Create execution plan

• Pipeline as much as possible
• Split into "stages" based on the need to reorganize data

Stage 1: HadoopRDD → map()
Stage 2: groupBy() → mapValues() → collect()

The diagram overlays the example data on the plan: Andy, Pat, Ahir → (A, Andy), (P, Pat), (A, Ahir) → (A, [Ahir, Andy]), (P, [Pat]) → (A, 2), (P, 1) → res0 = [(A, 2), (P, 1)].
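
One way to see where Spark will cut a stage boundary is to look at each RDD's dependencies: narrow dependencies (such as the OneToOneDependency produced by map) are pipelined into the same stage, while the ShuffleDependency introduced by groupByKey starts a new stage. A small illustrative sketch, assuming the same sc as above:

val mapped = sc.textFile("hdfs:/names").map(name => (name.charAt(0), name))
val grouped = mapped.groupByKey()

// map() keeps a narrow dependency on its parent, so it pipelines into the same stage.
println(mapped.dependencies)   // e.g. List(org.apache.spark.OneToOneDependency@...)

// groupByKey() introduces a ShuffleDependency, which is where the stage boundary goes.
println(grouped.dependencies)  // e.g. List(org.apache.spark.ShuffleDependency@...)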

Page 17: A deeper-understanding-of-spark-internals-aaron-davidson

•  Split each stage into tasks •  A task is data + computation •  Execute all tasks within a stage before

moving on

Step 3: Schedule tasks

Page 18: Step 3: Schedule tasks

[Figure] Each task pairs a piece of data with the stage's computation:

Task 0: hdfs:/names/0.gz → HadoopRDD → map()
Task 1: hdfs:/names/1.gz → HadoopRDD → map()
Task 2: hdfs:/names/2.gz → HadoopRDD → map()
Task 3: hdfs:/names/3.gz → HadoopRDD → map()
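
The number of tasks in this stage equals the number of partitions of the input RDD; gzipped files are not splittable, so here that is one partition per file. A small sketch, assuming the same sc as above, to check the count:

val names = sc.textFile("hdfs:/names")

// One task per partition; for .gz inputs this is one partition per file.
println(s"partitions = ${names.partitions.length}")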

Pages 19–28: Step 3: Schedule tasks

[Figure, animated across these slides] Three HDFS nodes hold the input blocks: one node has /names/0.gz and /names/3.gz, one has /names/1.gz and /names/2.gz, and one has /names/2.gz and /names/3.gz. Over time, each node runs HadoopRDD → map() tasks against the blocks it holds locally (0.gz, 1.gz, 2.gz), and the remaining block (/names/3.gz) is processed once a node holding a copy of it becomes free.

Page 29: The Shuffle

Stage 1: HadoopRDD → map()
Stage 2: groupBy() → mapValues() → collect()

The shuffle is the boundary between Stage 1 and Stage 2.

Page 30: The Shuffle

• Redistributes data among partitions
• Hash keys into buckets
• Optimizations:
  – Avoided when possible, if data is already properly partitioned
  – Partial aggregation reduces data movement
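
To illustrate the first optimization: a shuffle can be skipped when the data is already partitioned the way the next operation needs it. A hedged sketch, assuming the same sc as above (the partition count 6 and the variable names are only illustrative): after partitionBy with a HashPartitioner, a groupByKey on the same key reuses that partitioning instead of reshuffling.

import org.apache.spark.HashPartitioner

val pairs = sc.textFile("hdfs:/names").map(name => (name.charAt(0), name))

// The shuffle happens once, here, to lay the data out by key.
val partitioned = pairs.partitionBy(new HashPartitioner(6)).cache()

// groupByKey() finds a matching partitioner on its parent and does not shuffle again.
val grouped = partitioned.groupByKey()
println(grouped.partitioner == partitioned.partitioner)  // expected: true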

Page 31: The Shuffle

• Pull-based, not push-based
• Write intermediate files to disk

[Figure] Stage 1 writes its shuffle output to disk; Stage 2 pulls it from there.
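
Those intermediate files live under the executors' local scratch directories. A hedged configuration sketch (the path is only an illustrative placeholder; spark.shuffle.compress defaults to true):

import org.apache.spark.SparkConf

// Shuffle map outputs are written to local disk and later pulled by the reduce side.
val shuffleConf = new SparkConf()
  .set("spark.local.dir", "/mnt/spark-local")  // illustrative path; default is /tmp
  .set("spark.shuffle.compress", "true")       // compress shuffle output files (default: true)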

Page 32: Execution of a groupBy()

• Build a hash map within each partition
• Note: Can spill across keys, but a single key-value pair must fit in memory

A => [Arsalan, Aaron, Andrew, Andrew, Andy, Ahir, Ali, …], E => [Erin, Earl, Ed, …], …
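
Conceptually, the reduce side builds a per-partition map from key to a growing buffer of values, which is why one enormous key can exhaust executor memory even when other keys spill to disk. A toy, non-Spark sketch of that structure (Spark's real implementation is ExternalAppendOnlyMap, which can spill whole key groups, but a single key's buffer still lives in memory):

import scala.collection.mutable

// Toy model of the per-partition structure filled during a groupByKey().
val buffers = mutable.HashMap.empty[Char, mutable.ArrayBuffer[String]]

def insert(key: Char, value: String): Unit =
  buffers.getOrElseUpdate(key, mutable.ArrayBuffer.empty[String]) += value

Seq(('A', "Arsalan"), ('A', "Aaron"), ('E', "Erin"), ('A', "Andy")).foreach {
  case (k, v) => insert(k, v)
}
println(buffers)  // e.g. Map(E -> ArrayBuffer(Erin), A -> ArrayBuffer(Arsalan, Aaron, Andy))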

Page 33: Done!

Stage 1: HadoopRDD → map()
Stage 2: groupBy() → mapValues() → collect()

Page 34: What went wrong?

• Too few partitions to get good concurrency
• Large per-key groupBy()
• Shipped all data across the cluster

Page 35: Common issue checklist

1. Ensure enough partitions for concurrency
2. Minimize memory consumption (especially for sorting and for large keys in groupBys)
3. Minimize the amount of data shuffled
4. Know the standard library

Items 1 & 2 are about tuning the number of partitions!

Page 36: Importance of Partition Tuning

• Main issue: too few partitions
  – Less concurrency
  – More susceptible to data skew
  – Increased memory pressure for groupBy, reduceByKey, sortByKey, etc.
• Secondary issue: too many partitions
• Need a "reasonable number" of partitions
  – Commonly between 100 and 10,000 partitions
  – Lower bound: at least ~2x the number of cores in the cluster
  – Upper bound: ensure tasks take at least 100ms
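
In practice the partition count is controlled either per operation or via a cluster-wide default. A hedged sketch of both, where the value 200 and the variable names are only illustrative placeholders, not recommendations from the talk:

import org.apache.spark.{SparkConf, SparkContext}

// Default used by shuffle operations (groupByKey, reduceByKey, join, ...) when no
// explicit partition count is given.
val tunedConf = new SparkConf()
  .setAppName("partition-tuning")
  .set("spark.default.parallelism", "200")  // illustrative value only
val tunedSc = new SparkContext(tunedConf)

// Per-operation override: most shuffle operations accept a partition count directly.
val letterCounts = tunedSc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), 1))
  .reduceByKey(_ + _, 200)  // 200 partitions; illustrative value only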

Page 37: Memory Problems

• Symptoms:
  – Inexplicably bad performance
  – Inexplicable executor/machine failures (can also indicate too many shuffle files)
• Diagnosis:
  – Set spark.executor.extraJavaOptions to include:
    • -XX:+PrintGCDetails
    • -XX:+HeapDumpOnOutOfMemoryError
  – Check dmesg for oom-killer logs
• Resolution:
  – Increase spark.executor.memory
  – Increase the number of partitions
  – Re-evaluate the program structure (!)
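
For example, the diagnosis and resolution settings might be applied like this when building the SparkConf (the 4g value is only an illustrative placeholder):

import org.apache.spark.SparkConf

// Illustrative settings only; tune the memory size to your cluster.
val memConf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.executor.extraJavaOptions",
       "-XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError")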

Page 38: Fixing our mistakes

sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues { names => names.toSet.size }
  .collect()

1. Ensure enough partitions for concurrency
2. Minimize memory consumption (esp. of large groupBys and sorting)
3. Minimize data shuffle
4. Know the standard library

Page 39: Fixing our mistakes

sc.textFile("hdfs:/names")
  .repartition(6)
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues { names => names.toSet.size }
  .collect()


Page 40: Fixing our mistakes

sc.textFile("hdfs:/names")
  .repartition(6)
  .distinct()
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues { names => names.toSet.size }
  .collect()


Page 41: Fixing our mistakes

sc.textFile("hdfs:/names")
  .repartition(6)
  .distinct()
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues { names => names.size }
  .collect()


Page 42: Fixing our mistakes

sc.textFile("hdfs:/names")
  .distinct(numPartitions = 6)
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues { names => names.size }
  .collect()


Page 43: Fixing our mistakes

sc.textFile("hdfs:/names")
  .distinct(numPartitions = 6)
  .map(name => (name.charAt(0), 1))
  .reduceByKey(_ + _)
  .collect()


Page 44: Fixing our mistakes

Fixed:

sc.textFile("hdfs:/names")
  .distinct(numPartitions = 6)
  .map(name => (name.charAt(0), 1))
  .reduceByKey(_ + _)
  .collect()

Original:

sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), name))
  .groupByKey()
  .mapValues { names => names.toSet.size }
  .collect()

Page 45: Questions?