Apache Spark overview

Post on 16-Apr-2017


Apache Spark

Agenda

Hadoop vs Spark: Big ‘Big Data’ question

Spark Ecosystem

What is RDD

Operations on RDD: Actions vs Transformations

Running in cluster

Task schedulers

Spark Streaming

Dataframes API

Let’s remember: MapReduce

Apache Hadoop MapReduce

Hadoop VS/AND Spark

Hadoop: DFS

Spark: Speed (RAM)

Spark ecosystem

Glossary: Job, RDD, Stages, Tasks, DAG, Executor, Driver

Simple Example

RDD: Resilient Distributed Dataset

Represents an immutable, partitioned collection of elements that can be operated on in parallel, with failure recovery.

Example: Hadoop RDD

getPartitions = HDFS blocks
getDependencies = None
compute = load block in memory
getPreferredLocations = HDFS block locations
partitioner = None

MapPartitions RDD

getPartitions = same as parent
getDependencies = parent RDD
compute = compute parent and apply map()
getPreferredLocations = same as parent
partitioner = None
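The two slides above can be condensed into a small sketch. This is not Spark's actual implementation, only a conceptual Python model of the interface every RDD exposes (partitions, dependencies, and a per-partition compute), shown for a source RDD and a mapped RDD; the class names are illustrative.

```python
# Conceptual sketch (not real Spark code) of the RDD contract from the
# HadoopRDD / MapPartitionsRDD slides: partitions, dependencies, compute.

class SourceRDD:
    """Stands in for HadoopRDD: reads fixed 'blocks' of input data."""
    def __init__(self, blocks):
        self.blocks = blocks                      # one partition per block

    def get_partitions(self):
        return list(range(len(self.blocks)))

    def get_dependencies(self):
        return []                                 # a source RDD has no parent

    def compute(self, partition):
        return list(self.blocks[partition])       # "load block in memory"

class MapPartitionsRDD:
    """Applies a user function element-wise to the parent's partitions."""
    def __init__(self, parent, fn):
        self.parent, self.fn = parent, fn

    def get_partitions(self):
        return self.parent.get_partitions()       # same as parent

    def get_dependencies(self):
        return [self.parent]                      # narrow dependency on parent

    def compute(self, partition):
        # compute the parent partition, then apply map()
        return [self.fn(x) for x in self.parent.compute(partition)]

source = SourceRDD([[1, 2], [3, 4]])
doubled = MapPartitionsRDD(source, lambda x: x * 2)
print([doubled.compute(p) for p in doubled.get_partitions()])
# [[2, 4], [6, 8]]
```

Note that `compute` pulls from the parent on demand: nothing is materialized until someone asks for a partition, which is the same laziness real RDDs rely on.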

RDD: Resilient Distributed Dataset

RDD Example

RDD Example

RDD Operations

● Transformations

○ Apply user function to every element in a partition

○ Apply aggregation function to a whole dataset (groupBy, sortBy)

○ Provide functionality for repartitioning (repartition, partitionBy)

● Actions

○ Materialize computation results (collect, count, take)

○ Store RDDs in memory or on disk (cache, persist)
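The transformation/action split above hinges on laziness: transformations only record what to do, and an action triggers the actual work. A minimal pure-Python sketch of that behavior (class and method names here are illustrative, not Spark's):

```python
# Conceptual sketch: transformations return a new dataset and compute
# nothing; actions run the recorded pipeline and materialize a result.

class LazyDataset:
    def __init__(self, data, ops=()):
        self._data, self._ops = data, ops

    # --- transformations: record the operation, compute nothing ---
    def map(self, fn):
        return LazyDataset(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + (("filter", pred),))

    # --- actions: replay the recorded operations and return a value ---
    def collect(self):
        out = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

    def count(self):
        return len(self.collect())

ds = LazyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has been computed yet; collect() is the action that runs it:
print(ds.collect())   # [0, 4, 16, 36, 64]
print(ds.count())     # 5
```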

RDD Dependencies

DAG: Directed Acyclic Graph

All the operators in a job are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible.

DAG Example

DAG Scheduler

The DAG scheduler divides operators into stages of tasks. A stage comprises tasks based on partitions of the input data, and narrow operators within a stage are pipelined together.
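What "pipelining operators together" buys can be shown with a small sketch (illustrative Python, not Spark internals): consecutive narrow operations like map and filter are fused into a single pass over each partition instead of materializing an intermediate collection per operator.

```python
# Conceptual sketch of operator pipelining within a stage: fuse a chain
# of narrow per-element operations into one function, applied in a
# single pass over each partition.

def pipeline(ops):
    """Compose (kind, fn) operations into one single-pass function."""
    def run(partition):
        out = []
        for x in partition:
            keep = True
            for kind, fn in ops:
                if kind == "map":
                    x = fn(x)
                elif kind == "filter" and not fn(x):
                    keep = False       # element dropped; skip remaining ops
                    break
            if keep:
                out.append(x)
        return out
    return run

# One fused "stage": map -> filter -> map, run as a single task per partition.
stage = pipeline([("map", lambda x: x + 1),
                  ("filter", lambda x: x % 2 == 0),
                  ("map", lambda x: x * 10)])

partitions = [[1, 2, 3], [4, 5, 6]]
print([stage(p) for p in partitions])   # [[20, 40], [60]]
```

A wide operation such as groupBy cannot be fused this way, because it needs data from all partitions; that is where the scheduler cuts a stage boundary.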

DAG Scheduler example

RDD Persistence: persist() & cache()

When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).

Storage levels: MEMORY_ONLY (default), MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

Removing data: evicted automatically in least-recently-used (LRU) fashion, or explicitly with the RDD.unpersist() method.
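The eviction policy above is easy to picture with a toy block cache (a conceptual sketch, not Spark's BlockManager; names are illustrative):

```python
# Conceptual sketch of LRU eviction in a bounded block cache, plus an
# explicit unpersist, mirroring the two removal paths on the slide.
from collections import OrderedDict

class BlockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # insertion/access order = LRU order

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)            # most recently used
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)          # drop the LRU block

    def get(self, block_id):
        self.blocks.move_to_end(block_id)            # touching refreshes it
        return self.blocks[block_id]

    def unpersist(self, block_id):                   # explicit removal
        self.blocks.pop(block_id, None)

cache = BlockCache(capacity=2)
cache.put("rdd_0_part_0", [1, 2])
cache.put("rdd_0_part_1", [3, 4])
cache.get("rdd_0_part_0")            # touch: part_0 is now most recent
cache.put("rdd_1_part_0", [5, 6])    # over capacity: evicts rdd_0_part_1
print(list(cache.blocks))            # ['rdd_0_part_0', 'rdd_1_part_0']
```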

Job execution

Task Schedulers

Standalone

Default

FIFO strategy

Controls number of CPU cores and executor memory

YARN

Hadoop oriented

Takes all available resources

Was designed for stateless batch jobs that can be restarted easily if they fail.

Mesos

Resource oriented

Dynamic sharing of CPU cores

Less predictable latency

Spark Driver (application)

Running in cluster

Memory usage

• Execution memory
  • Storage for data needed during task execution
  • Shuffle-related data
• Storage memory
  • Cached RDDs
  • Possible to borrow from execution memory
• User memory
  • User data structures and internal metadata
  • Safeguarding against OOM
• Reserved memory
  • Memory needed for running the executor itself

Spark Streaming

Spark Streaming: Basic Concept

Spark Streaming: Architecture

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
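The micro-batch model described above can be sketched in a few lines of plain Python (illustrative only; real Spark Streaming cuts batches by time interval and runs them on the cluster engine):

```python
# Conceptual sketch of micro-batching: cut an incoming stream into
# small batches and process each batch with an ordinary batch
# computation, producing a stream of per-batch results.

def micro_batches(stream, batch_size):
    """Divide a stream of records into fixed-size batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                      # flush the final partial batch
        yield batch

def process(batch):
    return sum(batch)              # any ordinary batch computation

results = [process(b) for b in micro_batches(range(10), batch_size=3)]
print(results)   # [3, 12, 21, 9] -> one result per batch
```

A windowed computation is the same idea applied over several consecutive batches at once, sliding forward as new batches arrive.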

Discretized Streams (DStreams)

Windowed computations

Spark Streaming checkpoints

• Create heavy objects in foreachRDD
• Default persistence level of DStreams keeps the data serialized in memory
• Checkpointing (metadata and received data)
• Automatic restart (task manager)
• Max receiving rate
• Level of parallelism
• Kryo serialization

Spark Streaming Example

Spark Dataframes (SQL)

Apache Hive

• Hadoop product
• Stores metadata in a relational database, but data only in HDFS
• Is not suited for real-time data processing
• Best used for batch jobs over large datasets of immutable data (web logs)

Hive is a good choice if you:

• Want to query the data
• Are familiar with SQL

About Spark SQL

Part of Spark core since April 2014

Works with structured data

Mixes SQL queries with Spark programs

Connects to many data sources (files, Hive tables, external databases, RDDs)
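The "mix SQL queries with programs" point is the heart of Spark SQL. As a stand-in that runs anywhere, here is the same workflow shown with the stdlib sqlite3 module: register structured data under a table name, query it with SQL, then keep processing the result as ordinary program data. (This is an analogy, not Spark code; in Spark the equivalent steps use a DataFrame, a temp view, and an SQL query against it.)

```python
# Not Spark code: a stdlib sqlite3 sketch of the Spark SQL workflow --
# structured data registered as a table, queried with SQL, with the
# result flowing back into ordinary program logic.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ann", 31), ("Bob", 17), ("Eve", 25)])

# An SQL query mixed into the program...
rows = conn.execute(
    "SELECT name FROM people WHERE age >= 18 ORDER BY name").fetchall()

# ...and its result used as plain program data:
adults = [name for (name,) in rows]
print(adults)   # ['Ann', 'Eve']
```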

Spark Dataframes

Spark Dataframes

Spark SQL

Spark SQL with schema

Dataframes benchmark

Q&A
