Apache Spark Overview
Apache Spark
Agenda
Hadoop vs Spark: the big ‘Big Data’ question
Spark Ecosystem
What is RDD
Operations on RDD: Actions vs Transformations
Running in a cluster
Task schedulers
Spark Streaming
DataFrames API
Let’s remember: MapReduce
Apache Hadoop MapReduce
Hadoop VS/AND Spark
Hadoop: DFS
Spark: Speed (RAM)
Spark ecosystem
Glossary: Job, RDD, Stage, Task, DAG, Executor, Driver
Simple Example
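The content of this slide isn't preserved in the transcript; below is a minimal sketch of the classic word-count program often used as a first Spark example (the app name and input path are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Hypothetical app name and local master, for illustration only
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///data/input.txt") // hypothetical input path
      .flatMap(line => line.split(" "))                // split lines into words
      .map(word => (word, 1))                          // pair each word with a count of 1
      .reduceByKey(_ + _)                              // sum the counts per word

    counts.take(10).foreach(println)                   // action: triggers the computation
    sc.stop()
  }
}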
RDD: Resilient Distributed Dataset
Represents an immutable, partitioned collection of elements that can be operated on in parallel, with failure-recovery capabilities.
Example: HadoopRDD
getPartitions = HDFS blocks
getDependencies = None
compute = load block into memory
getPreferredLocations = HDFS block locations
partitioner = None
MapPartitionsRDD
getPartitions = same as parent
getDependencies = parent RDD
compute = compute parent and apply map()
getPreferredLocations = same as parent
partitioner = None
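These five properties correspond to methods on Spark's abstract RDD class; a simplified sketch, abridged from the real API:

// Simplified view of the core RDD interface (abridged; not the full class)
abstract class RDD[T] {
  // How the dataset is split into partitions (e.g. one per HDFS block)
  protected def getPartitions: Array[Partition]
  // Parent RDDs this one depends on (empty for an input RDD)
  protected def getDependencies: Seq[Dependency[_]]
  // How to compute a single partition, e.g. load a block or map the parent
  def compute(split: Partition, context: TaskContext): Iterator[T]
  // Placement hints, e.g. the hosts holding an HDFS block
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
  // Optional partitioner for key-value RDDs
  val partitioner: Option[Partitioner] = None
}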
RDD: Resilient Distributed Dataset
RDD Example
RDD Operations
● Transformations
○ Apply a user function to every element in a partition (map, filter)
○ Apply an aggregation function to the whole dataset (groupBy, sortBy)
○ Provide functionality for repartitioning (repartition, partitionBy)
● Actions
○ Materialize computation results (collect, count, take)
○ Store RDDs in memory or on disk (cache, persist)
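As an illustration of the split (assuming an existing SparkContext sc): transformations only build a lazy description of the computation, and nothing executes until an action is called.

val nums = sc.parallelize(1 to 1000000)  // create an RDD from a local range
val evens = nums.filter(_ % 2 == 0)      // transformation: nothing runs yet
val doubled = evens.map(_ * 2)           // transformation: still lazy
println(doubled.count())                 // action: triggers the actual job
println(doubled.take(5).mkString(", "))  // action: materializes a few elements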
RDD Dependencies
DAG: Directed Acyclic Graph
All the operators in a job are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible.
DAG Example
DAG Scheduler
The DAG scheduler divides the operator graph into stages of tasks. A stage consists of tasks based on partitions of the input data, and narrow operators within a stage are pipelined together.
DAG Scheduler example
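The figure from this slide isn't preserved; a sketch of the same idea in code (assuming an existing SparkContext sc; the input path is hypothetical). Narrow operations are pipelined into one stage, while the shuffle required by reduceByKey starts a new one:

// Stage 1: textFile -> flatMap -> map are pipelined over each input partition
val pairs = sc.textFile("hdfs:///logs")
  .flatMap(_.split(" "))
  .map((_, 1))
// Shuffle boundary: reduceByKey must regroup data by key across partitions
// Stage 2: reads the shuffled data and sums the counts
val counts = pairs.reduceByKey(_ + _)
counts.count()  // action: submits both stages to the DAG scheduler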
RDD Persistence: persist() & cache()
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).
Storage levels: MEMORY_ONLY (default), MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Removing data: Spark evicts old partitions in least-recently-used (LRU) fashion, or you can call RDD.unpersist() explicitly.
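A short sketch of persistence in practice (StorageLevel values are from the real API; the input path is hypothetical):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/big.txt")           // hypothetical path
val cached = lines.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk when RAM is full
cached.count()      // first action computes and stores the partitions
cached.count()      // second action reuses the stored partitions
cached.unpersist()  // explicitly free the storage instead of waiting for LRU eviction
// lines.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)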
Job execution
Task Schedulers
Standalone
• The default scheduler
• FIFO strategy
• Controls the number of CPU cores and executor memory
YARN
• Hadoop-oriented
• Takes all available resources
• Designed for stateless batch jobs that can be restarted easily if they fail
Mesos
• Resource-oriented
• Dynamic sharing of CPU cores
• Less predictable latency
Spark Driver (application)
Running in a cluster
Memory usage
• Execution memory
  • Storage for data needed during task execution
  • Shuffle-related data
• Storage memory
  • Cached RDDs
  • Possible to borrow from execution memory
• User memory
  • User data structures and internal metadata
  • Safeguards against OOM
• Reserved memory
  • Memory needed for running the executor itself
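Since Spark 1.6, execution and storage share a unified memory region whose split is tunable. A configuration sketch using real Spark properties (the values and app name are illustrative only):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("memory-demo")                     // hypothetical app name
  .set("spark.executor.memory", "4g")            // heap size per executor
  // Fraction of (heap - reserved) shared by execution and storage memory
  .set("spark.memory.fraction", "0.6")
  // Portion of that unified region protected for storage (cached RDDs)
  .set("spark.memory.storageFraction", "0.5")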
Spark Streaming
Spark Streaming: Basic Concept
Spark Streaming: Architecture
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
Discretized Streams (DStreams)
Windowed computations
Spark Streaming checkpoints
• Create heavy objects in foreachRDD
• The default persistence level of DStreams keeps the data serialized in memory
• Checkpointing (metadata and received data)
• Automatic restart (task manager)
• Max receiving rate
• Level of parallelism
• Kryo serialization
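A sketch of how a few of these knobs map onto real Spark properties (the values and checkpoint directory are illustrative only):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("tuning-demo").setMaster("local[2]")  // hypothetical app
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo
  .set("spark.streaming.receiver.maxRate", "10000") // max records/sec per receiver
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs:///checkpoints")               // metadata + data checkpoints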
Spark Streaming Example
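The example slide's code isn't preserved; below is a minimal streaming word count with a sliding window, assuming a socket text source on a hypothetical host and port:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))     // 1-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
val counts = lines.flatMap(_.split(" "))
  .map((_, 1))
  // Windowed computation: sum counts over the last 30s, sliding every 10s
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

counts.print()
ssc.start()
ssc.awaitTermination()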
Spark DataFrames (SQL)
Apache Hive
• Hadoop product
• Stores metadata in a relational database, but data only in HDFS
• Not suited for real-time data processing
• Best used for batch jobs over large datasets of immutable data (e.g. web logs)
Hive is a good choice if you:
• Want to query the data
• Are familiar with SQL
About Spark SQL
Part of Spark core since April 2014
Works with structured data
Mixes SQL queries with Spark programs
Connects to any data source (files, Hive tables, external databases, RDDs)
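A minimal sketch of mixing the DataFrame API with plain SQL. It uses the Spark 2.x SparkSession entry point (at the time of this deck the equivalent role was played by SQLContext); the JSON file is hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-demo").master("local[*]").getOrCreate()

// Load structured data; the schema is inferred from the JSON documents
val people = spark.read.json("hdfs:///people.json")  // hypothetical path

// DataFrame API ...
people.filter(people("age") > 21).select("name").show()

// ... and the same query in SQL over a temporary view
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 21").show()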
Spark DataFrames
Spark SQL
Spark SQL with schema
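When inference isn't possible or desirable, the schema can be supplied explicitly; a sketch using the real StructType API (the field names and file path are hypothetical, and spark is the SparkSession from the previous sketch):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Apply the explicit schema instead of paying for inference over the data
val people = spark.read.schema(schema).csv("hdfs:///people.csv")
people.printSchema()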
DataFrames benchmark
Q&A