Introduction to Apache Spark - Big Data · 2018-01-31
TRANSCRIPT
Introduction to Apache Spark
These training presentations were produced as part of the İstanbul Big Data Education and Research Center Project (no. TR10/16/YNY/0036), carried out under the İstanbul Development Agency's 2016 Innovative and Creative İstanbul Financial Support Program. Sole responsibility for the content belongs to Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.
A Major Step Backwards?
MapReduce is a step backward in database access:
Schemas are good
Separation of the schema from the application is good
High-level access languages are good
MapReduce is a poor implementation
Brute force and only brute force (no indexes, for example)
MapReduce is not novel
MapReduce is missing features
Bulk loader, indexing, updates, transactions…
MapReduce is incompatible with DBMS tools
Source: Blog post by DeWitt and Stonebraker
Need for High-Level Languages
MapReduce is great for one-pass, large-scale data processing
But writing Java programs for everything is verbose and slow
Data scientists don’t want to write Java
But it is inefficient for multi-pass algorithms
No efficient primitives for data sharing and iterative tasks
State between steps goes to distributed file system
Slow due to replication & disk storage
Move it onto a cluster
In a cluster setting, using MapReduce and Hadoop is slow
Hadoop writes to disk and is complex
How to improve this?
Example: Iterative Apps
[Figure: iterative jobs (iter. 1, iter. 2, ...) write intermediate state to the distributed file system and read it back between steps; interactive queries (query 1, 2, 3) each re-read the same input from the file system to produce their results]
Commonly spend 90% of time doing I/O
What we need is…
Resilient
Checkpointing
Fast, does not always store to disk
Replayable
Embarrassingly Parallel
Scala
Scala has many other nice features:
A type system that makes sense.
Traits.
Implicit conversions.
Pattern Matching.
XML literals, Parser combinators, ...
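As a minimal sketch (not from the slides), the fragment below illustrates traits, pattern matching, and an implicit conversion; the Shape/Circle/Rect names are invented for illustration:

object ScalaFeaturesDemo {
  trait Shape { def area: Double }                       // a trait: an interface that may carry implementations
  case class Circle(r: Double) extends Shape { val area = math.Pi * r * r }
  case class Rect(w: Double, h: Double) extends Shape { val area = w * h }

  def describe(s: Shape): String = s match {             // pattern matching with guards
    case Circle(r)            => s"circle of radius $r"
    case Rect(w, h) if w == h => s"square of side $w"
    case Rect(w, h)           => s"rectangle $w x $h"
  }

  implicit class RichInt(n: Int) { def squared: Int = n * n }  // implicit conversion adds a method to Int

  def main(args: Array[String]): Unit =
    println(describe(Circle(1.0)) + ", " + 3.squared)
}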
Spark: A Brief History
2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark becomes a top-level Apache project
Why better than Hadoop?
• In-memory as opposed to on-disk processing
• Data can be cached in memory or on disk for future use
• Fast: up to 100 times faster, since it uses memory instead of disk
• Easier than Hadoop while being functional; runs a general DAG of operations
• APIs in Java, Scala, Python, R
WordCount MapReduce vs Spark
WordCount in 50+ lines of Java MapReduce vs. WordCount in 3 lines of Spark
Word Count Example
text_file = sc.textFile("hdfs://...wordsList.txt")
wordcounts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .collect()
('rat',2),('elephant',1),('cat',2)
Disk vs Memory
L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 100 ns
Main memory reference: 100 ns
Disk seek: 10,000,000 ns
Network vs Local
Send 2K bytes over 1 Gbps network: 20,000 ns
Read 1 MB sequentially from memory: 250,000 ns
Round trip within same datacenter: 500,000 ns
Read 1 MB sequentially from network: 10,000,000 ns
Read 1 MB sequentially from disk: 30,000,000 ns
Send packet CA->Netherlands->CA: 150,000,000 ns
Spark Architecture
Resilient Distributed Datasets (RDDs)
RDD: Spark “primitive” representing a collection of records
Immutable
Partitioned (the D in RDD)
Transformations operate on an RDD to create another
RDD
Coarse-grained manipulations only
RDDs keep track of lineage
Persistence
RDDs can be materialized in memory or on disk
OOM or machine failures: What happens?
Fault tolerance (the R in RDD):
RDDs can always be recomputed from stable storage (disk)
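A small sketch of persistence, assuming an existing SparkContext sc and a hypothetical log path; persist/cache and StorageLevel are standard Spark APIs:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://...")                  // hypothetical input path
val errors = logs.filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)          // materialize in memory, spill to disk if it does not fit
// errors.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
errors.count()                                        // first action computes and caches the partitions
errors.filter(_.contains("timeout")).count()          // reuses the cached data; lost partitions are recomputed from lineage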
Resilient Distributed Datasets (RDDs)
Resilient:
• Recover from errors, e.g. node failure, slow processes
• Track history of each partition, re-run
Fault Tolerance
file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda pair: pair[1] > 10)
[Figure: lineage graph: input file → map → reduce → filter; a lost partition is rebuilt by replaying these steps]
RDDs track lineage info to rebuild lost data
Distributed:
• Distributed across the cluster of machines
• Divided in partitions, atomic chunks of data
Resilient Distributed Datasets (RDDs)
Dataset:
• Data storage created from: HDFS, S3, HBase, JSON, text, Local hierarchy of folders
• Or created transforming another RDD
Resilient Distributed Datasets (RDDs)
messages.saveAsTextFile("errors.txt") kicks off a computation
Resilient Distributed Datasets (RDDs)
» Immutable collections of objects, spread across cluster
» Statically typed: RDD[T] has objects of type T
val sc = new SparkContext()
val lines = sc.textFile("log.txt") // RDD[String]
// Transform using standard collection operations
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2)) // lazily evaluated
Spark Driver and Workers
A Spark program is really two programs: » a driver program and a worker program
Worker programs run on cluster nodes or in local threads
DataFrames are distributed across workers
[Figure: the driver program (SparkContext, sqlContext) talks to a cluster manager or local threads; each Worker runs a Spark executor; data lives in Amazon S3, HDFS, or other storage]
Spark Program
1) Create some input RDDs from external data or parallelize a collection in your driver program.
2) Lazily transform them to define new RDDs using
transformations like filter() or map()
3) Ask Spark to cache() any intermediate RDDs that
will need to be reused.
4) Launch actions such as count() and collect() to
kick off a parallel computation, which is then optimized
and executed by Spark.
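A minimal Scala sketch of these four steps, assuming an existing SparkContext sc; the log path and field layout are hypothetical:

val lines  = sc.textFile("hdfs://.../access.log")     // 1) input RDD from external data
val nums   = sc.parallelize(1 to 1000)                //    ...or parallelize a collection in the driver
val errors = lines.filter(_.contains("ERROR"))        // 2) lazy transformations define new RDDs
val codes  = errors.map(_.split(" ")(0))
codes.cache()                                         // 3) cache an intermediate RDD that will be reused
println(codes.count())                                // 4) actions kick off the parallel computation
codes.collect().foreach(println)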
Operations on RDDs Transformations (lazy):
map
flatMap
filter
union/intersection
join
reduceByKey
groupByKey
…
Actions (actually trigger computations)
collect
saveAsTextFile/saveAsSequenceFile
…
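A short sketch contrasting lazy transformations with actions, assuming an existing SparkContext sc; the sample data and output path are made up:

val a = sc.parallelize(Seq(("spark", 1), ("hadoop", 2)))
val b = sc.parallelize(Seq(("spark", 3)))
val joined = a.join(b)                                 // transformation: nothing runs yet
val merged = a.union(b).reduceByKey(_ + _)             // more lazy transformations
merged.collect().foreach(println)                      // action: triggers the computation
merged.saveAsTextFile("counts_out")                    // action: writes the results (hypothetical path)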
Spark with Python
Deploying code to the cluster
Talking to Cluster Manager
Manager can be:
YARN
Mesos
Spark Standalone
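A sketch of how the choice of cluster manager is expressed through the master URL in SparkConf; the hosts and ports are placeholders, and in practice the master is usually supplied to spark-submit rather than hard-coded:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example")
  .setMaster("local[*]")               // run in local threads
  // .setMaster("yarn")                // Hadoop YARN
  // .setMaster("mesos://host:5050")   // Apache Mesos
  // .setMaster("spark://host:7077")   // Spark Standalone
val sc = new SparkContext(conf)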
RDDs, Stages, and Tasks
rdd1.join(rdd2).groupBy(…).filter(…)
RDD Objects: build the operator DAG
DAG Scheduler: split the graph into stages of tasks; submit each stage (as a TaskSet) as it becomes ready
Task Scheduler: launch tasks via the cluster manager; retry failed or straggling tasks
Worker: execute tasks in threads; store and serve blocks via the block manager
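A small sketch, assuming an existing SparkContext sc, of inspecting this process: toDebugString prints the RDD lineage, and its indentation marks the shuffle boundaries where the DAG scheduler splits the job into stages:

val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val other  = sc.parallelize(Seq(("a", "x"), ("b", "y")))
val result = pairs.join(other).filter(_._2._1 > 1)
println(result.toDebugString)                          // lineage with stage (shuffle) boundaries
result.collect()                                       // submitting the job runs the stages as tasks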
Where does code run?
Local or Distributed? » Locally, in the driver
» Distributed at the executors
» Both at the driver and the executors
» Transformations run at executors
» Actions run at executors and
driver
Important point: » Executors run in parallel
» Executors have much more memory
[Figure: your application (the driver program) sends tasks to Workers, each running a Spark executor]
Spark Components
MLlib algorithms
classification: logistic regression, linear SVM, naïve Bayes, classification tree
regression: generalized linear models (GLMs), regression tree
collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
clustering: k-means||
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS
Machine Learning Library (MLlib)
70+ contributors in the past year
points = context.sql("select latitude, longitude from tweets")
model = KMeans.train(points, 10)
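A slightly fuller Scala sketch of the same idea with the MLlib RDD API, assuming an existing SparkContext sc; the coordinates are made-up sample points:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(41.0, 29.0),
  Vectors.dense(40.9, 29.1),
  Vectors.dense(37.0, 35.3)
))
val model = KMeans.train(points, 2, 20)                // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)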
Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
[Figure: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as an RDD and processes it using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
• Batch sizes as low as ½ second, latency ~1 second
• Potential for combining batch processing and streaming processing in the same system
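A minimal Scala sketch of such a micro-batch job with the standard StreamingContext API; the socket source on localhost:9999 is a placeholder:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))      // 1-second batches
val lines  = ssc.socketTextStream("localhost", 9999)   // placeholder text source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                         // each batch is processed as RDDs
ssc.start()
ssc.awaitTermination()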
MLlib + SQL
df = context.sql("select latitude, longitude from tweets")
model = pipeline.fit(df)
DataFrames in Spark 1.3 (as of March 2015)
Powerful when coupled with the new pipeline API
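A hedged Scala sketch of the pipeline API behind pipeline.fit(df), assuming a SparkSession named spark; the columns and the tiny DataFrame are invented for illustration:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val df = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "hadoop map reduce", 0.0)
)).toDF("id", "text", "label")                         // made-up training data

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)
val pipeline  = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model     = pipeline.fit(df)                       // fits all stages on the DataFrame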
Spark SQL
// Run SQL statements
val teenagers = context.sql( "SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are RDDs of Row objects
val names = teenagers.map(t => "Name: " + t(0)).collect()
GraphX
Build graph using RDDs of nodes and edges
Run standard algorithms such as PageRank
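A minimal sketch, assuming an existing SparkContext sc, that builds a tiny made-up graph from vertex and edge RDDs and runs the built-in PageRank:

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph    = Graph(vertices, edges)
val ranks    = graph.pageRank(0.001).vertices          // standard PageRank, tolerance 0.001
ranks.collect().foreach(println)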
GraphX
General graph processing library
MLlib + GraphX