Introduction to Apache Spark - Big Data · 2018-01-31
TRANSCRIPT
Introduction to Apache Spark
These training presentations were produced as part of the İstanbul Big Data Education and Research Center Project (no. TR10/16/YNY/0036), carried out under the İstanbul Development Agency's 2016 Innovative and Creative İstanbul Financial Support Program. Sole responsibility for the content belongs to Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.
A Major Step Backwards?
MapReduce is a step backward in database access:
Schemas are good
Separation of the schema from the application is good
High-level access languages are good
MapReduce is a poor implementation
Brute force and only brute force (no indexes, for example)
MapReduce is not novel
MapReduce is missing features
Bulk loader, indexing, updates, transactions…
MapReduce is incompatible with DBMS tools
Source: Blog post by DeWitt and Stonebraker
Need for High-Level Languages
MapReduce is great for one-pass, large-scale data processing
But writing Java programs for everything is verbose and slow
Data scientists don’t want to write Java
But it is inefficient for multi-pass algorithms
No efficient primitives for data sharing and iterative tasks
State between steps goes to distributed file system
Slow due to replication & disk storage
Move it onto a cluster
In a cluster setting, using MapReduce and Hadoop is slow
Hadoop writes to disk and is complex
How to improve this?
Example: Iterative Apps
[Figure: iterative jobs (iter. 1, iter. 2, ...) write intermediate state to the distributed file system and read it back between steps; interactive queries (query 1, 2, 3) each re-read the same input from the file system to produce their results]
Commonly spend 90% of time doing I/O
What we need is…
Resilient
Checkpointing
Fast, does not always store to disk
Replayable
Embarrassingly Parallel
Scala
Scala has many other nice features:
A type system that makes sense.
Traits.
Implicit conversions.
Pattern Matching.
XML literals, Parser combinators, ...
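As a minimal sketch (not from the slides), the fragment below illustrates traits, pattern matching, and an implicit conversion; the Shape/Circle/Rect names are invented for illustration:

object ScalaFeaturesDemo {
  trait Shape { def area: Double }                       // a trait: an interface that may carry implementations
  case class Circle(r: Double) extends Shape { val area = math.Pi * r * r }
  case class Rect(w: Double, h: Double) extends Shape { val area = w * h }

  def describe(s: Shape): String = s match {             // pattern matching with guards
    case Circle(r)            => s"circle of radius $r"
    case Rect(w, h) if w == h => s"square of side $w"
    case Rect(w, h)           => s"rectangle $w x $h"
  }

  implicit class RichInt(n: Int) { def squared: Int = n * n }  // implicit conversion adds a method to Int

  def main(args: Array[String]): Unit =
    println(describe(Circle(1.0)) + ", " + 3.squared)
}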
Spark: A Brief History
2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark becomes a top-level Apache project
Why better than Hadoop?
• In-memory as opposed to on-disk processing
• Data can be cached in memory or on disk for future use
• Fast: up to 100 times faster, since it uses memory instead of disk
• Easier than Hadoop while being functional; runs a general DAG of operations
• APIs in Java, Scala, Python, R
WordCount MapReduce vs Spark
WordCount in 50+ lines of Java MapReduce vs. WordCount in 3 lines of Spark
Word Count Example
text_file = sc.textFile("hdfs://...wordsList.txt")
wordcounts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .collect()
('rat',2),('elephant',1),('cat',2)
Disk vs Memory
L1 cache reference: 0.5 ns
L2 cache reference: 7 ns
Mutex lock/unlock: 100 ns
Main memory reference: 100 ns
Disk seek: 10,000,000 ns
Network vs Local
Send 2K bytes over 1 Gbps network: 20,000 ns
Read 1 MB sequentially from memory: 250,000 ns
Round trip within same datacenter: 500,000 ns
Read 1 MB sequentially from network: 10,000,000 ns
Read 1 MB sequentially from disk: 30,000,000 ns
Send packet CA->Netherlands->CA: 150,000,000 ns
Spark Architecture
Resilient Distributed Datasets (RDDs)
RDD: Spark “primitive” representing a collection of records
Immutable
Partitioned (the D in RDD)
Transformations operate on an RDD to create another
RDD
Coarse-grained manipulations only
RDDs keep track of lineage
Persistence
RDDs can be materialized in memory or on disk
OOM or machine failures: What happens?
Fault tolerance (the R in RDD):
RDDs can always be recomputed from stable storage (disk)
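A small sketch of persistence, assuming an existing SparkContext sc and a hypothetical log path; persist/cache and StorageLevel are standard Spark APIs:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://...")                  // hypothetical input path
val errors = logs.filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)          // materialize in memory, spill to disk if it does not fit
// errors.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
errors.count()                                        // first action computes and caches the partitions
errors.filter(_.contains("timeout")).count()          // reuses the cached data; lost partitions are recomputed from lineage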
Resilient Distributed Datasets (RDDs)
Resilient:
• Recover from errors, e.g. node failure, slow processes
• Track history of each partition, re-run
Fault Tolerance
file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda pair: pair[1] > 10)
[Figure: lineage graph: input file → map → reduce → filter; a lost partition is rebuilt by replaying these steps]
RDDs track lineage info to rebuild lost data
Distributed:
• Distributed across the cluster of machines
• Divided in partitions, atomic chunks of data
Resilient Distributed Datasets (RDDs)
Dataset:
• Data storage created from: HDFS, S3, HBase, JSON, text, Local hierarchy of folders
• Or created transforming another RDD
Resilient Distributed Datasets (RDDs)
messages.saveAsTextFile("errors.txt") kicks off a computation
Resilient Distributed Datasets (RDDs)
» Immutable collections of objects, spread across cluster
» Statically typed: RDD[T] has objects of type T
val sc = new SparkContext()
val lines = sc.textFile("log.txt") // RDD[String]
// Transform using standard collection operations
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2)) // lazily evaluated
Spark Driver and Workers
A Spark program is really two programs: » a driver program and a worker program
Worker programs run on cluster nodes or in local threads
DataFrames are distributed across workers
[Figure: the driver program (SparkContext, sqlContext) talks to a cluster manager or local threads; each Worker runs a Spark executor; data lives in Amazon S3, HDFS, or other storage]
Spark Program
1) Create some input RDDs from external data or parallelize a collection in your driver program.
2) Lazily transform them to define new RDDs using
transformations like filter() or map()
3) Ask Spark to cache() any intermediate RDDs that
will need to be reused.
4) Launch actions such as count() and collect() to
kick off a parallel computation, which is then optimized
and executed by Spark.
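A minimal Scala sketch of these four steps, assuming an existing SparkContext sc; the log path and field layout are hypothetical:

val lines  = sc.textFile("hdfs://.../access.log")     // 1) input RDD from external data
val nums   = sc.parallelize(1 to 1000)                //    ...or parallelize a collection in the driver
val errors = lines.filter(_.contains("ERROR"))        // 2) lazy transformations define new RDDs
val codes  = errors.map(_.split(" ")(0))
codes.cache()                                         // 3) cache an intermediate RDD that will be reused
println(codes.count())                                // 4) actions kick off the parallel computation
codes.collect().foreach(println)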
Operations on RDDs Transformations (lazy):
map
flatMap
filter
union/intersection
join
reduceByKey
groupByKey
…
Actions (actually trigger computations)
collect
saveAsTextFile/saveAsSequenceFile
…
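A short sketch contrasting lazy transformations with actions, assuming an existing SparkContext sc; the sample data and output path are made up:

val a = sc.parallelize(Seq(("spark", 1), ("hadoop", 2)))
val b = sc.parallelize(Seq(("spark", 3)))
val joined = a.join(b)                                 // transformation: nothing runs yet
val merged = a.union(b).reduceByKey(_ + _)             // more lazy transformations
merged.collect().foreach(println)                      // action: triggers the computation
merged.saveAsTextFile("counts_out")                    // action: writes the results (hypothetical path)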
Spark with Python
Deploying code to the cluster
Talking to Cluster Manager
Manager can be:
YARN
Mesos
Spark Standalone
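A sketch of how the choice of cluster manager is expressed through the master URL in SparkConf; the hosts and ports are placeholders, and in practice the master is usually supplied to spark-submit rather than hard-coded:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example")
  .setMaster("local[*]")               // run in local threads
  // .setMaster("yarn")                // Hadoop YARN
  // .setMaster("mesos://host:5050")   // Apache Mesos
  // .setMaster("spark://host:7077")   // Spark Standalone
val sc = new SparkContext(conf)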
RDDs, Stages, and Tasks
rdd1.join(rdd2).groupBy(…).filter(…)
RDD Objects: build the operator DAG
DAG Scheduler: split the graph into stages of tasks; submit each stage (as a TaskSet) as it becomes ready
Task Scheduler: launch tasks via the cluster manager; retry failed or straggling tasks
Worker: execute tasks in threads; store and serve blocks via the block manager
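A small sketch, assuming an existing SparkContext sc, of inspecting this process: toDebugString prints the RDD lineage, and its indentation marks the shuffle boundaries where the DAG scheduler splits the job into stages:

val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val other  = sc.parallelize(Seq(("a", "x"), ("b", "y")))
val result = pairs.join(other).filter(_._2._1 > 1)
println(result.toDebugString)                          // lineage with stage (shuffle) boundaries
result.collect()                                       // submitting the job runs the stages as tasks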
Where does code run?
Local or Distributed? » Locally, in the driver
» Distributed at the executors
» Both at the driver and the executors
» Transformations run at executors
» Actions run at executors and
driver
Important point: » Executors run in parallel
» Executors have much more memory
[Figure: your application (the driver program) sends tasks to Workers, each running a Spark executor]
Spark Components
MLlib algorithms
classification: logistic regression, linear SVM, naïve Bayes, classification tree
regression: generalized linear models (GLMs), regression tree
collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
clustering: k-means||
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS
Machine Learning Library (MLlib)
70+ contributors in the past year
points = context.sql("select latitude, longitude from tweets")
model = KMeans.train(points, 10)
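A slightly fuller Scala sketch of the same idea with the MLlib RDD API, assuming an existing SparkContext sc; the coordinates are made-up sample points:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(41.0, 29.0),
  Vectors.dense(40.9, 29.1),
  Vectors.dense(37.0, 35.3)
))
val model = KMeans.train(points, 2, 20)                // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)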
Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
[Figure: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as an RDD and processes it using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
• Batch sizes as low as ½ second, latency ~1 second
• Potential for combining batch processing and streaming processing in the same system
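A minimal Scala sketch of such a micro-batch job with the standard StreamingContext API; the socket source on localhost:9999 is a placeholder:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))      // 1-second batches
val lines  = ssc.socketTextStream("localhost", 9999)   // placeholder text source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                         // each batch is processed as RDDs
ssc.start()
ssc.awaitTermination()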
MLlib + SQL
df = context.sql("select latitude, longitude from tweets")
model = pipeline.fit(df)
DataFrames in Spark 1.3 (as of March 2015)
Powerful when coupled with the new pipeline API
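A hedged Scala sketch of the pipeline API behind pipeline.fit(df), assuming a SparkSession named spark; the columns and the tiny DataFrame are invented for illustration:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val df = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "hadoop map reduce", 0.0)
)).toDF("id", "text", "label")                         // made-up training data

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)
val pipeline  = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model     = pipeline.fit(df)                       // fits all stages on the DataFrame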
Spark SQL
// Run SQL statements
val teenagers = context.sql( "SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are RDDs of Row objects
val names = teenagers.map(t => "Name: " + t(0)).collect()
GraphX
Build graph using RDDs of nodes and edges
Run standard algorithms such as PageRank
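A minimal sketch, assuming an existing SparkContext sc, that builds a tiny made-up graph from vertex and edge RDDs and runs the built-in PageRank:

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph    = Graph(vertices, edges)
val ranks    = graph.pageRank(0.001).vertices          // standard PageRank, tolerance 0.001
ranks.collect().foreach(println)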
GraphX
General graph processing library
MLlib + GraphX