Posted on 06-Jan-2017
TRANSCRIPT
Apache Spark 2.0: Faster, Easier, and Smarter
Reynold Xin (@rxin), 2016-05-05 Webinar
About Databricks
Founded by creators of Spark in 2013
Cloud enterprise data platform:
- Managed Spark clusters
- Interactive data science
- Production pipelines
- Data governance, security, …
What is Apache Spark?
Unified engine across data workloads and platforms
…
SQL, Streaming, ML, Graph, Batch, …
A slide from 2013 …
Spark 2.0: Steps to bigger & better things…
Builds on all we learned in past 2 years
Versioning in Spark
In reality, we hate breaking APIs! We will not do so except for dependency conflicts (e.g. Guava) and experimental APIs.
1 . 6 . 0
- Major version (may change APIs)
- Minor version (adds APIs / features)
- Patch version (only bug fixes)
Major Features in 2.0
- Tungsten Phase 2: speedups of 5-20x
- Structured Streaming
- SQL 2003 & unifying Datasets and DataFrames
API Foundation for the Future
Dataset, DataFrame, SQL, ML
Towards SQL 2003
As of this week, Spark branch-2.0 can run all 99 TPC-DS queries!
- New standard-compliant parser (with good error messages!)
- Subqueries (correlated & uncorrelated)
- Approximate aggregate stats
Datasets and DataFrames
In 2015, we added DataFrames & Datasets as structured data APIs:
- DataFrames are collections of rows with a schema
- Datasets add static types, e.g. Dataset[Person]
- Both run on Tungsten
Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]
SparkSession – a new entry point
SparkSession is the “SparkContext” for Dataset/DataFrame
- Entry point for reading data
- Working with metadata
- Configuration
- Cluster resource management
Notebook demo
http://bit.ly/1SMPEzQ
and
http://bit.ly/1OeqdSn
Long-Term
RDD will remain the low-level API in Spark
Datasets & DataFrames give richer semantics and optimizations
- New libraries will increasingly use these as the interchange format
- Examples: Structured Streaming, MLlib, GraphFrames
Other notable API improvements
DataFrame-based ML pipeline API becoming the main MLlib API
ML model & pipeline persistence with almost complete coverage
- In all programming languages: Scala, Java, Python, R

Improved R support
- (Parallelizable) user-defined functions in R
- Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-Means
Structured Streaming
How do we simplify streaming?
Background
Real-time processing is vital for streaming analytics
Apps need a combination of batch & interactive queries:
- Track state using a stream, then run SQL queries
- Train an ML model offline, then update it
Integration Example
Streaming engine
Stream: (home.html, 10:08), (product.html, 10:09), (home.html, 10:10), …
What can go wrong?
- Late events
- Partial outputs to MySQL
- State recovery on failure
- Distributed reads/writes
- …
MySQL
Page     Minute   Visits
home     10:09    21
pricing  10:10    30
...      ...      ...
Complex Programming Models
- Data: late arrival, varying distribution over time, …
- Processing: business logic change & new ops (windows, sessions)
- Output: how do we define output over time & correctness?
The simplest way to perform streaming analyticsis not having to reason about streaming.
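This idea can be sketched without Spark at all: treat the stream as an ever-growing table and maintain the aggregate over everything seen so far. The pure-Python toy below (hypothetical names, not Spark's API) shows why a late event stops being a special case under this model: it is just another row that updates the minute it belongs to.

```python
from collections import defaultdict

# Toy model of "streaming as an unbounded table": each event is a row
# appended to the input, and the result is an aggregation over all rows
# seen so far, keyed by (page, minute).
def aggregate(events):
    visits = defaultdict(int)
    for page, minute in events:
        visits[(page, minute)] += 1
    return dict(visits)

events = [
    ("home.html", "10:08"),
    ("product.html", "10:09"),
    ("home.html", "10:10"),
    ("home.html", "10:08"),  # a late event for an earlier minute
]
result = aggregate(events)
print(result[("home.html", "10:08")])  # 2: the late event updated 10:08
```

No windowing or watermark logic appears in the user's code; deciding how and when to emit updated rows is the engine's job, not the application's.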
Spark 1.3: Static DataFrames → Spark 2.0: Infinite DataFrames. Single API!

Example: Batch Aggregation
logs = ctx.read.format("json").open("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
    .write.format("jdbc")
    .save("jdbc:mysql://...")

Example: Continuous Aggregation
logs = ctx.read.format("json").stream("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
    .write.format("jdbc")
    .startStream("jdbc:mysql://...")
Structured Streaming
High-level streaming API built on the Spark SQL engine
- Declarative API that extends DataFrames / Datasets
- Event time, windowing, sessions, sources & sinks

Support for interactive & batch queries
- Aggregate data in a stream, then serve using JDBC
- Change queries at runtime
- Build and apply ML models

Not just streaming, but "continuous applications"
Goal: end-to-end continuous applications
Example
Kafka → ETL → Database, serving reporting applications, an ML model, and ad-hoc queries (traditional streaming combined with other processing types)
Tungsten Phase 2
Can we speed up Spark by 10X?
Demo
http://bit.ly/1X8LKmH
Going back to the fundamentals
It is difficult to get order-of-magnitude performance speedups with profiling techniques
- For a 10x improvement, you would need to find top hotspots that add up to 90% of runtime and make them instantaneous
- For 100x, 99%

Instead, look bottom-up: how fast should it run?
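The arithmetic behind those percentages is just Amdahl's law; a quick sketch:

```python
# Amdahl's law: if a fraction f of total runtime is optimized down to
# (almost) zero, the overall speedup is capped at 1 / (1 - f).
# So 10x requires hotspots covering 90% of runtime, and 100x requires 99%.
def max_speedup(hotspot_fraction):
    return 1.0 / (1.0 - hotspot_fraction)

print(round(max_speedup(0.90)))  # 10
print(round(max_speedup(0.99)))  # 100
```

Since real hotspots can only be reduced, not eliminated, profiling-driven tuning rarely reaches these bounds, which motivates rethinking the execution model itself.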
Scan
Filter
Project
Aggregate
select count(*) from store_sales
where ss_item_sk = 1000
Volcano Iterator Model
Standard for 30 years: almost all databases do it
Each operator is an “iterator” that consumes records from its input operator
class Filter {
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    return found
  }

  def fetch(): InternalRow = {
    child.fetch()
  }
  …
}
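To make the control flow concrete, here is a minimal Python sketch of the same iterator model (an illustration only; `Scan`, `Filter`, and the driver loop are simplified stand-ins, not Spark classes). Every record flows through a chain of next()/fetch()-style dynamic dispatches, which is exactly the overhead the Tungsten work targets:

```python
# Volcano iterator model: each operator pulls one record at a time from
# its child via virtual next()/fetch() calls.
class Scan:
    def __init__(self, rows):
        self._it = iter(rows)
        self._current = None

    def next(self):
        self._current = next(self._it, None)
        return self._current is not None

    def fetch(self):
        return self._current

class Filter:
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next(self):
        # Keep pulling from the child until a record passes the predicate.
        while self.child.next():
            if self.predicate(self.child.fetch()):
                return True
        return False

    def fetch(self):
        return self.child.fetch()

# count(*) where ss_item_sk = 1000, driven record by record
op = Filter(Scan([1000, 42, 1000, 7]), lambda r: r == 1000)
count = 0
while op.next():
    count += 1
print(count)  # 2
```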
What if we hire a college freshman to implement this query in Java in 10 mins?

select count(*) from store_sales
where ss_item_sk = 1000
var count = 0
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1
  }
}
Volcano model (30+ years of database research) vs. college freshman (hand-written code in 10 mins)

Volcano: 13.95 million rows/sec
College freshman: 125 million rows/sec (high throughput)

Note: End-to-end, single thread, single column, and data originated in Parquet on disk
How does a student beat 30 years of research?
Volcano:
1. Many virtual function calls
2. Data in memory (or cache)
3. No loop unrolling, SIMD, pipelining

Hand-written code:
1. No virtual function calls
2. Data in CPU registers
3. Compiler loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
Scan
Filter
Project
Aggregate
long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}
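A toy way to see the "Spark as a compiler" idea in Python (purely illustrative; Spark actually generates Java source and compiles it with Janino): build the fused loop as source text at query-compilation time, compile it once, then run it with no per-row virtual calls.

```python
# Hypothetical whole-stage code generation sketch: the plan
# Scan -> Filter -> Aggregate collapses into one tight loop, emitted as
# source text and compiled before execution.
GENERATED = """
def compiled_query(store_sales):
    count = 0
    for ss_item_sk in store_sales:   # fused scan + filter + aggregate
        if ss_item_sk == 1000:
            count += 1
    return count
"""

namespace = {}
exec(compile(GENERATED, "<generated>", "exec"), namespace)
print(namespace["compiled_query"]([1000, 42, 1000, 7]))  # 2
```

The compiled function is exactly the college freshman's loop, but produced automatically from the query plan, so it benefits from everything known at query compilation time.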
Tungsten Phase 2: Spark as a “Compiler”
Functionality of a general-purpose execution engine; performance as if the system were hand-built just to run your query
Performance of Core Primitives
cost per row (single thread)
primitive               Spark 1.6   Spark 2.0
filter                  15 ns       1.1 ns
sum w/o group           14 ns       0.9 ns
sum w/ group            79 ns       10.7 ns
hash join               115 ns      4.0 ns
sort (8-bit entropy)    620 ns      5.3 ns
sort (64-bit entropy)   620 ns      40 ns
sort-merge join         750 ns      700 ns
Intel Haswell i7 4960HQ 2.6GHz, HotSpot 1.8.0_60-b27, Mac OS X 10.11
[Chart: Preliminary TPC-DS runtimes in seconds (0-600), Spark 2.0 vs 1.6; lower is better]
Databricks Community Edition
Best place to try & learn Spark.
Release Schedule
Today: work-in-progress source code available on GitHub
Next week: preview of Spark 2.0 in Databricks Community Edition
Early June: Apache Spark 2.0 GA
Today’s talk
Spark 2.0 doubles down on what made Spark attractive:
- Faster: Project Tungsten Phase 2, i.e. "Spark as a compiler"
- Easier: unified APIs & SQL 2003
- Smarter: Structured Streaming
- Only scratched the surface here, as Spark 2.0 will resolve ~2,000 tickets

Learn Spark on Databricks Community Edition
- Join the beta waitlist: https://databricks.com/ce/
Discount code: Meetup16SF
Thank you. Don't forget to register for Spark Summit SF!