cassandra2013sparktalkfinal-130618122907-phpapp01
TRANSCRIPT
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
1/128
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
2/128
Who is this guy
Sta!Engineer, Compute and Data Services, Ooyala
Building multiple web-scale real-time systems on top of C*, Kafka,Storm, etc.
Scala/Akka guy
Very excited by open source, big data projects - share some today
@evanfchan
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
3/128
Agenda
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
4/128
Agenda
Ooyala and Cassandra
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
5/128
Agenda
Ooyala and Cassandra
What problem are we trying to solve?
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
6/128
Agenda
Ooyala and Cassandra
What problem are we trying to solve?
Spark and Shark
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
7/128
Agenda
Ooyala and Cassandra
What problem are we trying to solve?
Spark and Shark
Our Spark/Cassandra Architecture
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
8/128
Agenda
Ooyala and Cassandra
What problem are we trying to solve?
Spark and Shark
Our Spark/Cassandra Architecture
Demo
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
9/128
Cassandra at Ooyala
Who is Ooyala, and how we use Cassandra
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
10/128
OOYALAPowering personalizedvideo
experiences across all screens.
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
11/128
Founded in 2007
Commercially launch in 2009
230+ employees in Silicon Valley, LA, NYC,London, Paris, Tokyo, Sydney & Guadalajara
Global footprint, 200M unique users,110+ countries, and more than 6,000 websites
Over 1 billion videos played per monthand 2 billion analytic events per day
25% of U.S. online viewers watch videopowered by Ooyala
COMPANY OVERVIEW
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
12/128
TRUSTED VIDEO PARTNER
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
13/128
We are a large Cassandra user
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
14/128
We are a large Cassandra user
11 clusters ranging in size from 3 to 36 nodes
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
15/128
We are a large Cassandra user
11 clusters ranging in size from 3 to 36 nodes
Total of 28TB of data managed over ~85 nodes
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
16/128
We are a large Cassandra user
11 clusters ranging in size from 3 to 36 nodes
Total of 28TB of data managed over ~85 nodes
Over 2 billion C* column writes per day
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
17/128
We are a large Cassandra user
11 clusters ranging in size from 3 to 36 nodes
Total of 28TB of data managed over ~85 nodes
Over 2 billion C* column writes per day
Powers all of our analytics infrastructure
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
18/128
We are a large Cassandra user
11 clusters ranging in size from 3 to 36 nodes
Total of 28TB of data managed over ~85 nodes
Over 2 billion C* column writes per day
Powers all of our analytics infrastructure
Much much bigger cluster coming soon
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
19/128
What problem are we trying tosolve?
Lots of data, complex queries, answered really quickly... but how??
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
20/128
From mountains of useless data...
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
21/128
To nuggets of truth...
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
22/128
To nuggets of truth...
Quickly
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
23/128
To nuggets of truth...
Quickly
Painlessly
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
24/128
To nuggets of truth...
Quickly
Painlessly
At scale?
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
25/128
Today: Precomputed aggregates
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
26/128
Today: Precomputed aggregates
Video metrics computed along several high cardinality dimensions
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
27/128
Today: Precomputed aggregates
Video metrics computed along several high cardinality dimensions
Very fast lookups, but inflexible, and hard to change
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
28/128
Today: Precomputed aggregates
Video metrics computed along several high cardinality dimensions
Very fast lookups, but inflexible, and hard to change
Most computed aggregates are never read
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
29/128
Today: Precomputed aggregates
Video metrics computed along several high cardinality dimensions
Very fast lookups, but inflexible, and hard to change
Most computed aggregates are never read
What if we need more dynamic queries?
Top content for mobile users in France
Engagement curves for users who watched recommendations Data mining, trends, machine learning
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
30/128
The static - dynamic continuum
Super fast lookups
Inflexible, wasteful
Best for 80% mostcommon queries
Always compute resultsfrom raw data
Flexible but slow
100% Precomputation 100% Dynamic
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
31/128
The static - dynamic continuum
Super fast lookups
Inflexible, wasteful
Best for 80% mostcommon queries
Always compute resultsfrom raw data
Flexible but slow
100% Precomputation100% Dynamic
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
32/128
Where we want to be
Partly dynamic
Pre-aggregate mostcommon queries
Flexible, fast dynamicqueries
Easily generate manymaterialized views
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
33/128
Industry Trends
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
34/128
Industry Trends
Fast execution frameworks
Impala
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
35/128
Industry Trends
Fast execution frameworks
Impala
In-memory databases
VoltDB, Druid
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
36/128
Industry Trends
Fast execution frameworks
Impala
In-memory databases
VoltDB, Druid
Streaming and real-time
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
37/128
Industry Trends
Fast execution frameworks
Impala
In-memory databases
VoltDB, Druid
Streaming and real-time
Higher-level, productive data frameworks Cascading, Hive, Pig
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
38/128
Why Spark and Shark?
Lightning-fast in-memory cluster computing
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
39/128
Introduction to Spark
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
40/128
Introduction to Spark
In-memory distributed computing framework
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
41/128
Introduction to Spark
In-memory distributed computing framework
Created by UC Berkeley AMP Lab in 2010
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
42/128
Introduction to Spark
In-memory distributed computing framework
Created by UC Berkeley AMP Lab in 2010
Targeted problems that MR is bad at:
Iterative algorithms (machine learning)
Interactive data mining
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
43/128
Introduction to Spark
In-memory distributed computing framework
Created by UC Berkeley AMP Lab in 2010
Targeted problems that MR is bad at:
Iterative algorithms (machine learning)
Interactive data mining
More general purpose than Hadoop MR
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
44/128
Introduction to Spark
In-memory distributed computing framework
Created by UC Berkeley AMP Lab in 2010
Targeted problems that MR is bad at:
Iterative algorithms (machine learning)
Interactive data mining
More general purpose than Hadoop MR
Active contributions from ~ 15 companies
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
45/128
HDFS
Map
Reduce
Map
Reduce
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
46/128
HDFS
Map
Reduce
Map
Reduce
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
47/128
HDFS
Map
Reduce
Map
Reduce
Data Source
map()
join()
Sour
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
48/128
HDFS
Map
Reduce
Map
Reduce
Data Source
map()
join()
Sour
cache()
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
49/128
HDFS
Map
Reduce
Map
Reduce
Data Source
map()
join()
Sour
cache()
transform
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
50/128
Throughput: Memory is king
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
51/128
Throughput: Memory is king
0 37500 75000 112500 15000
C*, cold cache
C*, warm cache
Spark RDD
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
52/128
Throughput: Memory is king
0 37500 75000 112500 15000
C*, cold cache
C*, warm cache
Spark RDD
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
53/128
Throughput: Memory is king
0 37500 75000 112500 15000
C*, cold cache
C*, warm cache
Spark RDD
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
54/128
Throughput: Memory is king
0 37500 75000 112500 15000
C*, cold cache
C*, warm cache
Spark RDD
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
55/128
Developers love it
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
56/128
Developers love it
I wrote my first aggregation job in 30 minutes
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
57/128
Developers love it
I wrote my first aggregation job in 30 minutes
High level distributed collections API
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
58/128
Developers love it
I wrote my first aggregation job in 30 minutes
High level distributed collections API
No Hadoop cruft
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
59/128
Developers love it
I wrote my first aggregation job in 30 minutes
High level distributed collections API
No Hadoop cruft
Full power of Scala, Java, Python
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
60/128
Developers love it
I wrote my first aggregation job in 30 minutes
High level distributed collections API
No Hadoop cruft
Full power of Scala, Java, Python
Interactive REPL shell
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
61/128
Developers love it
I wrote my first aggregation job in 30 minutes
High level distributed collections API
No Hadoop cruft
Full power of Scala, Java, Python
Interactive REPL shell
EASY testing!!
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
62/128
Developers love it
I wrote my first aggregation job in 30 minutes
High level distributed collections API
No Hadoop cruft
Full power of Scala, Java, Python
Interactive REPL shell
EASY testing!!
Low latency - quick development cycles
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
63/128
Spark word count example
1packageorg.myorg;
2
3importjava.io.IOException;
4importjava.util.*;
5
6importorg.apache.hadoop.fs.Path;
7importorg.apache.hadoop.conf.*;
8importorg.apache.hadoop.io.*;
9importorg.apache.hadoop.mapreduce.*;
10importorg.apache.hadoop.mapreduce.lib.input.FileInputFormat;
11importorg.apache.hadoop.mapreduce.lib.input.TextInputFormat;
12importorg.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
13importorg.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
14
15publicclassWordCount {
16
17 publicstaticclassMap extendsMapper
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
64/128
Spark word count example
file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
1packageorg.myorg;
2
3importjava.io.IOException;
4importjava.util.*;
5
6importorg.apache.hadoop.fs.Path;
7importorg.apache.hadoop.conf.*;
8importorg.apache.hadoop.io.*;
9importorg.apache.hadoop.mapreduce.*;
10importorg.apache.hadoop.mapreduce.lib.input.FileInputFormat;
11importorg.apache.hadoop.mapreduce.lib.input.TextInputFormat;
12importorg.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
13importorg.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
14
15publicclassWordCount {
16
17 publicstaticclassMap extendsMapper
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
65/128
The Spark Ecosystem
Spark
Tachyon- in-memory caching DFS
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
66/128
The Spark Ecosystem
Bagel-Pregel onSpark
Spark
Tachyon- in-memory caching DFS
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
67/128
The Spark Ecosystem
Bagel-Pregel onSpark
HIVE on Spark
Spark
Tachyon- in-memory caching DFS
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
68/128
The Spark Ecosystem
Bagel-Pregel onSpark
HIVE on Spark
Spark Streaming-discretized streamprocessing
Spark
Tachyon- in-memory caching DFS
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
69/128
Shark - HIVE on Spark
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
70/128
Shark - HIVE on Spark
100% HiveQL compatible
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
71/128
Shark - HIVE on Spark
100% HiveQL compatible
10-100x faster than HIVE, answers in seconds
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
72/128
Shark - HIVE on Spark
100% HiveQL compatible
10-100x faster than HIVE, answers in seconds
Reuse UDFs, SerDes, StorageHandlers
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
73/128
Shark - HIVE on Spark
100% HiveQL compatible
10-100x faster than HIVE, answers in seconds
Reuse UDFs, SerDes, StorageHandlers
Can use DSE / CassandraFS for Metastore
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
74/128
Shark - HIVE on Spark
100% HiveQL compatible
10-100x faster than HIVE, answers in seconds
Reuse UDFs, SerDes, StorageHandlers
Can use DSE / CassandraFS for Metastore
Easy Scala/Java integration via Spark - easier thanwriting UDFs
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
75/128
Our new analytics architecture
How we integrate Cassandra and Spark/Shark
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
76/128
From raw events to fast queries
RawEvents
RawEvents
RawEvents
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
77/128
From raw events to fast queries
IngestionC*
eventstore
RawEvents
RawEvents
RawEvents
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
78/128
From raw events to fast queries
IngestionC*
eventstore
RawEvents
RawEvents
RawEvents
Spark
Spark
Spark
View 1
View 2
View 3
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
79/128
From raw events to fast queries
IngestionC*
eventstore
RawEvents
RawEvents
RawEvents
Spark
Spark
Spark
View 1
View 2
View 3
SparkPq
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
80/128
From raw events to fast queries
IngestionC*
eventstore
RawEvents
RawEvents
RawEvents
Spark
Spark
Spark
View 1
View 2
View 3
Spark
Shark
Pq
AH
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
81/128
Our Spark/Shark/Cassandra Stack
Node1
Cassandra
Node2
Cassandra
Nod
Cassa
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
82/128
Our Spark/Shark/Cassandra Stack
Node1
Cassandra
InputFormat
SerDe
Node2
Cassandra
InputFormat
SerDe
Nod
Cassa
InputF
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
83/128
Our Spark/Shark/Cassandra Stack
Node1
Cassandra
InputFormat
SerDe
SparkWorker
Shark
Node2
Cassandra
InputFormat
SerDe
SparkWorker
Shark
Nod
Cassa
InputF
SparkWorker
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
84/128
Our Spark/Shark/Cassandra Stack
Node1
Cassandra
InputFormat
SerDe
SparkWorker
Shark
Node2
Cassandra
InputFormat
SerDe
SparkWorker
Shark
Nod
Cassa
InputF
SparkWorker
Spark Master
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
85/128
Our Spark/Shark/Cassandra Stack
Node1
Cassandra
InputFormat
SerDe
SparkWorker
Shark
Node2
Cassandra
InputFormat
SerDe
SparkWorker
Shark
Nod
Cassa
InputF
SparkWorker
Spark Master Job Server
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
86/128
Event Store Cassandra schema
t0 t1 t2 t3 t4
2013-04-05T00:00Z#id1
{event0:
a0}
{event1:
a1}
{event2:
a2}
{event3:
a3}
{even
a4}
Event CF
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
87/128
Event Store Cassandra schema
t0 t1 t2 t3 t4
2013-04-05T00:00Z#id1
{event0:
a0}
{event1:
a1}
{event2:
a2}
{event3:
a3}
{even
a4}
ipaddr:10.20.30.40:t1 videoId:45678:t1 providerId:50
2013-04-05T00:00Z#id1
Event CF
EventAttr CF
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
88/128
Unpacking raw events
t0 t12013-04-05T00:00Z#id1
{video: 10,
type:5}
{video: 11,
type:1}
2013-04-05T00:00Z#id2
{video: 20,
type:5}
{video: 25,
type:9}
UserID Video Tyid1 10 5
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
89/128
Unpacking raw events
t0 t12013-04-05T00:00Z#id1
{video: 10,
type:5}
{video: 11,
type:1}
2013-04-05T00:00Z#id2
{video: 20,
type:5}
{video: 25,
type:9}
UserID Video Tyid1 10 5
id1 11 1
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
90/128
Unpacking raw events
t0 t12013-04-05T00:00Z#id1
{video: 10,
type:5}
{video: 11,
type:1}
2013-04-05T00:00Z#id2
{video: 20,
type:5}
{video: 25,
type:9}
UserID Video Tyid1 10 5
id1 11 1
id2 20 5
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
91/128
Unpacking raw events
t0 t12013-04-05T00:00Z#id1
{video: 10,
type:5}
{video: 11,
type:1}
2013-04-05T00:00Z#id2
{video: 20,
type:5}
{video: 25,
type:9}
UserID Video Tyid1 10 5
id1 11 1
id2 20 5
id2 25 9
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
92/128
Tips for InputFormat Development
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
93/128
Tips for InputFormat Development
Know which target platforms you are developing for
Which API to write against? New? Old? Both?
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
94/128
Tips for InputFormat Development
Know which target platforms you are developing for
Which API to write against? New? Old? Both?
Be prepared to spend time tuning your split computation
Low latency jobs require fast splits
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
95/128
Tips for InputFormat Development
Know which target platforms you are developing for
Which API to write against? New? Old? Both?
Be prepared to spend time tuning your split computation
Low latency jobs require fast splits
Consider sorting row keys by token for data locality
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
96/128
Tips for InputFormat Development
Know which target platforms you are developing for
Which API to write against? New? Old? Both?
Be prepared to spend time tuning your split computation
Low latency jobs require fast splits
Consider sorting row keys by token for data locality
Implement predicate pushdown for HIVE SerDes
Use your indexes to reduce size of dataset
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
97/128
Example: OLAP processing
t0
2013-04
-05T00:
00Z#id1
{video:
10,
type:5}
2013-04
-05T00:00Z#id2
{video:
20,type:5}
C* events
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
98/128
Example: OLAP processing
t0
2013-04
-05T00:
00Z#id1
{video:
10,
type:5}
2013-04
-05T00:00Z#id2
{video:
20,type:5}
C* events
OLAPAggregates
OLAPAggregates
OLAPAggregates
Cached Materialized Views
Spark
Spark
Spark
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
99/128
Example: OLAP processing
t0
2013-04
-05T00:
00Z#id1
{video:
10,
type:5}
2013-04
-05T00:00Z#id2
{video:
20,type:5}
C* events
OLAPAggregates
OLAPAggregates
OLAPAggregates
Cached Materialized Views
Spark
Spark
Spark
Union
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
100/128
Example: OLAP processing
t0
2013-04
-05T00:
00Z#id1
{video:
10,
type:5}
2013-04
-05T00:00Z#id2
{video:
20,type:5}
C* events
OLAPAggregates
OLAPAggregates
OLAPAggregates
Cached Materialized Views
Spark
Spark
Spark
Union
Query 1:by Provid
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
101/128
Example: OLAP processing
t0
2013-04
-05T00:
00Z#id1
{video:
10,
type:5}
2013-04
-05T00:00Z#id2
{video:
20,type:5}
C* events
OLAPAggregates
OLAPAggregates
OLAPAggregates
Cached Materialized Views
Spark
Spark
Spark
Union
Query 1:by Provid
Query 2:
content mobile
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
102/128
Performance numbers
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
103/128
Performance numbers
Spark: C* -> OLAP aggregatescold cache, 1.4 million events
130 seconds
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
104/128
Performance numbers
Spark: C* -> OLAP aggregatescold cache, 1.4 million events
130 seconds
C* -> OLAP aggregateswarmed cache
20-30 seconds
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
105/128
Performance numbers
Spark: C* -> OLAP aggregatescold cache, 1.4 million events
130 seconds
C* -> OLAP aggregateswarmed cache
20-30 seconds
OLAP aggregate query via Spark
(56k records)
60 ms
6-node C*/DSE 1.1.9 cluster,
Spark 0.7.0
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
106/128
OLAP WorkFlow
Aggregation JobSparkExecutors
REST Job Server
Aggregate
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
107/128
OLAP WorkFlow
Aggregation JobSparkExecutors
Cassandra
REST Job Server
Aggregate
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
108/128
OLAP WorkFlow
DatasetAggregation JobSparkExecutors
Cassandra
REST Job Server
Aggregate
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
109/128
OLAP WorkFlow
DatasetAggregation Job Query JobSparkExecutors
Cassandra
REST Job Server
Aggregate Query
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
110/128
OLAP WorkFlow
DatasetAggregation Job Query JobSparkExecutors
Cassandra
REST Job Server
Aggregate Query
Result
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
111/128
OLAP WorkFlow
DatasetAggregation Job Query JobSparkExecutors
Cassandra
REST Job Server
Query Job
Aggregate Query
Result
Query
Result
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
112/128
Fault Tolerance
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
113/128
Fault Tolerance
Cached dataset lives in Java Heap only - what if process dies?
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
114/128
Fault Tolerance
Cached dataset lives in Java Heap only - what if process dies?
Spark lineage - automatic recomputation from source, but this isexpensive!
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
115/128
Fault Tolerance
Cached dataset lives in Java Heap only - what if process dies?
Spark lineage - automatic recomputation from source, but this isexpensive!
Can also replicate cached dataset to survive single node failures
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
116/128
Fault Tolerance
Cached dataset lives in Java Heap only - what if process dies?
Spark lineage - automatic recomputation from source, but this isexpensive!
Can also replicate cached dataset to survive single node failures
Persist materialized views back to C*, then load into cache -- nowrecovery path is much faster
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
117/128
Fault Tolerance
Cached dataset lives in Java Heap only - what if process dies?
Spark lineage - automatic recomputation from source, but this isexpensive!
Can also replicate cached dataset to survive single node failures
Persist materialized views back to C*, then load into cache -- nowrecovery path is much faster
Persistence also enables multiple processes to hold cached dataset
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
118/128
Demo time
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
119/128
Shark Demo
Local shark node, 1 core, MBP
How to create a table from C* using our inputformat
Creating a cached Shark table
Running fast queries
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
120/128
Creating a Shark Table from InputFormat
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
121/128
Creating a cached table
Tuesday, June 18, 13
Querying cached table
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
122/128
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
123/128
THANK YOU
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
124/128
THANK YOU
@evanfchan
Tuesday, June 18, 13
-
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
125/128
THANK YOU
@evanfchan
Tuesday, June 18, 13
mailto:[email protected]:[email protected] -
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
126/128
THANK YOU
@evanfchan
Tuesday, June 18, 13
mailto:[email protected]:[email protected] -
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
127/128
THANK YOU
@evanfchan
WE ARE HIRING!!
Tuesday, June 18, 13
mailto:[email protected]:[email protected] -
8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01
128/128
Spark: Under the hood
Map DatasetReduce Map
Driver Map DatasetReduce Map
Map DatasetReduce Map
One executor process per node