cassandra2013sparktalkfinal-130618122907-phpapp01

Upload: linux87s

Post on 01-Jun-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    1/128

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    2/128

    Who is this guy

    Sta!Engineer, Compute and Data Services, Ooyala

    Building multiple web-scale real-time systems on top of C*, Kafka,Storm, etc.

    Scala/Akka guy

    Very excited by open source, big data projects - share some today

    @evanfchan

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    3/128

    Agenda

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    4/128

    Agenda

    Ooyala and Cassandra

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    5/128

    Agenda

    Ooyala and Cassandra

    What problem are we trying to solve?

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    6/128

    Agenda

    Ooyala and Cassandra

    What problem are we trying to solve?

    Spark and Shark

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    7/128

    Agenda

    Ooyala and Cassandra

    What problem are we trying to solve?

    Spark and Shark

    Our Spark/Cassandra Architecture

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    8/128

    Agenda

    Ooyala and Cassandra

    What problem are we trying to solve?

    Spark and Shark

    Our Spark/Cassandra Architecture

    Demo

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    9/128

    Cassandra at Ooyala

    Who is Ooyala, and how we use Cassandra

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    10/128

    OOYALAPowering personalizedvideo

    experiences across all screens.

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    11/128

    Founded in 2007

    Commercially launch in 2009

    230+ employees in Silicon Valley, LA, NYC,London, Paris, Tokyo, Sydney & Guadalajara

    Global footprint, 200M unique users,110+ countries, and more than 6,000 websites

    Over 1 billion videos played per monthand 2 billion analytic events per day

    25% of U.S. online viewers watch videopowered by Ooyala

    COMPANY OVERVIEW

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    12/128

    TRUSTED VIDEO PARTNER

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    13/128

    We are a large Cassandra user

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    14/128

    We are a large Cassandra user

    11 clusters ranging in size from 3 to 36 nodes

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    15/128

    We are a large Cassandra user

    11 clusters ranging in size from 3 to 36 nodes

    Total of 28TB of data managed over ~85 nodes

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    16/128

    We are a large Cassandra user

    11 clusters ranging in size from 3 to 36 nodes

    Total of 28TB of data managed over ~85 nodes

    Over 2 billion C* column writes per day

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    17/128

    We are a large Cassandra user

    11 clusters ranging in size from 3 to 36 nodes

    Total of 28TB of data managed over ~85 nodes

    Over 2 billion C* column writes per day

    Powers all of our analytics infrastructure

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    18/128

    We are a large Cassandra user

    11 clusters ranging in size from 3 to 36 nodes

    Total of 28TB of data managed over ~85 nodes

    Over 2 billion C* column writes per day

    Powers all of our analytics infrastructure

    Much much bigger cluster coming soon

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    19/128

    What problem are we trying tosolve?

    Lots of data, complex queries, answered really quickly... but how??

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    20/128

    From mountains of useless data...

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    21/128

    To nuggets of truth...

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    22/128

    To nuggets of truth...

    Quickly

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    23/128

    To nuggets of truth...

    Quickly

    Painlessly

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    24/128

    To nuggets of truth...

    Quickly

    Painlessly

    At scale?

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    25/128

    Today: Precomputed aggregates

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    26/128

    Today: Precomputed aggregates

    Video metrics computed along several high cardinality dimensions

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    27/128

    Today: Precomputed aggregates

    Video metrics computed along several high cardinality dimensions

    Very fast lookups, but inflexible, and hard to change

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    28/128

    Today: Precomputed aggregates

    Video metrics computed along several high cardinality dimensions

    Very fast lookups, but inflexible, and hard to change

    Most computed aggregates are never read

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    29/128

    Today: Precomputed aggregates

    Video metrics computed along several high cardinality dimensions

    Very fast lookups, but inflexible, and hard to change

    Most computed aggregates are never read

    What if we need more dynamic queries?

    Top content for mobile users in France

    Engagement curves for users who watched recommendations Data mining, trends, machine learning

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    30/128

    The static - dynamic continuum

    Super fast lookups

    Inflexible, wasteful

    Best for 80% mostcommon queries

    Always compute resultsfrom raw data

    Flexible but slow

    100% Precomputation 100% Dynamic

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    31/128

    The static - dynamic continuum

    Super fast lookups

    Inflexible, wasteful

    Best for 80% mostcommon queries

    Always compute resultsfrom raw data

    Flexible but slow

    100% Precomputation100% Dynamic

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    32/128

    Where we want to be

    Partly dynamic

    Pre-aggregate mostcommon queries

    Flexible, fast dynamicqueries

    Easily generate manymaterialized views

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    33/128

    Industry Trends

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    34/128

    Industry Trends

    Fast execution frameworks

    Impala

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    35/128

    Industry Trends

    Fast execution frameworks

    Impala

    In-memory databases

    VoltDB, Druid

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    36/128

    Industry Trends

    Fast execution frameworks

    Impala

    In-memory databases

    VoltDB, Druid

    Streaming and real-time

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    37/128

    Industry Trends

    Fast execution frameworks

    Impala

    In-memory databases

    VoltDB, Druid

    Streaming and real-time

    Higher-level, productive data frameworks Cascading, Hive, Pig

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    38/128

    Why Spark and Shark?

    Lightning-fast in-memory cluster computing

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    39/128

    Introduction to Spark

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    40/128

    Introduction to Spark

    In-memory distributed computing framework

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    41/128

    Introduction to Spark

    In-memory distributed computing framework

    Created by UC Berkeley AMP Lab in 2010

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    42/128

    Introduction to Spark

    In-memory distributed computing framework

    Created by UC Berkeley AMP Lab in 2010

    Targeted problems that MR is bad at:

    Iterative algorithms (machine learning)

    Interactive data mining

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    43/128

    Introduction to Spark

    In-memory distributed computing framework

    Created by UC Berkeley AMP Lab in 2010

    Targeted problems that MR is bad at:

    Iterative algorithms (machine learning)

    Interactive data mining

    More general purpose than Hadoop MR

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    44/128

    Introduction to Spark

    In-memory distributed computing framework

    Created by UC Berkeley AMP Lab in 2010

    Targeted problems that MR is bad at:

    Iterative algorithms (machine learning)

    Interactive data mining

    More general purpose than Hadoop MR

    Active contributions from ~ 15 companies

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    45/128

    HDFS

    Map

    Reduce

    Map

    Reduce

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    46/128

    HDFS

    Map

    Reduce

    Map

    Reduce

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    47/128

    HDFS

    Map

    Reduce

    Map

    Reduce

    Data Source

    map()

    join()

    Sour

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    48/128

    HDFS

    Map

    Reduce

    Map

    Reduce

    Data Source

    map()

    join()

    Sour

    cache()

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    49/128

    HDFS

    Map

    Reduce

    Map

    Reduce

    Data Source

    map()

    join()

    Sour

    cache()

    transform

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    50/128

    Throughput: Memory is king

    6-node C*/DSE 1.1.9 cluster,

    Spark 0.7.0

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    51/128

    Throughput: Memory is king

    0 37500 75000 112500 15000

    C*, cold cache

    C*, warm cache

    Spark RDD

    6-node C*/DSE 1.1.9 cluster,

    Spark 0.7.0

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    52/128

    Throughput: Memory is king

    0 37500 75000 112500 15000

    C*, cold cache

    C*, warm cache

    Spark RDD

    6-node C*/DSE 1.1.9 cluster,

    Spark 0.7.0

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    53/128

    Throughput: Memory is king

    0 37500 75000 112500 15000

    C*, cold cache

    C*, warm cache

    Spark RDD

    6-node C*/DSE 1.1.9 cluster,

    Spark 0.7.0

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    54/128

    Throughput: Memory is king

    0 37500 75000 112500 15000

    C*, cold cache

    C*, warm cache

    Spark RDD

    6-node C*/DSE 1.1.9 cluster,

    Spark 0.7.0

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    55/128

    Developers love it

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    56/128

    Developers love it

    I wrote my first aggregation job in 30 minutes

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    57/128

    Developers love it

    I wrote my first aggregation job in 30 minutes

    High level distributed collections API

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    58/128

    Developers love it

    I wrote my first aggregation job in 30 minutes

    High level distributed collections API

    No Hadoop cruft

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    59/128

    Developers love it

    I wrote my first aggregation job in 30 minutes

    High level distributed collections API

    No Hadoop cruft

    Full power of Scala, Java, Python

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    60/128

    Developers love it

    I wrote my first aggregation job in 30 minutes

    High level distributed collections API

    No Hadoop cruft

    Full power of Scala, Java, Python

    Interactive REPL shell

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    61/128

    Developers love it

    I wrote my first aggregation job in 30 minutes

    High level distributed collections API

    No Hadoop cruft

    Full power of Scala, Java, Python

    Interactive REPL shell

    EASY testing!!

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    62/128

    Developers love it

    I wrote my first aggregation job in 30 minutes

    High level distributed collections API

    No Hadoop cruft

    Full power of Scala, Java, Python

    Interactive REPL shell

    EASY testing!!

    Low latency - quick development cycles

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    63/128

    Spark word count example

    1packageorg.myorg;

    2

    3importjava.io.IOException;

    4importjava.util.*;

    5

    6importorg.apache.hadoop.fs.Path;

    7importorg.apache.hadoop.conf.*;

    8importorg.apache.hadoop.io.*;

    9importorg.apache.hadoop.mapreduce.*;

    10importorg.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    11importorg.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    12importorg.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    13importorg.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    14

    15publicclassWordCount {

    16

    17 publicstaticclassMap extendsMapper

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    64/128

    Spark word count example

    file = spark.textFile("hdfs://...")

    file.flatMap(line => line.split(" "))

    .map(word => (word, 1))

    .reduceByKey(_ + _)

    1packageorg.myorg;

    2

    3importjava.io.IOException;

    4importjava.util.*;

    5

    6importorg.apache.hadoop.fs.Path;

    7importorg.apache.hadoop.conf.*;

    8importorg.apache.hadoop.io.*;

    9importorg.apache.hadoop.mapreduce.*;

    10importorg.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    11importorg.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    12importorg.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    13importorg.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    14

    15publicclassWordCount {

    16

    17 publicstaticclassMap extendsMapper

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    65/128

    The Spark Ecosystem

    Spark

    Tachyon- in-memory caching DFS

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    66/128

    The Spark Ecosystem

    Bagel-Pregel onSpark

    Spark

    Tachyon- in-memory caching DFS

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    67/128

    The Spark Ecosystem

    Bagel-Pregel onSpark

    HIVE on Spark

    Spark

    Tachyon- in-memory caching DFS

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    68/128

    The Spark Ecosystem

    Bagel-Pregel onSpark

    HIVE on Spark

    Spark Streaming-discretized streamprocessing

    Spark

    Tachyon- in-memory caching DFS

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    69/128

    Shark - HIVE on Spark

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    70/128

    Shark - HIVE on Spark

    100% HiveQL compatible

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    71/128

    Shark - HIVE on Spark

    100% HiveQL compatible

    10-100x faster than HIVE, answers in seconds

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    72/128

    Shark - HIVE on Spark

    100% HiveQL compatible

    10-100x faster than HIVE, answers in seconds

    Reuse UDFs, SerDes, StorageHandlers

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    73/128

    Shark - HIVE on Spark

    100% HiveQL compatible

    10-100x faster than HIVE, answers in seconds

    Reuse UDFs, SerDes, StorageHandlers

    Can use DSE / CassandraFS for Metastore

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    74/128

    Shark - HIVE on Spark

    100% HiveQL compatible

    10-100x faster than HIVE, answers in seconds

    Reuse UDFs, SerDes, StorageHandlers

    Can use DSE / CassandraFS for Metastore

    Easy Scala/Java integration via Spark - easier thanwriting UDFs

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    75/128

    Our new analytics architecture

    How we integrate Cassandra and Spark/Shark

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    76/128

    From raw events to fast queries

    RawEvents

    RawEvents

    RawEvents

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    77/128

    From raw events to fast queries

    IngestionC*

    eventstore

    RawEvents

    RawEvents

    RawEvents

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    78/128

    From raw events to fast queries

    IngestionC*

    eventstore

    RawEvents

    RawEvents

    RawEvents

    Spark

    Spark

    Spark

    View 1

    View 2

    View 3

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    79/128

    From raw events to fast queries

    IngestionC*

    eventstore

    RawEvents

    RawEvents

    RawEvents

    Spark

    Spark

    Spark

    View 1

    View 2

    View 3

    SparkPq

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    80/128

    From raw events to fast queries

    IngestionC*

    eventstore

    RawEvents

    RawEvents

    RawEvents

    Spark

    Spark

    Spark

    View 1

    View 2

    View 3

    Spark

    Shark

    Pq

    AH

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    81/128

    Our Spark/Shark/Cassandra Stack

    Node1

    Cassandra

    Node2

    Cassandra

    Nod

    Cassa

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    82/128

    Our Spark/Shark/Cassandra Stack

    Node1

    Cassandra

    InputFormat

    SerDe

    Node2

    Cassandra

    InputFormat

    SerDe

    Nod

    Cassa

    InputF

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    83/128

    Our Spark/Shark/Cassandra Stack

    Node1

    Cassandra

    InputFormat

    SerDe

    SparkWorker

    Shark

    Node2

    Cassandra

    InputFormat

    SerDe

    SparkWorker

    Shark

    Nod

    Cassa

    InputF

    SparkWorker

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    84/128

    Our Spark/Shark/Cassandra Stack

    Node1

    Cassandra

    InputFormat

    SerDe

    SparkWorker

    Shark

    Node2

    Cassandra

    InputFormat

    SerDe

    SparkWorker

    Shark

    Nod

    Cassa

    InputF

    SparkWorker

    Spark Master

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    85/128

    Our Spark/Shark/Cassandra Stack

    Node1

    Cassandra

    InputFormat

    SerDe

    SparkWorker

    Shark

    Node2

    Cassandra

    InputFormat

    SerDe

    SparkWorker

    Shark

    Nod

    Cassa

    InputF

    SparkWorker

    Spark Master Job Server

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    86/128

    Event Store Cassandra schema

    t0 t1 t2 t3 t4

    2013-04-05T00:00Z#id1

    {event0:

    a0}

    {event1:

    a1}

    {event2:

    a2}

    {event3:

    a3}

    {even

    a4}

    Event CF

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    87/128

    Event Store Cassandra schema

    t0 t1 t2 t3 t4

    2013-04-05T00:00Z#id1

    {event0:

    a0}

    {event1:

    a1}

    {event2:

    a2}

    {event3:

    a3}

    {even

    a4}

    ipaddr:10.20.30.40:t1 videoId:45678:t1 providerId:50

    2013-04-05T00:00Z#id1

    Event CF

    EventAttr CF

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    88/128

    Unpacking raw events

    t0 t12013-04-05T00:00Z#id1

    {video: 10,

    type:5}

    {video: 11,

    type:1}

    2013-04-05T00:00Z#id2

    {video: 20,

    type:5}

    {video: 25,

    type:9}

    UserID Video Tyid1 10 5

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    89/128

    Unpacking raw events

    t0 t12013-04-05T00:00Z#id1

    {video: 10,

    type:5}

    {video: 11,

    type:1}

    2013-04-05T00:00Z#id2

    {video: 20,

    type:5}

    {video: 25,

    type:9}

    UserID Video Tyid1 10 5

    id1 11 1

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    90/128

    Unpacking raw events

    t0 t12013-04-05T00:00Z#id1

    {video: 10,

    type:5}

    {video: 11,

    type:1}

    2013-04-05T00:00Z#id2

    {video: 20,

    type:5}

    {video: 25,

    type:9}

    UserID Video Tyid1 10 5

    id1 11 1

    id2 20 5

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    91/128

    Unpacking raw events

    t0 t12013-04-05T00:00Z#id1

    {video: 10,

    type:5}

    {video: 11,

    type:1}

    2013-04-05T00:00Z#id2

    {video: 20,

    type:5}

    {video: 25,

    type:9}

    UserID Video Tyid1 10 5

    id1 11 1

    id2 20 5

    id2 25 9

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    92/128

    Tips for InputFormat Development

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    93/128

    Tips for InputFormat Development

    Know which target platforms you are developing for

    Which API to write against? New? Old? Both?

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    94/128

    Tips for InputFormat Development

    Know which target platforms you are developing for

    Which API to write against? New? Old? Both?

    Be prepared to spend time tuning your split computation

    Low latency jobs require fast splits

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    95/128

    Tips for InputFormat Development

    Know which target platforms you are developing for

    Which API to write against? New? Old? Both?

    Be prepared to spend time tuning your split computation

    Low latency jobs require fast splits

    Consider sorting row keys by token for data locality

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    96/128

    Tips for InputFormat Development

    Know which target platforms you are developing for

    Which API to write against? New? Old? Both?

    Be prepared to spend time tuning your split computation

    Low latency jobs require fast splits

    Consider sorting row keys by token for data locality

    Implement predicate pushdown for HIVE SerDes

    Use your indexes to reduce size of dataset

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    97/128

    Example: OLAP processing

    t0

    2013-04

    -05T00:

    00Z#id1

    {video:

    10,

    type:5}

    2013-04

    -05T00:00Z#id2

    {video:

    20,type:5}

    C* events

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    98/128

    Example: OLAP processing

    t0

    2013-04

    -05T00:

    00Z#id1

    {video:

    10,

    type:5}

    2013-04

    -05T00:00Z#id2

    {video:

    20,type:5}

    C* events

    OLAPAggregates

    OLAPAggregates

    OLAPAggregates

    Cached Materialized Views

    Spark

    Spark

    Spark

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    99/128

    Example: OLAP processing

    t0

    2013-04

    -05T00:

    00Z#id1

    {video:

    10,

    type:5}

    2013-04

    -05T00:00Z#id2

    {video:

    20,type:5}

    C* events

    OLAPAggregates

    OLAPAggregates

    OLAPAggregates

    Cached Materialized Views

    Spark

    Spark

    Spark

    Union

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    100/128

    Example: OLAP processing

    t0

    2013-04

    -05T00:

    00Z#id1

    {video:

    10,

    type:5}

    2013-04

    -05T00:00Z#id2

    {video:

    20,type:5}

    C* events

    OLAPAggregates

    OLAPAggregates

    OLAPAggregates

    Cached Materialized Views

    Spark

    Spark

    Spark

    Union

    Query 1:by Provid

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    101/128

    Example: OLAP processing

    t0

    2013-04

    -05T00:

    00Z#id1

    {video:

    10,

    type:5}

    2013-04

    -05T00:00Z#id2

    {video:

    20,type:5}

    C* events

    OLAPAggregates

    OLAPAggregates

    OLAPAggregates

    Cached Materialized Views

    Spark

    Spark

    Spark

    Union

    Query 1:by Provid

    Query 2:

    content mobile

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    102/128

    Performance numbers

    6-node C*/DSE 1.1.9 cluster,

    Spark 0.7.0

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    103/128

    Performance numbers

    Spark: C* -> OLAP aggregatescold cache, 1.4 million events

    130 seconds

    6-node C*/DSE 1.1.9 cluster,

    Spark 0.7.0

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    104/128

    Performance numbers

    Spark: C* -> OLAP aggregatescold cache, 1.4 million events

    130 seconds

    C* -> OLAP aggregateswarmed cache

    20-30 seconds

    6-node C*/DSE 1.1.9 cluster,

    Spark 0.7.0

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    105/128

    Performance numbers

    Spark: C* -> OLAP aggregatescold cache, 1.4 million events

    130 seconds

    C* -> OLAP aggregateswarmed cache

    20-30 seconds

    OLAP aggregate query via Spark

    (56k records)

    60 ms

    6-node C*/DSE 1.1.9 cluster,

    Spark 0.7.0

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    106/128

    OLAP WorkFlow

    Aggregation JobSparkExecutors

    REST Job Server

    Aggregate

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    107/128

    OLAP WorkFlow

    Aggregation JobSparkExecutors

    Cassandra

    REST Job Server

    Aggregate

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    108/128

    OLAP WorkFlow

    DatasetAggregation JobSparkExecutors

    Cassandra

    REST Job Server

    Aggregate

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    109/128

    OLAP WorkFlow

    DatasetAggregation Job Query JobSparkExecutors

    Cassandra

    REST Job Server

    Aggregate Query

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    110/128

    OLAP WorkFlow

    DatasetAggregation Job Query JobSparkExecutors

    Cassandra

    REST Job Server

    Aggregate Query

    Result

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    111/128

    OLAP WorkFlow

    DatasetAggregation Job Query JobSparkExecutors

    Cassandra

    REST Job Server

    Query Job

    Aggregate Query

    Result

    Query

    Result

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    112/128

    Fault Tolerance

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    113/128

    Fault Tolerance

    Cached dataset lives in Java Heap only - what if process dies?

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    114/128

    Fault Tolerance

    Cached dataset lives in Java Heap only - what if process dies?

    Spark lineage - automatic recomputation from source, but this isexpensive!

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    115/128

    Fault Tolerance

    Cached dataset lives in Java Heap only - what if process dies?

    Spark lineage - automatic recomputation from source, but this isexpensive!

    Can also replicate cached dataset to survive single node failures

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    116/128

    Fault Tolerance

    Cached dataset lives in Java Heap only - what if process dies?

    Spark lineage - automatic recomputation from source, but this isexpensive!

    Can also replicate cached dataset to survive single node failures

    Persist materialized views back to C*, then load into cache -- nowrecovery path is much faster

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    117/128

    Fault Tolerance

    Cached dataset lives in Java Heap only - what if process dies?

    Spark lineage - automatic recomputation from source, but this isexpensive!

    Can also replicate cached dataset to survive single node failures

    Persist materialized views back to C*, then load into cache -- nowrecovery path is much faster

    Persistence also enables multiple processes to hold cached dataset

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    118/128

    Demo time

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    119/128

    Shark Demo

    Local shark node, 1 core, MBP

    How to create a table from C* using our inputformat

    Creating a cached Shark table

    Running fast queries

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    120/128

    Creating a Shark Table from InputFormat

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    121/128

    Creating a cached table

    Tuesday, June 18, 13

    Querying cached table

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    122/128

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    123/128

    THANK YOU

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    124/128

    THANK YOU

    @evanfchan

    Tuesday, June 18, 13

  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    125/128

    THANK YOU

    @evanfchan

    [email protected]

    Tuesday, June 18, 13

    mailto:[email protected]:[email protected]
  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    126/128

    THANK YOU

    @evanfchan

    [email protected]

    Tuesday, June 18, 13

    mailto:[email protected]:[email protected]
  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    127/128

    THANK YOU

    @evanfchan

    [email protected]

    WE ARE HIRING!!

    Tuesday, June 18, 13

    mailto:[email protected]:[email protected]
  • 8/9/2019 cassandra2013sparktalkfinal-130618122907-phpapp01

    128/128

    Spark: Under the hood

    Map DatasetReduce Map

    Driver Map DatasetReduce Map

    Map DatasetReduce Map

    One executor process per node