DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Paris 2015


Upload: nosqlmatters

Post on 19-Jul-2015


TRANSCRIPT

Page 1: DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Paris 2015

@doanduyhai

Real time data processing with Spark & Cassandra DuyHai DOAN, Technical Advocate

Page 2

Who Am I?
Duy Hai DOAN, Cassandra technical advocate
•  talks, meetups, confs
•  open-source devs (Achilles, …)
•  OSS Cassandra point of contact

[email protected] @doanduyhai

Page 3

DataStax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 200+ employees
•  Headquarters in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  DataStax Enterprise = OSS Cassandra + extra features

Page 4

Spark & Cassandra Integration

Spark & its eco-system
Cassandra & token ranges
Stand-alone cluster deployment

Page 5

What is Apache Spark?
Created at UC Berkeley's AMPLab
Open source since 2010, Apache project since 2013
General data processing framework
MapReduce is not the A & Ω
One-framework-many-components approach

Page 6

Spark characteristics

Fast
•  10x-100x faster than Hadoop MapReduce
•  In-memory storage
•  Single JVM process per node, multi-threaded

Easy
•  Rich Scala, Java and Python APIs (R is coming …)
•  2x-5x less code
•  Interactive shell

Page 7

Spark code example

Setup

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf(true)
  .setAppName("basic_example")
  .setMaster("local[3]")
val sc = new SparkContext(conf)

Data-set (can be from text, CSV, JSON, Cassandra, HDFS, …)

val people = List(("jdoe", "John DOE", 33),
                  ("hsue", "Helen SUE", 24),
                  ("rsmith", "Richard Smith", 33))

Page 8

RDDs

RDD = Resilient Distributed Dataset

import org.apache.spark.rdd.RDD

val parallelPeople: RDD[(String, String, Int)] = sc.parallelize(people)

// Key each person by age
val extractAge: RDD[(Int, (String, String, Int))] = parallelPeople
  .map(tuple => (tuple._3, tuple))

val groupByAge: RDD[(Int, Iterable[(String, String, Int)])] = extractAge.groupByKey()

// Count people per age. countByKey() counts records per key, so it is called on
// extractAge: on groupByAge every key appears once and each count would be 1.
val countByAge: scala.collection.Map[Int, Long] = extractAge.countByKey()
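As a rough local analogy (plain Scala collections, not Spark), the same key-by-age, group and count steps can be tried without a cluster. The object name AgeCountSketch is made up for illustration; the data is the people list from the previous page.

```scala
object AgeCountSketch {
  // Same sample data as the Spark example
  val people: List[(String, String, Int)] = List(
    ("jdoe", "John DOE", 33),
    ("hsue", "Helen SUE", 24),
    ("rsmith", "Richard Smith", 33))

  // Local stand-in for map(tuple => (tuple._3, tuple)) followed by groupByKey()
  val groupByAge: Map[Int, List[(String, String, Int)]] = people.groupBy(_._3)

  // Local stand-in for countByKey(): number of people per age
  val countByAge: Map[Int, Long] =
    groupByAge.map { case (age, persons) => age -> persons.size.toLong }
}
```

On a real RDD the grouping happens per partition and across workers; here everything runs in one JVM, so only the logic carries over.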

Page 9

RDDs

RDD[A] = distributed collection of A
•  RDD[Person]
•  RDD[(String,Int)], …

RDD[A] split into partitions
Partitions distributed over n workers → parallel computing
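To make "split into partitions, processed in parallel" concrete, here is a minimal single-JVM sketch in which threads stand in for Spark workers. It is not how Spark is implemented; PartitionSketch, partition and parallelSum are hypothetical names chosen for this example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PartitionSketch {
  // Split the data into at most numPartitions chunks of near-equal size
  def partition[A](data: Vector[A], numPartitions: Int): Vector[Vector[A]] = {
    val chunkSize = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    data.grouped(chunkSize).toVector
  }

  // Sum each partition in its own task, then combine the partial results
  def parallelSum(data: Vector[Int], numPartitions: Int): Int = {
    val partialSums = partition(data, numPartitions).map(p => Future(p.sum))
    Await.result(Future.sequence(partialSums), 10.seconds).sum
  }
}
```

Each partition is summed independently and only the small partial results are combined, which is the property that lets an RDD computation scale with the number of workers.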

Page 10

Spark eco-system

[Diagram: layered stack, top to bottom]
Spark Streaming, MLLib, GraphX, Spark SQL
Spark Core Engine (Scala/Java/Python)
Cluster Manager: Local, Standalone cluster, YARN, Mesos
Persistence


Page 12

What is Apache Cassandra?
Created at Facebook
Apache project since 2009
Distributed NoSQL database
Eventual consistency (A & P of the CAP theorem)
Distributed table abstraction

Page 13

Cassandra data distribution reminder
Random: hash of #partition → token = hash(#p)
Hash: ]-X, X]
X = huge number (2^64/2)

[Diagram: ring of 8 nodes, n1 to n8]

Page 14

Cassandra token ranges
A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]
Murmur3 hash function

[Diagram: ring of 8 nodes, n1 to n8, with token ranges A to H]
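The lookup from a token to its range can be sketched in a few lines. This follows the slide's simplified layout (eight equal ranges over ]0, X], with X = 2^64/2) and is not Cassandra's actual code; TokenRangeSketch and rangeOf are hypothetical names.

```scala
object TokenRangeSketch {
  // X = 2^64 / 2, the "huge number" from the data distribution slide
  val X: BigInt = BigInt(2).pow(64) / 2

  val labels: Vector[String] = Vector("A", "B", "C", "D", "E", "F", "G", "H")

  // Returns the label of the range ]k*X/8, (k+1)*X/8] that contains the token
  def rangeOf(token: BigInt): String = {
    require(token > 0 && token <= X, "token must be in ]0, X]")
    val width = X / 8
    // Ranges are open on the left and closed on the right, hence (token - 1)
    val index = ((token - 1) / width).toInt
    labels(index)
  }
}
```

For example, a token equal to X/8 still falls in range A, while X/8 + 1 is the first token of range B.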

Page 15

Linear scalability

[Diagram: ring of 8 nodes, n1 to n8, with token ranges A to H; partitions user_id1 to user_id5 spread over the ring]


Page 17

Cassandra Query Language (CQL)

INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
UPDATE users SET age = 34 WHERE login = 'jdoe';
DELETE age FROM users WHERE login = 'jdoe';
SELECT age FROM users WHERE login = 'jdoe';


Page 19

Why Spark on Cassandra?

For Spark
•  Reliable persistent store (HA)
•  Structured data (Cassandra CQL → Dataframe API)
•  Multi data-center!

For Cassandra
•  Cross-table operations (JOIN, UNION, etc.)
•  Real-time/batch processing
•  Complex analytics (e.g. machine learning)

Page 20

Use Cases
•  Load data from various sources
•  Analytics (join, aggregate, transform, …)
•  Sanitize, validate, normalize data
•  Schema migration, data conversion

Page 21

Cluster deployment

Stand-alone cluster

[Diagram: 5 nodes, each running Cassandra (C*) and a Spark worker (SparkW); one node also runs the Spark master (SparkM)]

Page 22

Cluster deployment

Cassandra – Spark placement
1 Cassandra process ⟷ 1 Spark worker

[Diagram: Driver Program and Spark Master; 4 Spark Workers, each with one Executor and one co-located Cassandra (C*) process]

Page 23

Spark & Cassandra Connector

Core API
SparkSQL
SparkStreaming

Page 24

Connector architecture
All Cassandra types supported and converted to Scala types
Server side data filtering (SELECT … WHERE …)
Use Java-driver underneath
Scala and Java support

Page 25

Connector architecture – Core API
Cassandra tables exposed as Spark RDDs
Read from and write to Cassandra

Mapping of C* tables and rows to Scala objects
•  CassandraRow
•  Scala case class (object mapper)
•  Scala tuples

Page 26

Connector architecture – Spark SQL

Mapping of Cassandra table to SchemaRDD
•  CassandraSQLRow → SparkRow
•  custom query plan
•  push predicates to CQL for early filtering

SELECT * FROM user_emails WHERE login = 'jdoe';

Page 27

Connector architecture – Spark Streaming

Streaming data INTO Cassandra table
•  trivial setup
•  be careful about your Cassandra data model!

Streaming data OUT of Cassandra tables?
•  work in progress …

Page 28

Connector API

Connector API
Data Locality Implementation

Page 29

Connector API

Connecting to Cassandra

// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.driver.spark._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10") // initial contact
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)

Page 30

Connector API

Preparing test data

CREATE TABLE test.words (word text PRIMARY KEY, count int);

INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);

Page 31

Connector API

Reading from Cassandra

// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames    // Stream(word, count)
rdd.size           // 2

val firstRow = rdd.first
// firstRow: CassandraRow = CassandraRow[word: bar, count: 30]

firstRow.getInt("count")  // Int = 30

Page 32

Connector API

Writing data to Cassandra

val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", Seq("word", "count"))

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

Page 33

Remember token ranges?
A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]

[Diagram: ring of 8 nodes, n1 to n8, with token ranges A to H]

Page 34

Data Locality

[Diagram: 5 nodes, each running C* and SparkW (one also runs SparkM); labels: Spark partition RDD, Cassandra token ranges]

Page 35

Data Locality

Use Murmur3Partitioner

[Diagram: 5 nodes, each running C* and SparkW (one also runs SparkM)]

Page 36

Read data locality

[Diagram labels: Read from Cassandra; Spark shuffle operations]

Page 37

Repartition before write

Write to Cassandra

rdd.repartitionByCassandraReplica("keyspace","table")

Page 38

Or async batch writes

[Diagram labels: Spark shuffle operations; Async batches fan-out writes to Cassandra]

Page 39

Write data locality
•  either stream data with Spark using repartitionByCassandraReplica()
•  or flush data to Cassandra by async batches
•  in any case, there will be data movement on the network (sorry, no magic)
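The idea behind repartitioning before a write can be sketched locally: group rows by the node that owns their partition key, so each group could then be written by a task running next to that node. This is only an illustration under loud assumptions: ownerOf uses String.hashCode as a stand-in for the Murmur3 token, and ReplicaGroupingSketch, ownerOf and groupByOwner are hypothetical names, not connector APIs.

```scala
object ReplicaGroupingSketch {
  // Hypothetical 8-node ring, named as in the token range diagrams
  val nodes: Vector[String] = Vector("n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8")

  // Stand-in for "token = hash(#partition), then find the owning node".
  // Assumption: String.hashCode replaces Murmur3 purely for illustration.
  def ownerOf(partitionKey: String): String =
    nodes(java.lang.Math.floorMod(partitionKey.hashCode, nodes.size))

  // Group (partitionKey, value) rows by owning node
  def groupByOwner[A](rows: Seq[(String, A)]): Map[String, Seq[(String, A)]] =
    rows.groupBy { case (key, _) => ownerOf(key) }
}
```

Building these groups is itself a shuffle: rows still travel over the network once, which matches the "no magic" remark above.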

Page 40

Joins with data locality

CREATE TABLE artists(name text, style text, … PRIMARY KEY(name));

CREATE TABLE albums(title text, artist text, year int,… PRIMARY KEY(title));

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist","year")
    .as((_:String, _:Int))
    // Repartition RDDs by "artists" PK, which is "name"
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    // Join with "artists" table, selecting only "name" and "country" columns
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))

Page 41

Joins pipeline with data locality

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist","year")
    .as((_:String, _:Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))
    .map(…)
    .filter(…)
    .groupByKey()
    .mapValues(…)
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS_RATINGS)
    .joinWithCassandraTable(KEYSPACE, ARTISTS_RATINGS)
    …

Page 42

Perfect data locality scenario
•  read locally from Cassandra
•  use operations that do not require shuffle in Spark (map, filter, …)
•  repartitionByCassandraReplica()
•  → to a table having the same partition key as the original table
•  save back into this Cassandra table

Page 43

Demo

https://github.com/doanduyhai/Cassandra-Spark-Demo

Page 44

What’s for the future?

DataStax Enterprise 4.7
•  Cassandra + Spark + Solr as your analytics platform

Filter out as much data as possible with Solr from Cassandra
Fetch the filtered data in Spark and perform aggregations
Save the final data back into Cassandra

Page 45

What’s for the future?

What about data locality?

Page 46

What’s for the future?

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist","year")
    .where("solr_query = 'style:*rock* AND ratings:[3 TO *]' ")
    .as((_:String, _:Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))
    .where("solr_query = 'age:[20 TO 30]' ")

1.  compute Spark partitions using Cassandra token ranges
2.  on each partition, use Solr for local data filtering (no fan out!)
3.  fetch data back into Spark for aggregations
4.  repeat 1 – 3 as many times as necessary

Page 47

What’s for the future?

SELECT … FROM …
WHERE token(#partition) > 3X/8
AND token(#partition) <= 4X/8
AND solr_query='full text search expression';

Token range D: ]3X/8, 4X/8]

Advantages of same JVM Cassandra + Solr integration
1.  Single-pass local full text search (no fan out)
2.  Data retrieval
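To illustrate how one such query per token range could be produced, here is a minimal string-building sketch. It assumes the simplified ranges ]k*X/n, (k+1)*X/n] from the token range slides and the DSE-specific solr_query predicate shown above; TokenRangeQuerySketch and queriesFor are hypothetical names, the connector generates its own queries internally, and real code should use bound parameters rather than string interpolation.

```scala
object TokenRangeQuerySketch {
  // Builds one CQL string per token range ]k*x/n, (k+1)*x/n], for k = 0 until n
  def queriesFor(table: String,
                 partitionKey: String,
                 solrQuery: String,
                 x: BigInt,
                 n: Int): Seq[String] =
    (0 until n).map { k =>
      val lower = x * BigInt(k) / BigInt(n)      // exclusive lower bound
      val upper = x * BigInt(k + 1) / BigInt(n)  // inclusive upper bound
      s"SELECT * FROM $table " +
        s"WHERE token($partitionKey) > $lower " +
        s"AND token($partitionKey) <= $upper " +
        s"AND solr_query = '$solrQuery';"
    }
}
```

With n = 8 and x = 2^64/2, the fourth generated query (index 3) covers range D, the one shown on this page.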

Page 48

Q & A

Page 49

Thank You @doanduyhai

[email protected]

https://academy.datastax.com/