datastax & o'reilly media: large scale data analytics with spark and cassandra on the...
Post on 11-Apr-2017
Large Scale Data Analytics
Ryan Knight @Knight_Cloud
Solution Engineer - DataStax
Paco Nathan @pacoid
Evil Mad Scientist - O’Reilly Media
Demo of Streaming in the Real World - Spark At Scale Project
© 2015. All Rights Reserved.
•Based on Real World Use Cases
•Simulate a real world streaming use case
•Test throughput of Spark Streaming
•Best Practices for scaling
•https://github.com/retroryan/SparkAtScale
Spark At Scale Demo Application
DataStax Enterprise Platform
Data Modeling using Event Sourcing
•Append-Only Logging
•Database of Facts
•Snapshots or Roll-Ups
•Why Delete Data any more?
•Replay Events
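The event-sourcing ideas above (an append-only log of facts, roll-ups, and replay) can be illustrated with a small sketch. It is not taken from the talk: the `RatingAdded` event, its fields, and the `Stats` roll-up are hypothetical names chosen to match the ratings theme of the demo, and an in-memory `Vector` stands in for a Cassandra table.

```scala
object EventSourcingSketch {
  // Hypothetical event type; an event is an immutable fact that is never updated or deleted.
  final case class RatingAdded(userId: Int, movieId: Int, rating: Double)

  // Hypothetical roll-up kept per movie: number of ratings and their sum.
  final case class Stats(count: Long, sum: Double) {
    def add(r: Double): Stats = Stats(count + 1, sum + r)
    def mean: Double = if (count == 0) 0.0 else sum / count
  }

  // A snapshot is the roll-up state at some point in the log.
  type Snapshot = Map[Int, Stats]

  // Append-only logging: the only write operation adds a new event at the end.
  def append(log: Vector[RatingAdded], e: RatingAdded): Vector[RatingAdded] = log :+ e

  // Replay: fold events over a starting snapshot (empty by default) to rebuild state.
  def replay(events: Seq[RatingAdded], from: Snapshot = Map.empty): Snapshot =
    events.foldLeft(from) { (snap, e) =>
      snap.updated(e.movieId, snap.getOrElse(e.movieId, Stats(0, 0.0)).add(e.rating))
    }
}
```

Replaying the full log from an empty snapshot and replaying only the newer events on top of a saved snapshot give the same state, which is why roll-ups can be rebuilt at any time without deleting the underlying facts.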
Scala for Large Scale Data Analytics
•Functional Paradigm is ideal for Data Analytics
•Strongly Typed - Enforce Schema at Every Layer
•Immutable by Default - Event Logging
•Declarative instead of Imperative - Focus on Transformation not Implementation
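As a rough illustration of these points, the sketch below parses raw text into a typed, immutable case class and then describes the analysis as a chain of transformations. The `Rating` type, the comma-separated input format, and the function names are assumptions made for this example, not code from the talk.

```scala
import scala.util.Try

object TypedPipeline {
  // Hypothetical schema: an immutable, strongly typed record.
  final case class Rating(userId: Int, movieId: Int, stars: Double)

  // Enforce the schema at the boundary: malformed lines become None instead of bad data.
  def parse(line: String): Option[Rating] = line.split(',') match {
    case Array(u, m, s) => Try(Rating(u.trim.toInt, m.trim.toInt, s.trim.toDouble)).toOption
    case _              => None
  }

  // Declarative: state what to compute (mean stars per movie), not how to loop.
  def meanByMovie(lines: Seq[String]): Map[Int, Double] =
    lines
      .flatMap(line => parse(line))
      .groupBy(_.movieId)
      .map { case (movieId, ratings) => movieId -> ratings.map(_.stars).sum / ratings.size }
}
```

The same style of `flatMap`, `groupBy`, and `map` calls carries over to distributed collections, which is one reason the functional style fits data analytics well.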
Key to Scaling - Configuring Kafka Topics
•Number of Partitions per Topic: Sets the Degree of Parallelism
•Directly Affects Spark Streaming Parallelism
•bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 5 --topic ratings
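To make the link between partition count and parallelism concrete, here is a small sketch that spreads records over a fixed number of partitions. The round-robin rule and both function names are assumptions for illustration only; this is not how Kafka's own default partitioner chooses a partition.

```scala
object PartitionSketch {
  // Hypothetical helper: pick a partition for the i-th record in round-robin order.
  def partitionFor(recordIndex: Long, numPartitions: Int): Int =
    (recordIndex % numPartitions).toInt

  // Group records by their assigned partition; each group could be read by one parallel task.
  def assign[A](records: Seq[A], numPartitions: Int): Map[Int, Seq[A]] =
    records.zipWithIndex
      .groupBy { case (_, i) => partitionFor(i.toLong, numPartitions) }
      .map { case (partition, pairs) => partition -> pairs.map(_._1) }
}
```

With the 5 partitions created by the command above, 12 records end up in 5 groups, so at most 5 tasks can read the topic at the same time.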
Populating Kafka Topics
val record = new ProducerRecord[String, String](feederExtension.kafkaTopic, partNum, key, nxtRating.toString)
val future = feederExtension.producer.send(record, new Callback { … })  // callback body is not shown on the slide
Spark Streaming with Kafka Direct Approach
•Use Kafka Direct Approach (No Receivers)
•Queries Kafka Directly
•Automatically Parallelizes based on Kafka Partitions
•Exactly Once Processing - Only Move Offset after Processing
•Resiliency without copying data
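The "only move offset after processing" idea can be sketched without Spark or Kafka. In the example below an in-memory `Partition` stands in for a Kafka partition, and `processBatch` is a hypothetical function, not part of any Spark or Kafka API. Note that after a failure the same range is read again, so for end-to-end exactly-once results the output step generally also needs to be idempotent or transactional.

```scala
import scala.util.{Failure, Success, Try}

object OffsetSketch {
  // In-memory stand-in for one Kafka partition: an ordered, immutable log of records.
  final case class Partition(log: Vector[String])

  // Process the records in [committed, until) and return the new committed offset.
  // The offset advances only if the whole batch was handled without an exception.
  def processBatch(p: Partition, committed: Int, until: Int)(handle: String => Unit): Int = {
    val batch = p.log.slice(committed, until)
    Try(batch.foreach(handle)) match {
      case Success(_) => math.min(until, p.log.length)
      case Failure(_) => committed // offset not moved; the same range is read again on retry
    }
  }
}
```

Because the log itself is the durable copy, a failed batch can be recovered by re-reading the same offset range instead of keeping a second copy of the data.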
Spark Streaming Monitoring
Processing Time > Batch Duration = Total Delay Grows
Out Of Memory Errors
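A short worked calculation shows why the delay grows. The model below is a simplification assumed for illustration: batches arrive at a fixed interval, each takes the same processing time, and a batch cannot start before the previous one finishes. The function name is hypothetical and the numbers are not measurements from the demo.

```scala
object BatchDelaySketch {
  // Scheduling delay (in seconds) seen by each of the first `batches` batches.
  // Each batch waits for the backlog left by its predecessor:
  //   delay(k + 1) = max(0, delay(k) + processingTime - batchInterval)
  def schedulingDelays(batchInterval: Double, processingTime: Double, batches: Int): Seq[Double] =
    (1 until batches)
      .scanLeft(0.0) { (delay, _) => math.max(0.0, delay + processingTime - batchInterval) }
      .take(batches)
}
```

With a 5 second batch interval and 6 seconds of processing per batch, every batch starts one second later than the one before it, so the backlog of unprocessed data keeps growing. With 4 seconds of processing the delay stays at zero.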
© 2014 DataStax, All Rights Reserved.
Confidential
DataStax Enterprise Platform Workload Segregation w/out ETL
Cassandra Mode: OLTP Database
Analytics Mode: Streaming and Analytics
Search Mode: All Data Searchable
[Diagram: cluster ring with nodes labeled C* / C (Cassandra), S (Search) and A (Analytics)]
DataStax Analytics
•Simplified Deployment and Management
•HA Spark Master with automatic leader election
•Detects when Spark Master is down with gossip
•Uses Paxos to elect Spark Master
•Stores Spark Worker metadata in Cassandra
•No need to run ZooKeeper
Spark Notebook
[Diagram: a Cassandra Cluster with Spark Connector (nodes labeled C*, C and A) connected to a Spark Notebook Server hosting several notebooks]
Apache Spark Notebook
•Reactive / Dynamic Graphs based on Scala, SQL and DataFrames
•Spark Streaming
•Example notebooks covering visualization, machine learning, streaming, graph analysis, genomics analysis
•SVG / Sliders - interactive graphs
•Tune and Configure Each Notebook Separately
•https://github.com/andypetrella/spark-notebook
databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html
Demo: Twitter Streaming Language Classifier
Streaming: collect tweets
Twitter API
HDFS: dataset
Spark SQL: ETL, queries
MLlib: train classifier
Spark: featurize
HDFS: model
Streaming: score tweets
language filter
Demo: Twitter Streaming Language Classifier
[Diagram: pipeline variant with two stores labeled Cassandra]
1. extract text from the tweet
https://twitter.com/andy_bf/status/16222269370011648
"Ceci n'est pas un tweet"
2. sequence text as bigrams
tweet.sliding(2).toSeq → ("Ce", "ec", "ci", …)
3. convert bigrams into numbers
seq.map(_.hashCode()) → (2178, 3230, 3174, …)
4. index into sparse tf vector
seq.map(_.hashCode() % 1000) → (178, 230, 174, …)
5. increment feature count
Vectors.sparse(1000, …) → (1000, [102, 104, …], [0.0455, 0.0455, …])
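The five steps above can be sketched in plain Scala. This is a minimal illustration rather than the demo's own `Utils.featurize`: the object and function names are made up for this example, a `Map[Int, Double]` stands in for an MLlib sparse vector, and `Math.floorMod` is used as a precaution so that an index can never be negative.

```scala
object BigramFeatures {
  // Size of the hashed feature space, matching the 1000 used in the steps above.
  val NumFeatures = 1000

  // Step 2: sequence the text as overlapping two-character bigrams.
  def bigrams(text: String): Seq[String] = text.sliding(2).toSeq

  // Steps 3 and 4: hash a bigram and fold the hash into an index in [0, NumFeatures).
  def bucket(bigram: String): Int = Math.floorMod(bigram.hashCode, NumFeatures)

  // Step 5: count bigrams per index, normalised by the number of bigrams.
  // Only non-zero entries are stored, which is what makes the vector sparse.
  def featurize(text: String): Map[Int, Double] = {
    val grams = bigrams(text)
    val weight = 1.0 / grams.length
    grams.groupBy(b => bucket(b)).map { case (index, hits) => index -> hits.size * weight }
  }
}
```

For the sample tweet, `BigramFeatures.bucket("Ce")` is 178, in line with step 4. The text has 22 bigrams, so each occurrence contributes 1/22 ≈ 0.0455 to its index, matching the values shown in step 5.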
Demo: Twitter Streaming Language Classifier
From tweets to ML features, approximated as sparse vectors:
Demo: Twitter Streaming Language Classifier
Sample Code + Output: https://github.com/retroryan/twitter_classifier
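import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// modelFile, clust and Utils are not defined on the slide; presumably they come from the sample code linked above.
val sc = new SparkContext(new SparkConf())
// Reuse the existing SparkContext and process tweets in 5-second batches.
val ssc = new StreamingContext(sc, Seconds(5))
val tweets = TwitterUtils.createStream(ssc, Utils.getAuth)
val statuses = tweets.map(_.getText)
// Load the previously trained cluster centers.
val model = new KMeansModel(ssc.sparkContext.objectFile[Vector](modelFile.toString).collect())
// Keep only the tweets assigned to the chosen cluster.
val filteredTweets = statuses.filter(t => model.predict(Utils.featurize(t)) == clust)
filteredTweets.print()
ssc.start()
ssc.awaitTermination()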
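The filter above relies on `model.predict` returning the index of the cluster a tweet belongs to. For a k-means model this means finding the closest cluster center. The plain-Scala sketch below shows that idea only: it uses dense arrays and hypothetical function names, and it is not MLlib's implementation.

```scala
object NearestCenter {
  // Squared Euclidean distance between two points of the same dimension.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.indices.map { i => val d = a(i) - b(i); d * d }.sum
  }

  // Index of the center closest to the point; this index plays the role of the cluster id.
  def predict(centers: Seq[Array[Double]], point: Array[Double]): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), point))
}
```

A tweet passes the filter when the index returned for its feature vector equals the cluster being watched.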
CLUSTER 1:
TLあんまり見ないけど@くれたっらいつでもくっるよ٩(δωδ)۶
そういえばディスガイアも今日か
CLUSTER 4:
قالوا العروبه روحت بعد صدامواقول مع سلمان تحيى العروبه
RT @vip588: √ للمتواجدين االن √ زيادة متابعني √ فولو مي vip588
فولو باك √ رتويت للتغريدة √ فولو للي عمل رتويت √ اللي ما يلتزم ما √… بيستفيدن سورة