Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015


Upload: mike-broberg

Post on 07-Apr-2017


TRANSCRIPT

Page 1: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

©2015 IBM Corporation

Spark + Watson + Twitter

DataPalooza SF 2015

David Taieb, STSM - IBM Cloud Data Services

Page 2: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Agenda
• Introduction
• Quick Introduction to Spark
• Set up development environment and create the hello world application
• Notebook Walk-through
• Spark Streaming
• Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer
• Architectural Overview
• Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub
• Create the Streaming Receiver to connect to Kafka (Scala)
• Create analytics using Jupyter Notebook (Python)
• Create Real-time Web Dashboard (Node.js)

Page 3: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015


Introduction

Page 4: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Introduction

Our mission: We are here to help developers realize their most ambitious projects.

Goals for today's session:
• Introduction to real-time analytics using Spark Streaming
• Technical deep dive on the Spark + Watson + Twitter sample application
• At the end of this session, you should be able to download the source code and run the application on IBM Analytics for Apache Spark


Page 6: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

What is Spark

Spark is an open source in-memory computing framework for distributed data processing and iterative analysis on massive data volumes

Page 7: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Spark Core Libraries
• Spark Core: general compute engine, handles distributed task dispatching, scheduling and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• MLlib (machine learning): common machine learning and statistical algorithms
• GraphX (graph): distributed graph processing framework

Page 8: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Key reasons for interest in Spark

Themes: Open Source; Fast distributed data processing; Productive; Web Scale

• In-memory storage greatly reduces disk I/O
• Up to 100x faster in memory, 10x faster on disk
• Largest project and one of the most active on Apache
• Vibrant, growing community of developers continuously improves the code base and extends capabilities
• Fast adoption in the enterprise (IBM, Databricks, etc.)
• Fault tolerant: seamlessly recomputes data lost to hardware failure
• Scalable: easily increase the number of worker nodes
• Flexible job execution: Batch, Streaming, Interactive
• Easily handles petabytes of data without special code handling
• Compatible with the existing Hadoop ecosystem
• Unified programming model across a range of use cases
• Rich and expressive APIs hide the complexities of parallel computing and worker node management
• Support for Java, Scala, Python and R: less code written
• Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX

Page 9: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

High level architecture

[Architecture diagram]
• Batch Job (Spark-Submit): Spark Application (driver) → Master (cluster manager) → Worker Nodes of the Spark Cluster
• Interactive Notebook: Browser → Http/WebSockets → Notebook Server → Kernel Protocol (e.g. ZeroMQ) → Kernel → Master (cluster manager) → Worker Nodes of the Spark Cluster
• Callout: RDD partitioning; task packaging and dispatching; worker node scheduling

Page 10: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Spark programming model lifecycle

1. Load data into RDDs
• In-memory collection:
- sc.parallelize
• Unstructured data:
- Text: sc.textFile
- HDFS: sc.hadoopFile
• Structured data:
- JSON: sqlCtxt.jsonFile
- Parquet: sqlCtxt.parquetFile
- JDBC: sqlCtxt.load
- Custom data source: 1.4+
• Streaming data:
- TwitterUtils.createStream
- KafkaUtils.createStream
- FlumeUtils.createStream
- MQTTUtils.createStream
- Custom DStream
• sc: SparkContext entry point, created by the application or automatically provided by the Notebook shell
• sqlCtxt: SQLContext entry point for working with DataFrames and executing SQL queries

2. Apply transformations into new RDDs
• Create new RDDs by applying transformations to existing ones
• map(fn): apply fn to all elements in the RDD
• flatMap(fn): same as map, but fn can return 0 or more elements
• filter(fn): select only elements for which fn returns true
• reduceByKey
• sortByKey
• sample: sample a fraction of the data
• union: combine elements of 2 RDDs
• intersection: intersect 2 RDDs
• distinct: remove duplicate elements
• ….

3. Apply actions (analytics) to produce results
• Produce results from running analytics against RDDs
• reduce(fn): perform a summary operation on the elements
• collect(): return all elements in an Array
• count(): count the number of elements in the RDD
• take(n): return the first n elements in an Array
• foreach(fn): execute fn on all the elements in the RDD
• saveAsTextFile: persist the elements in a text file
• ….
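The load, transform, act lifecycle above can be sketched without a cluster. The block below is a plain Python 3 analogue, not Spark code: a list stands in for an RDD, and the comments name the Spark operation each line is meant to mirror. The sample data is an assumption made for illustration.

```python
from functools import reduce

# Load: a plain list stands in for an in-memory collection
# (the role sc.parallelize plays in Spark).
data = list(range(1, 11))

# Transformations: each step builds a new collection from the previous one.
evens = [x for x in data if x % 2 == 0]       # analogue of filter(fn)
squares = [x * x for x in evens]              # analogue of map(fn)

# Actions: produce results from the transformed data.
total = reduce(lambda a, b: a + b, squares)   # analogue of reduce(fn)
count = len(squares)                          # analogue of count()
first_two = squares[:2]                       # analogue of take(2)

print(total, count, first_two)
```

In Spark the transformations are evaluated lazily and only run when an action is called; the list comprehensions here run eagerly, so this sketch only illustrates the shape of the pipeline.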

Page 11: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015


Job Scheduling

Page 12: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Ecosystem of IBM Analytics for Apache Spark as a service


Page 14: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Set up local development environment

• Prerequisites
- Scala runtime 2.10.4 http://www.scala-lang.org/download/2.10.4.html
- Homebrew http://brew.sh/
- Scala sbt http://www.scala-sbt.org/download.html
- Spark 1.3.1 http://www.apache.org/dyn/closer.lua/spark/spark-1.3.1/spark-1.3.1.tgz

• Detailed instructions here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/

Page 15: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Set up local development environment (contd.)

• Create a Scala project using sbt
• Create directories to start from scratch:

mkdir helloSpark && cd helloSpark
mkdir -p src/main/scala
mkdir -p src/main/java
mkdir -p src/main/resources

• Create a subdirectory under the src/main/scala directory, matching the package name used in the code:

mkdir -p com/ibm/cds/spark/samples

• GitHub URL for the same project: https://github.com/ibm-cds-labs/spark.samples

Page 16: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Set up local development environment (contd.)

• Create HelloSpark.scala using an IDE or a text editor
• Copy and paste this code snippet:

package com.ibm.cds.spark.samples

import org.apache.spark._

object HelloSpark {
    //main method invoked when running as a standalone Spark Application
    def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Hello Spark")
        val spark = new SparkContext(conf)

        println("Hello Spark Demo. Compute the mean and variance of a collection")
        val stats = computeStatsForCollection(spark);
        println(">>> Results: ")
        println(">>>>>>>Mean: " + stats._1 );
        println(">>>>>>>Variance: " + stats._2);
        spark.stop()
    }

    //Library method that can be invoked from Jupyter Notebook
    def computeStatsForCollection( spark: SparkContext, countPerPartitions: Int = 100000, partitions: Int=5): (Double, Double) = {
        val totalNumber = math.min( countPerPartitions * partitions, Long.MaxValue).toInt;
        val rdd = spark.parallelize( 1 until totalNumber,partitions);
        (rdd.mean(), rdd.variance())
    }
}
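To sanity-check what HelloSpark should print, the same statistics can be computed locally. The sketch below is plain Python 3 and only mirrors the arithmetic of computeStatsForCollection: the function name stats_for_range is hypothetical, and it assumes rdd.variance() is the population variance (divide by n), which is worth verifying against your Spark version.

```python
def stats_for_range(count_per_partition=100000, partitions=5):
    """Mean and population variance of the integers 1 .. totalNumber - 1."""
    total_number = count_per_partition * partitions
    # Mirrors the Scala range `1 until totalNumber` (upper bound excluded).
    values = range(1, total_number)
    n = len(values)
    mean = sum(values) / n
    # Population variance: divide by n, not n - 1 (assumed to match rdd.variance()).
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


# Small input so the result is easy to verify by hand: integers 1..49.
mean, variance = stats_for_range(count_per_partition=10, partitions=5)
print(mean, variance)
```

For m consecutive integers starting at 1, the mean is (m + 1) / 2 and the population variance is (m² - 1) / 12, which gives a closed-form cross-check for any input size.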

Page 17: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Set up local development environment (contd.)

• Create a file build.sbt under the project root directory:

name := "helloSpark"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= {
    val sparkVersion = "1.3.1"
    Seq(
        "org.apache.spark" %% "spark-core" % sparkVersion,
        "org.apache.spark" %% "spark-sql" % sparkVersion,
        "org.apache.spark" %% "spark-repl" % sparkVersion
    )
}

• Under the project root directory run:
- Download all dependencies: $ sbt update
- Compile: $ sbt compile
- Package an application jar file: $ sbt package

• Check for helloSpark_2.10-1.0.jar under the project root directory (sbt places packaged jars in target/scala-2.10)

Page 18: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015


Hello World application on Bluemix Apache Starter


Page 20: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Introduction to Notebooks
‣ Notebooks allow creation of interactive executable documents that include rich text with Markdown, executable code with Scala, Python or R, and graphics with matplotlib
‣ Apache Spark provides APIs in multiple languages that can be executed with a REPL shell: Scala, Python (PySpark), R
‣ Multiple open-source implementations are available:
- Jupyter: https://jupyter.org
- Apache Zeppelin: http://zeppelin-project.org

Page 21: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Notebook walkthrough

‣ Sign up on Bluemix: https://console.ng.bluemix.net/registration/
‣ Getting started with Analytics for Apache Spark: https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html
‣ You can also follow the tutorial here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/


Page 24: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Spark Streaming
‣ "Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams" (http://spark.apache.org/docs/latest/streaming-programming-guide.html)

‣ Breaks down the streaming data into smaller pieces (micro-batches), which are then sent to the Spark engine
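The idea of breaking a stream into smaller pieces can be illustrated with a few lines of plain Python 3. This is a conceptual sketch, not how Spark Streaming is implemented: the function micro_batches, the (timestamp, payload) event shape and the one-second interval are all assumptions made for the example.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Events whose timestamps fall in the same interval share a batch index.
        batches[int(timestamp // batch_interval)].append(payload)
    # Return the non-empty batches in time order.
    return [batches[index] for index in sorted(batches)]


# Hypothetical events: (seconds since start, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, batch_interval=1.0)
print(batches)
```

Each inner list plays the role of one micro-batch handed to the engine; a shorter interval lowers latency at the cost of more, smaller batches.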

Page 25: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Spark Streaming
‣ Provides connectors for multiple data sources:
- Kafka
- Flume
- Twitter
- MQTT
- ZeroMQ

‣ Provides an API to create custom connectors. Lots of examples are available on GitHub and spark-packages.org


Page 27: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Spark + Twitter + Watson application
‣ Use Spark Streaming in combination with IBM Watson to perform sentiment analysis and track how a conversation is trending on Twitter.

‣ Use Spark Streaming to create a feed that captures live tweets from Twitter. You can optionally filter the tweets that contain the hashtag(s) of your choice.

‣ The tweet data is then enriched in real time with various sentiment scores provided by the Watson Tone Analyzer service (available on Bluemix). This service provides insight into sentiment, or how the author feels.

‣ The data is then loaded and analyzed by the data scientist within a Notebook.

‣ We can also use streaming analytics to feed a real-time web app dashboard.

Page 28: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

About this sample application

• GitHub: https://github.com/ibm-cds-labs/spark.samples/tree/master/streaming-twitter
• Tutorial: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags

• A word about Scala
- Scala is object-oriented but also supports a functional programming style
- Bi-directional interoperability with Java
- Resources:
- Official web site: http://scala-lang.org
- Excellent first steps site: http://www.artima.com/scalazine/articles/steps.html
- Free e-books: http://readwrite.com/2011/04/30/5-free-b-books-and-tutorials-o


Page 30: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Spark Streaming with "IBM Insight for Twitter" and "Watson Tone Analyzer"

[Architecture diagram]
• Services: Watson Tone Analyzer Service (Bluemix), Message Hub Service (Bluemix), Event Hub Service (Bluemix), Full Archive Search API
• Data flow labels: Producer Stream; Enrich data with Emotion Tone Scores; Processed data; Consumer Stream; Consumer Spark Topics; Publish topics from Spark analytics results
• Tools: Scala Notebook, IPython Notebook, Real-Time Dashboard
• Roles: Data Engineer, Data Scientist, Business Analyst / C-Suite

Page 31: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Building a Spark Streaming application: Sentiment analysis with Twitter and Watson Tone Analyzer

‣ Configure Twitter and Watson Tone Analyzer
1. Configure OAuth credentials for Twitter
2. Create a Watson Tone Analyzer Service on Bluemix
3. Configure MessageHub Service on Bluemix (Kafka)
4. Configure EventHub Service on Bluemix

Page 32: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Configure OAuth credentials for Twitter
‣ Follow the steps at https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#twitter

Page 33: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Create a Watson Tone Analyzer Service on Bluemix
‣ Follow the steps at https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#bluemix

Page 34: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Building a Spark Streaming application: Sentiment analysis with Twitter and Watson Tone Analyzer

‣ Work with Twitter data
1. Create a Twitter Stream
2. Enrich the data with sentiment analysis from Watson Tone Analyzer
3. Aggregate data into RDD with enriched Data model
4. Create SparkSQL DataFrame and register Table

Page 35: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Create a Twitter Stream

Create a map that stores the credentials for the Twitter and Watson services:

//Hold configuration key/value pairs
val config = Map[String, String](
    ("twitter4j.oauth.consumerKey", Option(System.getProperty("twitter4j.oauth.consumerKey")).orNull ),
    ("twitter4j.oauth.consumerSecret", Option(System.getProperty("twitter4j.oauth.consumerSecret")).orNull ),
    ("twitter4j.oauth.accessToken", Option(System.getProperty("twitter4j.oauth.accessToken")).orNull ),
    ("twitter4j.oauth.accessTokenSecret", Option(System.getProperty("twitter4j.oauth.accessTokenSecret")).orNull ),
    ("tweets.key", Option(System.getProperty("tweets.key")).getOrElse("")),
    ("watson.tone.url", Option(System.getProperty("watson.tone.url")).orNull ),
    ("watson.tone.username", Option(System.getProperty("watson.tone.username")).orNull ),
    ("watson.tone.password", Option(System.getProperty("watson.tone.password")).orNull )
)

Twitter4j requires credentials to be stored in System properties:

config.foreach( (t:(String,String)) =>
    if ( t._1.startsWith( "twitter4j") ) System.setProperty( t._1, t._2 )
)

Page 36: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Create a Twitter Stream

//Filter the tweets to only keep the ones with English as the language
//twitterStream is a discretized stream of twitter4j Status objects
var twitterStream = org.apache.spark.streaming.twitter.TwitterUtils.createStream( ssc, None )
    .filter { status =>
        Option(status.getUser).flatMap[String] {
            u => Option(u.getLang)
        }.getOrElse("").startsWith("en") && //Allow only tweets that use "en" as the language
        CharMatcher.ASCII.matchesAllOf(status.getText) && //Only pick text that is ASCII
        ( keys.isEmpty || keys.exists{status.getText.contains(_)}) //If the user specified #hashtags to monitor
    }

Result: initial DStream of Status objects
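The three conditions in the filter above (English language, ASCII-only text, optional hashtag match) can be tried out on sample data before wiring up a stream. The sketch below is plain Python 3 rather than the Scala from the slides; the dictionary shape with user_lang and text keys and the function name keep_tweet are hypothetical stand-ins for the twitter4j Status object.

```python
def keep_tweet(tweet, keys):
    """Return True when a tweet passes the same three checks as the Scala filter."""
    lang = tweet.get("user_lang") or ""           # a missing language behaves like ""
    text = tweet["text"]
    is_english = lang.startswith("en")            # "en", "en-gb", ...
    is_ascii = all(ord(ch) < 128 for ch in text)  # reject any non-ASCII character
    # With no keys configured, every tweet matches; otherwise one key must appear.
    matches_key = not keys or any(key in text for key in keys)
    return is_english and is_ascii and matches_key


# Hypothetical sample tweets.
tweets = [
    {"user_lang": "en", "text": "Learning #spark today"},
    {"user_lang": "fr", "text": "Bonjour #spark"},
    {"user_lang": "en", "text": "caf\u00e9 and #spark"},
    {"user_lang": None, "text": "#spark with no language"},
    {"user_lang": "en-gb", "text": "hello world"},
]
kept = [t["text"] for t in tweets if keep_tweet(t, ["#spark"])]
kept_without_keys = [t["text"] for t in tweets if keep_tweet(t, [])]
print(kept, kept_without_keys)
```

Note that the ASCII check drops any tweet containing accented letters or emoji, which is a deliberate simplification in the sample application rather than a general recommendation.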

Page 37: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Enrich the data with sentiment analysis from Watson Tone Analyzer

//Broadcast the config to each worker node
val broadcastVar = sc.broadcast(config)

val rowTweets = twitterStream.map(status=> {
    lazy val client = PooledHttp1Client()
    val sentiment = callToneAnalyzer(client, status,
        broadcastVar.value.get("watson.tone.url").get,
        broadcastVar.value.get("watson.tone.username").get,
        broadcastVar.value.get("watson.tone.password").get
    )
    …
})

Input: initial DStream of Status objects

Page 38: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Enrich the data with sentiment analysis from Watson Tone Analyzer

Initial DStream of Status objects → DStream of key,value pairs

Data Model
 |-- author: string (nullable = true)
 |-- date: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- text: string (nullable = true)
 |-- lat: integer (nullable = true)
 |-- long: integer (nullable = true)
 |-- Cheerfulness: double (nullable = true)
 |-- Negative: double (nullable = true)
 |-- Anger: double (nullable = true)
 |-- Analytical: double (nullable = true)
 |-- Confident: double (nullable = true)
 |-- Tentative: double (nullable = true)
 |-- Openness: double (nullable = true)
 |-- Agreeableness: double (nullable = true)
 |-- Conscientiousness: double (nullable = true)

Page 39: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Aggregate data into RDD with enriched Data model

//Aggregate the data from each DStream into the working RDD
rowTweets.foreachRDD( rdd => {
    if ( rdd.count() > 0 ){
        workingRDD = sc.parallelize( rdd.map( t => t._1 ).collect()).union( workingRDD )
    }
})

[Diagram] Microbatches (Initial DStream RowTweets, Initial DStream RowTweets, Initial DStream RowTweets, …) → workingRDD (Row 1, Row 2, Row 3, Row 4, …, Row n)

workingRDD Data Model
 |-- author: string (nullable = true)
 |-- date: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- text: string (nullable = true)
 |-- lat: integer (nullable = true)
 |-- long: integer (nullable = true)
 |-- Cheerfulness: double (nullable = true)
 |-- Negative: double (nullable = true)
 |-- Anger: double (nullable = true)
 |-- Analytical: double (nullable = true)
 |-- Confident: double (nullable = true)
 |-- Tentative: double (nullable = true)
 |-- Openness: double (nullable = true)
 |-- Agreeableness: double (nullable = true)
 |-- Conscientiousness: double (nullable = true)

Page 40: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Create SparkSQL DataFrame and register Table

//Create a SparkSQL DataFrame from the aggregate workingRDD
val df = sqlContext.createDataFrame( workingRDD, schemaTweets )
//Register a temporary table using the name "tweets"
df.registerTempTable("tweets")
println("A new table named tweets with " + df.count() + " records has been correctly created and can be accessed through the SQLContext variable")
println("Here's the schema for tweets")
df.printSchema()
(sqlContext, df)

[Diagram] workingRDD (Row 1, Row 2, Row 3, Row 4, …, Row n) → Relational SparkSQL Table

author      | date               | lang | … | Cheerfulness | Negative | … | Conscientiousness
John Smith  | 10/11/2015 – 20:18 | en   | … | 0.0          | 65.8     | … | 25.5
Alfred      | …                  | en   | … | 34.5         | 0.0      | … | 100.0
…           | …                  | …    | … | …            | …        | … | …
Chris       | …                  | en   | … | 85.3         | 22.9     | … | 0.0

Page 41: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Building a Spark Streaming application: Sentiment analysis with Twitter and Watson Tone Analyzer

‣ IPython Notebook analysis
1. Load the data into an IPython Notebook
2. Analytic 1: Compute the distribution of tweets by sentiment scores greater than 60%
3. Analytic 2: Compute the top 10 hashtags contained in the tweets
4. Analytic 3: Visualize aggregated sentiment scores for the top 5 hashtags

Page 42: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Load the data into an IPython Notebook
‣ Follow the steps here: https://github.com/ibm-cds-labs/spark.samples/blob/master/streaming-twitter/notebook/Twitter%20%2B%20Watson%20Tone%20Analyzer%20Part%202.ipynb

1. Create a SQLContext from a SparkContext
2. Load from a parquet file and create a DataFrame
3. Create a SQL table and start executing SQL queries

Page 43: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%

#create an array that will hold the count for each sentiment
sentimentDistribution=[0] * 9
#For each sentiment, run a sql query that counts the number of tweets for which the sentiment score is greater than 60%
#Store the data in the array
for i, sentiment in enumerate(tweets.columns[-9:]):
    sentimentDistribution[i]=sqlContext.sql("SELECT count(*) as sentCount FROM tweets where " + sentiment + " > 60")\
        .collect()[0].sentCount
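The per-sentiment counting query can be tried without a Spark cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for the SQLContext, with a three-column subset of the data model and made-up scores; the table contents and the shortened column list are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite table standing in for the "tweets" SparkSQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (author TEXT, Cheerfulness REAL, Negative REAL, Anger REAL)"
)
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?, ?)",
    [
        ("John Smith", 0.0, 65.8, 10.0),   # hypothetical scores
        ("Alfred", 34.5, 0.0, 20.0),
        ("Chris", 85.3, 72.9, 0.0),
    ],
)

sentiments = ["Cheerfulness", "Negative", "Anger"]
sentiment_distribution = [0] * len(sentiments)
for i, sentiment in enumerate(sentiments):
    # Column names come from the fixed list above, never from user input,
    # so building the query by concatenation is acceptable here.
    query = "SELECT count(*) AS sentCount FROM tweets WHERE " + sentiment + " > 60"
    sentiment_distribution[i] = conn.execute(query).fetchone()[0]

print(sentiment_distribution)
```

Running one query per column keeps the code close to the notebook version; a single query with one conditional sum per column would scan the table only once.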

Page 44: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015


Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%

Use matplotlib to create a bar chart

Page 45: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015


Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%

Bar Chart Visualization

Page 46: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 2: Compute the top 10 hashtags contained in the tweets

[Pipeline diagram]
Stages: Initial Tweets RDD → Filter hashtags → Key, value pair RDD → Reduced map with counts → Sorted Map by key
Operations: flatMap, filter, map, reduceByKey, sortByKey
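The five operations in the pipeline can be walked through on a handful of strings. The following is a plain Python 3 analogue, not PySpark: the function top_hashtags and the sample tweets are hypothetical, and each comment names the Spark operation the line is meant to mirror.

```python
from collections import Counter


def top_hashtags(texts, n=10):
    """Return the n most frequent hashtags as (tag, count) pairs."""
    words = [w for text in texts for w in text.split()]  # flatMap: tweet -> words
    tags = [w for w in words if w.startswith("#")]       # filter: keep hashtags
    pairs = [(tag, 1) for tag in tags]                   # map: (key, value) pairs
    counts = Counter()
    for tag, one in pairs:                               # reduceByKey: sum per tag
        counts[tag] += one
    # Sorting step: highest count first, ties broken by tag name.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


texts = [
    "Watching the #SuperBloodMoon tonight",
    "#SuperBloodMoon and #Supermoon",
    "no tags here",
    "#Supermoon #SuperBloodMoon",
]
top = top_hashtags(texts)
print(top)
```

In Spark, sorting by count with sortByKey requires the count to be the key, so a typical pipeline swaps each (tag, count) pair before sorting; the sketch sidesteps that by sorting on the count directly.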

Page 47: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015


Analytic 2: Compute the top 10 hashtags contained in the tweets

Page 48: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015


Analytic 2: Compute the top 10 hashtags contained in the tweets

Page 49: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

‣ Problem:
- Compute the mean of each emotion score for each of the top 10 hashtags
- Format the data in a way that can be consumed by the plot script

Page 50: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 1: Create RDD from tweets dataframe
tagsRDD = tweets.map(lambda t: t )

tweets (Type: DataFrame)

author        | … | Cheerfulness
Jake          | … | 0.0
Scrad         | … | 23.5
Nittya Indika | … | 84.0
…             | … | …
Madison       | … | 93.0

tagsRDD (Type: RDD)

Row(author=u'Jake', …, text=u'@sarahwag…', Cheerfulness=0.0, …)
Row(author=u'Scrad', …, text=u'#SuperBloodMoon https://t…', Cheerfulness=23.5, …)
Row(author=u'Nittya Indika', …, text=u'Good mornin! http://t.…', Cheerfulness=84.0, …)
Row(author=u'Madison', …, text=u'how many nights…', Cheerfulness=93.0, …)

Page 51: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 2: Filter to only keep the entries that are in top10tags
tagsRDD = tagsRDD.filter( lambda t: any(s in t.text for s in [i[0] for i in top10tags] ) )

Before:
Row(author=u'Jake', …, text=u'@sarahwag…', Cheerfulness=0.0, …)
Row(author=u'Scrad', …, text=u'#SuperBloodMoon https://t…', Cheerfulness=23.5, …)
Row(author=u'Nittya Indika', …, text=u'Good mornin! http://t.…', Cheerfulness=84.0, …)
……
Row(author=u'Madison', …, text=u'how many nights…', Cheerfulness=93.0, …)

After:
Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0)
Row(author=u'Meng_tisoy', text=u'…hihi #ALDUBThisMustBeLove https://t….', …, Conscientiousness=68.0)
Row(author=u'Kevin Contreras', text=u'…SILA! #ALDUBThisMustBeLove', …, Conscientiousness=68.0)
……
Row(author=u'abbi', text=u'…excited #ALDUBThisMustBeLove https://t…', …, Conscientiousness=100.0)

Page 52: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 3: Create a flatMap using the expand function defined below, this will be used to collect all the scores
#for a particular tag with the following format: Tag-Tone:ToneScore
cols = tweets.columns[-9:]
def expand( t ):
    ret = [ ]
    for s in [i[0] for i in top10tags]:
        if ( s in t.text ):
            for tone in cols:
                ret += [s + u"-" + unicode(tone) + ":" + unicode(getattr(t, tone))]
    return ret
tagsRDD = tagsRDD.flatMap( expand )

Before:
Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0)
Row(author=u'Meng_tisoy', text=u'…hihi #ALDUBThisMustBeLove https://t….', …, Conscientiousness=68.0)
Row(author=u'Kevin Contreras', text=u'…SILA! #ALDUBThisMustBeLove', …, Conscientiousness=68.0)
…
Row(author=u'abbi', text=u'…excited #ALDUBThisMustBeLove https://t…', …, Conscientiousness=100.0)

After (FlatMap of encoded values):
u'#SuperBloodMoon-Cheerfulness:0.0'
u'#SuperBloodMoon-Negative:100.0'
u'#SuperBloodMoon-Negative:23.5'
u'#ALDUBThisMustBeLove-Analytical:85.0'

Page 53: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 4: Create a map indexed by Tag-Tone keys
tagsRDD = tagsRDD.map( lambda fullTag : (fullTag.split(":")[0], float( fullTag.split(":")[1]) ))

Before:
u'#SuperBloodMoon-Cheerfulness:0.0'
u'#SuperBloodMoon-Negative:100.0'
u'#SuperBloodMoon-Negative:23.5'
u'#ALDUBThisMustBeLove-Analytical:85.0'

After (map):
u'#SuperBloodMoon-Cheerfulness'         0.0
u'#SuperBloodMoon-Negative'             100.0
u'#SuperBloodMoon-Negative'             23.5
u'#ALDUBThisMustBeLove-Analytical'      85.0

Page 54: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 5: Call combineByKey to format the data as follows
#Key=Tag-Tone, Value=(sum_of_all_scores_for_this_tone, count)
tagsRDD = tagsRDD.combineByKey((lambda x: (x,1)),
                               (lambda x, y: (x[0] + y, x[1] + 1)),
                               (lambda x, y: (x[0] + y[0], x[1] + y[1])))

Before:
u'#SuperBloodMoon-Cheerfulness'         0.0
u'#SuperBloodMoon-Negative'             100.0
u'#SuperBloodMoon-Negative'             23.5
u'#ALDUBThisMustBeLove-Analytical'      85.0

After:
u'#Supermoon-Confident'                 (0.0, 3)
u'#HajjStampede-Tentative'              (0.0, 3)
u'#KiligKapamilya-Conscientiousness'    (290.0, 6)
u'#LunarEclipse-Tentative'              (92.0, 4)

• createCombiner: creates the initial (sum, count) tuple from the first value seen for a key
• mergeValue: called for each new value, updates (sum, count)
• mergeCombiners: reduce part, merges 2 combiners
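To see how the three functions cooperate, the sketch below emulates the combineByKey contract in plain Python 3. It is not Spark's implementation: the helper combine_by_key, the round-robin partitioning and the sample scores are assumptions chosen to make the per-partition and cross-partition phases visible.

```python
def combine_by_key(pairs, create_combiner, merge_value, merge_combiners, partitions=2):
    """Emulate combineByKey: combine within each partition, then merge partitions."""
    per_partition = []
    for p in range(partitions):
        combiners = {}
        # Round-robin split of the input across partitions (assumed layout).
        for key, value in pairs[p::partitions]:
            if key in combiners:
                combiners[key] = merge_value(combiners[key], value)
            else:
                combiners[key] = create_combiner(value)
        per_partition.append(combiners)

    merged = {}
    for combiners in per_partition:
        for key, comb in combiners.items():
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged


pairs = [
    ("#SuperBloodMoon-Negative", 100.0),
    ("#SuperBloodMoon-Negative", 23.5),
    ("#SuperBloodMoon-Cheerfulness", 0.0),
    ("#ALDUBThisMustBeLove-Analytical", 85.0),
    ("#SuperBloodMoon-Negative", 50.0),
]
sums_and_counts = combine_by_key(
    pairs,
    lambda x: (x, 1),                         # createCombiner: first value -> (sum, count)
    lambda acc, y: (acc[0] + y, acc[1] + 1),  # mergeValue: fold one more value in
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # mergeCombiners: merge two partitions
)
print(sums_and_counts)
```

Keeping (sum, count) instead of a running average is what makes the merge step associative, so partial results from different partitions can be combined in any order.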

Page 55: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 6 : ReIndex the map to have the key be the Tag and value be (Tone, Average_score) tuple
#Key=Tag
#Value=(Tone, average_score)
tagsRDD = tagsRDD.map(lambda (key, ab): (key.split("-")[0], (key.split("-")[1], round(ab[0]/ab[1], 2))))

Before:
u'#Supermoon-Confident'                 (0.0, 3)
u'#HajjStampede-Tentative'              (0.0, 3)
u'#KiligKapamilya-Conscientiousness'    (290.0, 6)
u'#LunarEclipse-Tentative'              (92.0, 4)

After:
u'#Supermoon'         (u'Confident', 0.0)
u'#HajjStampede'      (u'Tentative', 0.0)
u'#KiligKapamilya'    (u'Conscientiousness', 48.33)
u'#LunarEclipse'      (u'Tentative', 23.0)

Page 56: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 7: Reduce the map on the Tag key, value becomes a list of (Tone,average_score) tuples
tagsRDD = tagsRDD.reduceByKey( lambda x, y : makeList(x) + makeList(y) )

Before:
u'#Supermoon'         (u'Confident', 0.0)
u'#HajjStampede'      (u'Tentative', 0.0)
u'#KiligKapamilya'    (u'Conscientiousness', 48.33)
u'#LunarEclipse'      (u'Tentative', 23.0)

After:
u'#HajjStampede'      [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)]
u'#Supermoon'         [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)]
u'#bloodmoon'         [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)]
u'#KiligKapamilya'    [(u'Conscientiousness', 48.33), (u'Anger', 0.0), ..., (u'Agreeableness', 10.83)]

Page 57: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 8 : Sort the (Tone,average_score) tuples alphabetically by Tone
tagsRDD = tagsRDD.mapValues( lambda x : sorted(x) )

Before:
u'#HajjStampede'      [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)]
u'#Supermoon'         [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)]
u'#bloodmoon'         [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)]
u'#KiligKapamilya'    [(u'Conscientiousness', 48.33), (u'Anger', 0.0), ..., (u'Agreeableness', 10.83)]

After:
u'#HajjStampede'      [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0), …, (u'Tentative', 0.0)]
u'#Supermoon'         [(u'Agreeableness', 20.33), (u'Confident', 0.0), ..., (u'Openness', 91.0)]
u'#bloodmoon'         [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)]
u'#KiligKapamilya'    [(u'Agreeableness', 10.83), (u'Anger', 0.0), (u'Conscientiousness', 48.33), ...]

Page 58: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 9 : Format the data as expected by the plotting code in the next cell.
#Map the values to a tuple as follows: ([list of tones], [list of average scores])
tagsRDD = tagsRDD.mapValues( lambda x : ([elt[0] for elt in x],[elt[1] for elt in x]) )

Input:
u'#HajjStampede' [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0), …, (u'Tentative', 0.0)]
u'#Supermoon' [(u'Agreeableness', 20.33), (u'Confident', 0.0), ..., (u'Openness', 91.0)]
u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)]
u'#KiligKapamilya' [(u'Agreeableness', 10.83), (u'Anger', 0.0), (u'Conscientiousness', 48.33), ...]

Output:
u'#HajjStampede' ([u'Agreeableness', u'Cheerfulness', …, u'Tentative'], [3.67, 100.0, …, 0.0])
u'#Supermoon' ([u'Agreeableness', u'Confident', ..., u'Openness'], [20.33, 0.0, …, 91.0])
u'#bloodmoon' ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, …, 38.0])
u'#KiligKapamilya' ([u'Agreeableness', u'Anger', u'Conscientiousness', ...], [10.83, 0.0, 48.33, ...])

The value is a tuple of two arrays: ([tones], [scores])
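Steps 8 and 9 can be tried on a single hashtag's value without Spark. The sketch below uses three of the sample values from the slide. The variable names are illustrative only.

```python
# (tone, average_score) tuples for one hashtag, copied from the slide's sample values.
tone_scores = [(u"Tentative", 0.0), (u"Agreeableness", 3.67), (u"Cheerfulness", 100.0)]

# Step 8: tuples compare element by element, so sorted() orders by tone name first.
ordered = sorted(tone_scores)

# Step 9: split the sorted pairs into two parallel lists for plotting.
tones = [tone for tone, _ in ordered]
scores = [score for _, score in ordered]

print(tones)
print(scores)
```

Sorting before splitting means every hashtag that reports the same tones lists them in the same order. That presumably keeps the bars aligned across the per-hashtag charts.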

Page 59: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

#Step 10 : Use a custom sort function to sort the entries by order of appearance in top10tags
def customCompare( key ):
    for (k,v) in top10tags:
        if k == key:
            return v
    return 0
tagsRDD = tagsRDD.sortByKey(ascending=False, numPartitions=None, keyfunc = customCompare)

Input:
u'#HajjStampede' ([u'Agreeableness', u'Cheerfulness', …, u'Tentative'], [3.67, 100.0, …, 0.0])
u'#Supermoon' ([u'Agreeableness', u'Confident', ..., u'Openness'], [20.33, 0.0, …, 91.0])
u'#bloodmoon' ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, …, 38.0])
u'#KiligKapamilya' ([u'Agreeableness', u'Anger', u'Conscientiousness', ...], [10.83, 0.0, 48.33, ...])

Output:
u'#Superbloodmon' ([u'Agreeableness', u'Cheerfulness', …, u'Tentative'], [33.97, 19.38, …, 12.85])
u'#BBWLA' ([u'Agreeableness', u'Confident', ..., u'Openness'], [38.33, 12.34, …, 21.43])
u'#ALDUBThisMustBeLove' ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, …, 62.0])
u'#Newmusic' ([u'Agreeableness', u'Anger', u'Conscientiousness', ...], [0.0, 0.0, 68.33, ...])
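The custom sort in Step 10 can be reproduced with Python's built-in `sorted`. This sketch assumes `top10tags` is a list of (tag, count) pairs, which is how `customCompare` reads it. The counts and row payloads below are made up for illustration.

```python
# Hypothetical (tag, count) pairs; the real top10tags comes from an earlier analytic.
top10tags = [(u"#HajjStampede", 120), (u"#Supermoon", 95), (u"#bloodmoon", 40)]


def custom_compare(key):
    """Return the count paired with the tag in top10tags, or 0 if the tag is absent."""
    for k, v in top10tags:
        if k == key:
            return v
    return 0


# (tag, payload) rows in arbitrary order; the payload stands in for ([tones], [scores]).
rows = [
    (u"#bloodmoon", "c"),
    (u"#unlisted", "d"),
    (u"#HajjStampede", "a"),
    (u"#Supermoon", "b"),
]

# Descending sort on the looked-up value, like sortByKey(ascending=False, keyfunc=...).
ordered = sorted(rows, key=lambda row: custom_compare(row[0]), reverse=True)
ordered_tags = [tag for tag, _ in ordered]
print(ordered_tags)
```

The result follows the order of `top10tags` only when that list is itself sorted by descending count, since the key function returns the count rather than the position. Tags missing from the list get 0 and sort last.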

Page 60: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

Page 61: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

Page 62: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Agenda
• Introduction
• Quick Introduction to Spark
  • Set up development environment and create the hello world application
  • Notebook Walk-through
  • Spark Streaming
• Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer
  • Architectural Overview
  • Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub
  • Create the Streaming Receiver to connect to Kafka (Scala)
  • Create analytics using Jupyter Notebook (Python)
  • Create Real-time Web Dashboard (Nodejs)

Page 63: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Spark Streaming with “IBM Insight for Twitter” and “Watson Tone Analyzer”

[Architecture diagram; labels only]
Components: Full Archive Search API, Message Hub Service (Bluemix), Watson Tone Analyzer Service (Bluemix), Event Hub Service (Bluemix), Scala Notebook, IPython Notebook, Real-Time Dashboard
Flows: Producer Stream, Consumer Stream, Enrich data with Emotion Tone Scores, Processed data, Publish topics from Spark analytics results, Consumer Spark Topics
Roles: Data Engineer, Data Scientist, Business Analyst, C-Suite

Page 64: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Real-Time Web app Dashboard

‣Pie chart showing top Hashtags distribution

‣Bar chart showing distribution of tone scores for each of top HashTags

Page 65: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Create a Receiver that subscribes to Kafka topics

Store new record into DStream

Get batch of new records

MessageHub on Bluemix requires Kafka 0.9

Page 66: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Create Kafka DStream

Implicit conversion that synthetically adds a method to StreamingContext

Page 67: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Enrich Tweets with Watson Scores

Get Tone scores

Map to new EnrichedTweet Object

Page 68: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Streaming analytics

Prepare for Map/Reduce

Map tag-tone to corresponding score

Compute Count + Average for each score

Map each tag to count + list of score averages

Reduce
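The callouts on this slide describe a map/reduce pipeline whose Scala code is not in the transcript. The following is a plain-Python sketch of the same stages under stated assumptions: the tweet structure, field names, and scores are all hypothetical, and lists and dictionaries stand in for the DStream operations.

```python
from collections import defaultdict

# Hypothetical enriched tweets: hashtags plus a tone -> score mapping.
tweets = [
    {"tags": ["#Supermoon"], "tones": {"Anger": 10.0, "Openness": 90.0}},
    {"tags": ["#Supermoon", "#bloodmoon"], "tones": {"Anger": 30.0, "Openness": 92.0}},
]

# Map: emit ((tag, tone), score) for every tag/tone combination.
pairs = [
    ((tag, tone), score)
    for tweet in tweets
    for tag in tweet["tags"]
    for tone, score in tweet["tones"].items()
]

# Reduce: accumulate (sum, count) per (tag, tone) key.
totals = defaultdict(lambda: (0.0, 0))
for key, score in pairs:
    running_sum, running_count = totals[key]
    totals[key] = (running_sum + score, running_count + 1)

# Regroup by tag: tag -> (tweet_count, {tone: average}).
# Assumes every tweet carries a score for every tone, so counts agree across tones.
per_tag = {}
for (tag, tone), (total, count) in totals.items():
    _, averages = per_tag.setdefault(tag, (count, {}))
    averages[tone] = round(total / count, 2)

print(per_tag)
```

Carrying the count alongside the averages matters for the next slide: a running average can only be merged with a new batch if the number of tweets behind it is known.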

Page 69: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Maintain State between micro-batch RDDs

Maintain State between micro-batches by recomputing count and List of averages
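Recomputing the count and the list of averages across micro-batches amounts to a weighted merge of the old state with each new batch. The sketch below shows that arithmetic in plain Python. The function name `update_state` and the (count, {tone: average}) state shape are assumptions, and the argument order mirrors the update function that PySpark's `updateStateByKey` expects (new values first, previous state second).

```python
def update_state(new_values, state):
    """Merge this batch's (count, averages) values into the running state for one tag.

    Assumes every batch reports the same set of tones for the tag.
    """
    count, averages = state if state is not None else (0, {})
    averages = dict(averages)  # copy so the previous state is not mutated
    for batch_count, batch_averages in new_values:
        if batch_count == 0:
            continue  # nothing to merge, and avoids dividing by zero
        total = count + batch_count
        for tone, batch_avg in batch_averages.items():
            previous = averages.get(tone, 0.0)
            # Weighted average: each side counts in proportion to its tweets.
            averages[tone] = (previous * count + batch_avg * batch_count) / total
        count = total
    return (count, averages)


# First micro-batch: 2 tweets with an average Anger score of 20.0.
state = update_state([(2, {"Anger": 20.0})], None)
# Second micro-batch: 1 tweet with an Anger score of 50.0.
state = update_state([(1, {"Anger": 50.0})], state)
print(state)
```

Averaging the two batch averages directly would give 35.0 here, which overweights the single tweet in the second batch. Weighting by count gives the true mean over all three tweets.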

Page 70: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Produce Streaming analytics topic data

Can’t call the Kafka Producer from the streaming analytic because it is not serializable

Post message to queue

Process message queue from separate Thread
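The workaround on this slide is a producer/consumer hand-off: the analytic only places messages on a queue, and a separate thread that owns the producer drains it. Below is a minimal Python sketch of that pattern. `FakeProducer` is a hypothetical stand-in for a real Kafka producer, the message payloads are invented, and the topic names are the two listed later in the deck.

```python
import queue
import threading


class FakeProducer:
    """Hypothetical stand-in for a Kafka producer that cannot be serialized."""

    def __init__(self):
        self.sent = []

    def send(self, topic, message):
        self.sent.append((topic, message))


outbox = queue.Queue()
STOP = object()  # sentinel that tells the worker thread to exit


def drain(producer):
    """Runs on its own thread: take messages off the queue and publish them."""
    while True:
        item = outbox.get()
        if item is STOP:
            break
        topic, message = item
        producer.send(topic, message)


producer = FakeProducer()
worker = threading.Thread(target=drain, args=(producer,))
worker.start()

# The analytic side only touches the queue, never the producer itself.
outbox.put(("topHashTags", "example payload 1"))
outbox.put(("topHashTags.toneScores", "example payload 2"))

outbox.put(STOP)
worker.join()
print(producer.sent)
```

Because the producer is created and used only on the worker thread's side, nothing non-serializable needs to be captured by the code that runs inside the streaming job.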

Page 71: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Real-time web app dashboard

‣Technology used:
- Mozaik (https://github.com/plouc/mozaik)
- ReactJS
- WebSocket
- D3JS/C3JS

‣Consume Topics generated by Spark Streaming analytics

Consumer Spark Topics

Real-Time Dashboard

Topics:
• topHashTags
• topHashTags.toneScores

Page 72: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Access MessageHub API through message-hub-rest node module

Page 73: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

React Components for Mozaik framework

Page 74: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Demo!

Page 75: Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015

Thank You