apache spark™ + ibm watson + twitter datapalooza sf 2015

©2015 IBM Corporation

Spark + Watson + Twitter

DataPalooza SF 2015

David Taieb, STSM - IBM Cloud Data Services


Agenda
• Introduction
• Quick Introduction to Spark
• Set up development environment and create the hello world application
• Notebook Walk-through
• Spark Streaming
• Deep dive: Sentiment analysis with Twitter and Watson Tone Analyzer
• Architectural Overview
• Set up the Bluemix services: Watson Tone Analyzer, Message Hub and Event Hub
• Create the Streaming Receiver to connect to Kafka (Scala)
• Create analytics using Jupyter Notebook (Python)
• Create Real-time Web Dashboard (Node.js)


Introduction

Our mission: We are here to help developers realize their most ambitious projects.

Goals for today’s session:
•Introduction to real time analytics using Spark Streaming
•Technical Deep dive on the Spark + Watson + Twitter sample application
•At the end of this session, you should be able to download the source code and run the application on IBM Analytics for Apache Spark


What is Spark

Spark is an open source in-memory computing framework for distributed data processing and iterative analysis on massive data volumes


Spark Core Libraries

• Spark Core: general compute engine, handles distributed task dispatching, scheduling and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• MLlib (machine learning): common machine learning and statistical algorithms
• GraphX (graph): distributed graph processing framework


Key reasons for interest in Spark

Open Source
•Largest project and one of the most active on Apache
•Vibrant, growing community of developers continuously improving the code base and extending capabilities
•Fast adoption in the enterprise (IBM, Databricks, etc…)

Fast
•In-memory storage greatly reduces disk I/O
•Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk

Distributed data processing
•Fault tolerant: seamlessly recomputes data lost to hardware failure
•Scalable: easily increase number of worker nodes
•Flexible job execution: Batch, Streaming, Interactive

Productive
•Unified programming model across a range of use cases
•Rich and expressive APIs hide complexities of parallel computing and worker node management
•Support for Java, Scala, Python and R: less code written
•Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX

Web Scale
•Easily handles petabytes of data without special code handling
•Compatible with existing Hadoop ecosystem


High level architecture

[Diagram: two ways of running a Spark application against a Spark cluster.
Batch Job (Spark-Submit): Spark Application (driver) → Master (cluster manager) → Worker Nodes.
Interactive Notebook: Browser → Http/WebSockets → Notebook Server → Kernel Protocol (e.g. ZeroMQ) → Kernel → Master (cluster manager) → Worker Nodes.
Annotations: RDD partitioning; task packaging and dispatching; worker node scheduling.]


Spark programming model lifecycle

Entry points
• sc: SparkContext entry point, created by the application or automatically provided by the Notebook shell
• sqlCtxt: SQLContext entry point for working with DataFrames and executing SQL queries

Load data into RDDs
• In-memory collection:
  • sc.parallelize
• Unstructured data:
  • Text: sc.textFile
  • HDFS: sc.hadoopFile
• Structured data:
  • Json: sqlCtxt.jsonFile
  • Parquet: sqlCtxt.parquetFile
  • Jdbc: sqlCtxt.load
  • Custom data source: 1.4+
• Streaming data:
  • TwitterUtils.createStream
  • KafkaUtils.createStream
  • FlumeUtils.createStream
  • MQTTUtils.createStream
  • Custom DStream

Apply transformations into new RDDs
• Create new RDDs by applying transformations to existing ones
• map(fn): apply fn to all elements in the RDD
• flatMap(fn): same as map, but fn can return 0 or more elements
• filter(fn): select only elements for which fn returns true
• reduceByKey
• sortByKey
• sample: sample a fraction of the data
• union: combine elements of 2 RDDs
• intersection: intersect 2 RDDs
• distinct: remove duplicate elements
• ….

Apply actions (analytics) to produce results
• Produce results from running analytics against RDDs
• reduce(fn): perform summary operation on the elements
• collect(): return all elements in an Array
• count(): count the number of elements in the RDD
• take(n): return the first n elements in an Array
• foreach(fn): execute fn on all the elements in the RDD
• saveAsTextFile: persist the elements in a text file
• ….
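The load, transform, act sequence above can be mimicked on a single machine without Spark. The sketch below is only an analogy in plain Python 3: a range stands in for an RDD, generator expressions stand in for lazy transformations, and `reduce` stands in for an action. None of the variable names come from the Spark API.

```python
from functools import reduce

# Stage 1: "load" data (stand-in for something like sc.parallelize)
data = range(1, 11)

# Stage 2: transformations are lazy; nothing is computed yet
squares = (x * x for x in data)                      # analogous to map(fn)
even_squares = (x for x in squares if x % 2 == 0)    # analogous to filter(fn)

# Stage 3: an action forces evaluation and produces a result
total = reduce(lambda a, b: a + b, even_squares)     # analogous to reduce(fn)
print(total)
```

As with RDD transformations, the two generator expressions do no work until the final reduction pulls values through them.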


Job Scheduling


Ecosystem of IBM Analytics for Apache Spark as a service


Setup local development Environment

•Prerequisites
- Scala runtime 2.10.4 http://www.scala-lang.org/download/2.10.4.html
- Homebrew http://brew.sh/
- Scala sbt http://www.scala-sbt.org/download.html
- Spark 1.3.1 http://www.apache.org/dyn/closer.lua/spark/spark-1.3.1/spark-1.3.1.tgz

•Detailed instructions here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/


Setup local development Environment contd..
•Create scala project using sbt
•Create directories to start from scratch

mkdir helloSpark && cd helloSpark
mkdir -p src/main/scala
mkdir -p src/main/java
mkdir -p src/main/resources

Create a subdirectory under the src/main/scala directory, matching the package name used in the code

mkdir -p com/ibm/cds/spark/samples

•Github URL for the same project https://github.com/ibm-cds-labs/spark.samples


Setup local development Environment contd..
•Create HelloSpark.scala using an IDE or a text editor
•Copy paste this code snippet

package com.ibm.cds.spark.samples

import org.apache.spark._

object HelloSpark {
  //main method invoked when running as a standalone Spark Application
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Hello Spark")
    val spark = new SparkContext(conf)

    println("Hello Spark Demo. Compute the mean and variance of a collection")
    val stats = computeStatsForCollection(spark);
    println(">>> Results: ")
    println(">>>>>>>Mean: " + stats._1 );
    println(">>>>>>>Variance: " + stats._2);
    spark.stop()
  }

  //Library method that can be invoked from Jupyter Notebook
  def computeStatsForCollection( spark: SparkContext, countPerPartitions: Int = 100000, partitions: Int=5): (Double, Double) = {
    val totalNumber = math.min( countPerPartitions * partitions, Long.MaxValue).toInt;
    val rdd = spark.parallelize( 1 until totalNumber,partitions);
    (rdd.mean(), rdd.variance())
  }
}
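To make clear what `computeStatsForCollection` returns, here is a minimal single-machine sketch of the same arithmetic in plain Python 3. It assumes that `rdd.variance()` is the population variance (dividing by n), which is how Spark's RDD API defines it; the function name and the small sizes are illustrative only.

```python
def compute_stats_for_collection(count_per_partition=100000, partitions=5):
    """Mean and population variance of 1, 2, ..., total_number - 1."""
    total_number = count_per_partition * partitions
    values = range(1, total_number)   # Scala's `1 until n` excludes n
    n = len(values)
    mean = sum(values) / n
    # Population variance: average squared distance from the mean
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance

# Small sizes keep the example fast: the collection is 1..49
mean, variance = compute_stats_for_collection(count_per_partition=10, partitions=5)
print(mean, variance)
```

In this sketch the partition count only affects the collection size; in Spark it also controls how the collection is split across worker nodes.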


Setup local development Environment contd..
•Create a file build.sbt under the project root directory:

name := "helloSpark"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= {
  val sparkVersion = "1.3.1"
  Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion,
    "org.apache.spark" %% "spark-sql" % sparkVersion,
    "org.apache.spark" %% "spark-repl" % sparkVersion
  )
}

•Under the project root directory run

Download all dependencies
$ sbt update

Compile
$ sbt compile

Package an application jar file
$ sbt package

Check for the packaged jar (named like helloSpark_2.10-1.0.jar) under target/scala-2.10 in the project root directory


Hello World application on Bluemix Apache Starter


Introduction to Notebooks
‣Notebooks allow creation of interactive executable documents that include rich text with Markdown, executable code with Scala, Python or R, graphics with matplotlib
‣Apache Spark provides APIs in multiple languages that can be executed with a REPL shell: Scala, Python (PySpark), R
‣Multiple open-source implementations available:
- Jupyter: https://jupyter.org
- Apache Zeppelin: http://zeppelin-project.org


Notebook walkthrough

‣Sign up on Bluemix https://console.ng.bluemix.net/registration/
‣Getting started with Analytics for Apache Spark: https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html
‣You can also follow the tutorial here: https://developer.ibm.com/clouddataservices/start-developing-with-spark-and-notebooks/


Spark Streaming
‣“Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams” (http://spark.apache.org/docs/latest/streaming-programming-guide.html)

‣Breaks down the streaming data into smaller pieces (micro-batches), which are then sent to the Spark engine
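The idea of breaking a stream into smaller pieces can be sketched without Spark. The plain Python 3 example below groups timestamped events into fixed-width batches; the function name, the event format and the one-second interval are assumptions made for illustration, not part of the Spark Streaming API.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, payload) events into fixed-width batches."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Every event falls into the batch that covers its timestamp
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(payload)
    # Return the batches in time order
    return [batches[i] for i in sorted(batches)]

events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, batch_interval=1.0)
print(batches)
```

Each inner list plays the role of one micro-batch that the engine would then process like an ordinary collection.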


Spark Streaming
‣Provides connectors for multiple data sources:
- Kafka
- Flume
- Twitter
- MQTT
- ZeroMQ

‣Provides an API to create custom connectors. Lots of examples available on Github and spark-packages.org


Spark + Twitter + Watson application
‣Use Spark Streaming in combination with IBM Watson to perform sentiment analysis and track how a conversation is trending on Twitter.

‣Use Spark Streaming to create a feed that captures live tweets from Twitter. You can optionally filter the tweets that contain the hashtag(s) of your choice.

‣The tweet data is then enriched in real time with various sentiment scores provided by the Watson Tone Analyzer service (available on Bluemix). This service provides insight into sentiment, or how the author feels.

‣The data is then loaded and analyzed by the data scientist within Notebook.

‣We can also use streaming analytics to feed a real-time web app dashboard


About this sample application

• Github: https://github.com/ibm-cds-labs/spark.samples/tree/master/streaming-twitter
• Tutorial: https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags

• A word about Scala
  • Scala is object oriented but also supports a functional programming style
  • Bi-directional interoperability with Java
  • Resources:
    • Official web site: http://scala-lang.org
    • Excellent first steps site: http://www.artima.com/scalazine/articles/steps.html
    • Free e-books: http://readwrite.com/2011/04/30/5-free-b-books-and-tutorials-o


Spark Streaming with “IBM Insight for Twitter” and “Watson Tone Analyzer”

[Architecture diagram.
Personas: Data Engineer, Data Scientist, Business Analyst, C(Suite).
Components: Full Archive Search API; Producer Stream; Message Hub Service (Bluemix); Consumer Stream; Watson Tone Analyzer Service (Bluemix), used to enrich data with Emotion Tone Scores; Processed data; Scala Notebook; IPython Notebook; Consumer Spark Topics; Publish topics from Spark analytics results; Event Hub Service (Bluemix); Real-Time Dashboard.]


Building a Spark Streaming application Sentiment analysis with Twitter and Watson Tone Analyzer

‣Configure Twitter and Watson Tone Analyzer
1. Configure OAuth credentials for Twitter
2. Create a Watson Tone Analyzer Service on Bluemix
3. Configure MessageHub Service on Bluemix (Kafka)
4. Configure EventHub Service on Bluemix


Configure OAuth credentials for Twitter
‣You can follow along the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#twitter


Create a Watson Tone Analyzer Service on Bluemix
‣You can follow along the steps in https://developer.ibm.com/clouddataservices/sentiment-analysis-of-twitter-hashtags/#bluemix


Building a Spark Streaming application Sentiment analysis with Twitter and Watson Tone Analyzer

‣Work with Twitter data
1. Create a Twitter Stream
2. Enrich the data with sentiment analysis from Watson Tone Analyzer
3. Aggregate data into RDD with enriched Data model
4. Create SparkSQL DataFrame and register Table


Create a Twitter Stream

//Hold configuration key/value pairs
val config = Map[String, String](
  ("twitter4j.oauth.consumerKey", Option(System.getProperty("twitter4j.oauth.consumerKey")).orNull ),
  ("twitter4j.oauth.consumerSecret", Option(System.getProperty("twitter4j.oauth.consumerSecret")).orNull ),
  ("twitter4j.oauth.accessToken", Option(System.getProperty("twitter4j.oauth.accessToken")).orNull ),
  ("twitter4j.oauth.accessTokenSecret", Option(System.getProperty("twitter4j.oauth.accessTokenSecret")).orNull ),
  ("tweets.key", Option(System.getProperty("tweets.key")).getOrElse("")),
  ("watson.tone.url", Option(System.getProperty("watson.tone.url")).orNull ),
  ("watson.tone.username", Option(System.getProperty("watson.tone.username")).orNull ),
  ("watson.tone.password", Option(System.getProperty("watson.tone.password")).orNull )
)

Create a map that stores the credentials for the Twitter and Watson Service

//Copy every twitter4j.* entry into the System properties
config.foreach( (t:(String,String)) =>
  if ( t._1.startsWith( "twitter4j") ) System.setProperty( t._1, t._2 )
)

Twitter4j requires credentials to be stored in System properties


Create a Twitter Stream

//Filter the tweets to only keep the ones with English as the language
//twitterStream is a discretized stream of twitter4j Status objects
//CharMatcher comes from Guava: import com.google.common.base.CharMatcher
var twitterStream = org.apache.spark.streaming.twitter.TwitterUtils.createStream( ssc, None )
  .filter { status =>
    //Allow only tweets that use "en" as the language
    Option(status.getUser).flatMap[String] {
      u => Option(u.getLang)
    }.getOrElse("").startsWith("en") &&
    //Only pick text that is ASCII
    CharMatcher.ASCII.matchesAllOf(status.getText) &&
    //If the user specified #hashtags to monitor
    ( keys.isEmpty || keys.exists{status.getText.contains(_)})
  }

Initial DStream of Status Objects
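The three conditions of the filter above can be checked in isolation. This is a plain Python 3 sketch of the same predicate, not the Scala code from the sample; `keep_tweet` and its parameters are hypothetical names, and a missing user language is modelled as `None`.

```python
def keep_tweet(lang, text, keys):
    """Return True when a tweet passes the three filter conditions."""
    # 1. The user's language must start with "en" (None counts as missing)
    is_english = (lang or "").startswith("en")
    # 2. The text must consist of ASCII characters only
    is_ascii = all(ord(ch) < 128 for ch in text)
    # 3. If hashtags were specified, at least one must appear in the text
    matches_key = not keys or any(k in text for k in keys)
    return is_english and is_ascii and matches_key

print(keep_tweet("en", "hello #spark", ["#spark"]))
print(keep_tweet(None, "hello", []))
```

An empty `keys` list means no hashtag restriction, mirroring the `keys.isEmpty` branch in the Scala filter.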


Enrich the data with sentiment analysis from Watson Tone Analyzer

//Broadcast the config to each worker node
val broadcastVar = sc.broadcast(config)

val rowTweets = twitterStream.map(status=> {
  lazy val client = PooledHttp1Client()
  val sentiment = callToneAnalyzer(client, status,
    broadcastVar.value.get("watson.tone.url").get,
    broadcastVar.value.get("watson.tone.username").get,
    broadcastVar.value.get("watson.tone.password").get
  )
  …
})



Data Model
 |-- author: string (nullable = true)
 |-- date: string (nullable = true)
 |-- lang: string (nullable = true)
 |-- text: string (nullable = true)
 |-- lat: integer (nullable = true)
 |-- long: integer (nullable = true)
 |-- Cheerfulness: double (nullable = true)
 |-- Negative: double (nullable = true)
 |-- Anger: double (nullable = true)
 |-- Analytical: double (nullable = true)
 |-- Confident: double (nullable = true)
 |-- Tentative: double (nullable = true)
 |-- Openness: double (nullable = true)
 |-- Agreeableness: double (nullable = true)
 |-- Conscientiousness: double (nullable = true)

DStream of key,value pairs


Aggregate data into RDD with enriched Data model…..

//Aggregate the data from each DStream into the working RDD
rowTweets.foreachRDD( rdd => {
  if ( rdd.count() > 0 ){
    workingRDD = sc.parallelize( rdd.map( t => t._1 ).collect()).union( workingRDD )
  }
})

[Diagram: the micro-batches of the initial DStream rowTweets are appended as rows (Row 1, Row 2, Row 3, Row 4, …, Row n) to workingRDD. Each row follows the data model: author, date, lang, text, lat, long, and the nine tone scores from Cheerfulness to Conscientiousness.]


Create SparkSQL DataFrame and register Table

//Create a SparkSQL DataFrame from the aggregate workingRDD
val df = sqlContext.createDataFrame( workingRDD, schemaTweets )
//Register a temporary table using the name "tweets"
df.registerTempTable("tweets")
println("A new table named tweets with " + df.count() + " records has been correctly created and can be accessed through the SQLContext variable")
println("Here's the schema for tweets")
df.printSchema()
(sqlContext, df)

[Diagram: the rows of workingRDD (Row 1, Row 2, Row 3, Row 4, …, Row n) become the relational SparkSQL table below.]

Relational SparkSQL Table

author | date | lang | … | Cheerfulness | Negative | … | Conscientiousness
John Smith | 10/11/2015 – 20:18 | en | … | 0.0 | 65.8 | … | 25.5
Alfred | … | en | … | 34.5 | 0.0 | … | 100.0
… | … | … | … | … | … | … | …
Chris | … | en | … | 85.3 | 22.9 | … | 0.0


Building a Spark Streaming application: Sentiment analysis with Twitter and Watson Tone Analyzer

‣IPython Notebook analysis
1. Load the data into an IPython Notebook
2. Analytic 1: Compute the distribution of tweets by sentiment scores greater than 60%
3. Analytic 2: Compute the top 10 hashtags contained in the tweets
4. Analytic 3: Visualize aggregated sentiment scores for the top 5 hashtags


Load the data into an IPython Notebook
‣ You can follow along the steps here: https://github.com/ibm-cds-labs/spark.samples/blob/master/streaming-twitter/notebook/Twitter%20%2B%20Watson%20Tone%20Analyzer%20Part%202.ipynb

Create a SQLContext from a SparkContext

Load from parquet file and create a DataFrame

Create a SQL table and start executing SQL queries


Analytic 1 - Compute the distribution of tweets by sentiment scores greater than 60%

#create an array that will hold the count for each sentiment
sentimentDistribution=[0] * 9
#For each sentiment, run a sql query that counts the number of tweets for which the sentiment score is greater than 60%
#Store the data in the array
for i, sentiment in enumerate(tweets.columns[-9:]):
    sentimentDistribution[i]=sqlContext.sql("SELECT count(*) as sentCount FROM tweets where " + sentiment + " > 60")\
        .collect()[0].sentCount



Use matplotlib to create a bar chart



Bar Chart Visualization


Analytic 2: Compute the top 10 hashtags contained in the tweets

[Diagram: Initial Tweets RDD → Filter hashtags → Key, value pair RDD → Reduced map with counts → Sorted Map by key. Operations applied along the way: flatMap, filter, map, reduceByKey, sortByKey.]
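The chain of operations in this diagram can be illustrated on a single machine. The sketch below is plain Python 3 rather than the notebook's PySpark code; `top_hashtags` is a hypothetical helper, and each step is commented with the RDD operation it stands in for.

```python
from collections import Counter

def top_hashtags(tweets, n=10):
    """Return the n most frequent hashtags as (tag, count) pairs."""
    words = (w for text in tweets for w in text.split())   # flatMap: text -> words
    tags = (w for w in words if w.startswith("#"))         # filter: keep hashtags
    pairs = ((tag, 1) for tag in tags)                     # map: tag -> (tag, 1)
    counts = Counter()
    for tag, one in pairs:                                 # reduceByKey: sum the 1s
        counts[tag] += one
    # Sort by count, highest first, then keep the first n entries
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

tweets = ["#a #b x", "#a y", "#c #a #b"]
top = top_hashtags(tweets)
print(top)
```

The result has the same shape as the `top10tags` list used in Analytic 3: a list of (tag, count) pairs.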




Analytic 3 - Visualize aggregated sentiment scores for the top 5 hashtags

‣Problem:
- Compute the mean of each emotion score for each of the top 10 hashtags
- Format the data in a way that can be consumed by the plot script



#Step 1: Create RDD from tweets dataframe
tagsRDD = tweets.map(lambda t: t )

tweets (Type: DataFrame)
author | … | Cheerfulness
Jake | … | 0.0
Scrad | … | 23.5
Nittya Indika | … | 84.0
… | … | …
Madison | … | 93.0

tagsRDD (Type: RDD)
Row(author=u'Jake', …, text=u'@sarahwag…', Cheerfulness=0.0, …)
Row(author=u'Scrad', …, text=u' #SuperBloodMoon https://t…', Cheerfulness=23.5, …)
Row(author=u' Nittya Indika', …, text=u' Good mornin! http://t.…', Cheerfulness=84.0, …)
…
Row(author=u' Madison', …, text=u' how many nights…', Cheerfulness=93.0, …)



#Step 2: Filter to only keep the entries that are in top10tags
tagsRDD = tagsRDD.filter( lambda t: any(s in t.text for s in [i[0] for i in top10tags] ) )

tagsRDD after the filter:
Row(author=u'Mike McGuire', text=u'Explains my disappointment #SuperBloodMoon https://t.co/Gfg7vWws5W', …, Conscientiousness=0.0)
Row(author=u'Meng_tisoy', text=u'…hihi #ALDUBThisMustBeLove https://t….',…,Conscientiousness=68.0)
Row(author=u'Kevin Contreras', text=u'…SILA! #ALDUBThisMustBeLove', …Conscientiousness=68.0)
…
Row(author=u'abbi', text=u'…excited #ALDUBThisMustBeLove https://t…',…, Conscientiousness=100.0)


#Step 3: Create a flatMap using the expand function defined below, this will be used to collect all the scores
#for a particular tag with the following format: Tag-Tone:ToneScore
cols = tweets.columns[-9:]
def expand( t ):
    ret = [ ]
    for s in [i[0] for i in top10tags]:
        if ( s in t.text ):
            for tone in cols:
                ret += [s + u"-" + unicode(tone) + ":" + unicode(getattr(t, tone))]
    return ret
tagsRDD = tagsRDD.flatMap( expand )

FlatMap of encoded values:
u'#SuperBloodMoon-Cheerfulness:0.0'
u'#SuperBloodMoon-Negative:100.0'
u'#SuperBloodMoon-Negative:23.5'
…
u'#ALDUBThisMustBeLove-Analytical:85.0'



#Step 4: Create a map indexed by Tag-Tone keys
tagsRDD = tagsRDD.map( lambda fullTag : (fullTag.split(":")[0], float( fullTag.split(":")[1]) ))

tagsRDD after the map:
u'#SuperBloodMoon-Cheerfulness' 0.0
u'#SuperBloodMoon-Negative' 100.0
u'#SuperBloodMoon-Negative' 23.5
…
u'#ALDUBThisMustBeLove-Analytical' 85.0



#Step 5: Call combineByKey to format the data as follows
#Key=Tag-Tone, Value=(sum_of_all_scores_for_this_tone, count)
tagsRDD = tagsRDD.combineByKey((lambda x: (x,1)),
                  (lambda x, y: (x[0] + y, x[1] + 1)),
                  (lambda x, y: (x[0] + y[0], x[1] + y[1])))

tagsRDD after combineByKey:
u'#Supermoon-Confident' (0.0, 3)
u'#HajjStampede-Tentative' (0.0, 3)
u'#KiligKapamilya-Conscientiousness' (290.0, 6)
…
u'#LunarEclipse-Tentative' (92.0, 4)

createCombiner: creates the initial (sum, count) tuple from the first value seen for a key
mergeValue: called for each new value, updates (sum, count)
mergeCombiners: reduce part, merges 2 combiners
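To see how the three functions cooperate, the following plain Python 3 sketch emulates the semantics of combineByKey over explicit partitions. It is an illustration under stated assumptions, not Spark's implementation: `combine_by_key` is a hypothetical helper and the partitions are ordinary lists of (key, value) pairs.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Emulate combineByKey: combine within partitions, then across them."""
    per_partition = []
    for partition in partitions:
        combiners = {}
        for key, value in partition:
            if key in combiners:
                # Key already seen in this partition: fold the value in
                combiners[key] = merge_value(combiners[key], value)
            else:
                # First value for this key in this partition
                combiners[key] = create_combiner(value)
        per_partition.append(combiners)

    result = {}
    for combiners in per_partition:
        for key, combiner in combiners.items():
            if key in result:
                # Same key seen in another partition: merge the two combiners
                result[key] = merge_combiners(result[key], combiner)
            else:
                result[key] = combiner
    return result

partitions = [
    [("#a-Anger", 10.0), ("#a-Anger", 20.0), ("#b-Anger", 5.0)],
    [("#a-Anger", 30.0)],
]
sums_and_counts = combine_by_key(
    partitions,
    lambda x: (x, 1),                          # createCombiner -> (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # mergeCombiners
)
averages = {k: round(s / c, 2) for k, (s, c) in sums_and_counts.items()}
print(sums_and_counts, averages)
```

Keeping (sum, count) rather than a running average is what makes the merge across partitions exact: averages cannot be added, but sums and counts can.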



#Step 6 : ReIndex the map to have the key be the Tag and value be (Tone, Average_score) tuple
#Key=Tag
#Value=(Tone, average_score)
tagsRDD = tagsRDD.map(lambda (key, ab): (key.split("-")[0], (key.split("-")[1], round(ab[0]/ab[1], 2))))

tagsRDD after the map:
u'#Supermoon' (u'Confident', 0.0)
u'#HajjStampede' (u'Tentative', 0.0)
u'#KiligKapamilya' (u'Conscientiousness', 48.33)
…
u'#LunarEclipse' (u'Tentative', 23.0)



#Step 7: Reduce the map on the Tag key, value becomes a list of (Tone,average_score) tuples
tagsRDD = tagsRDD.reduceByKey( lambda x, y : makeList(x) + makeList(y) )

tagsRDD after reduceByKey:
u'#HajjStampede' [(u'Tentative', 0.0), (u'Agreeableness', 3.67), …, (u'Cheerfulness', 100.0)]
u'#Supermoon' [(u'Confident', 0.0), (u'Openness', 91.0), …, (u'Agreeableness', 20.33)]
u'#bloodmoon' [(u'Anger', 0.0), (u'Negative', 0.0), …, (u'Openness', 38.0)]
…
u'#KiligKapamilya' [(u'Conscientiousness', 48.33), (u'Anger', 0.0), ... (u'Agreeableness', 10.83)]
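The `makeList` helper used in Step 7 is not shown on the slides. One plausible definition, given here purely as an assumption, wraps a bare (Tone, average_score) tuple in a list and leaves an existing list untouched, so the reduction can concatenate either kind of operand. The sketch is plain Python 3 and uses `functools.reduce` in place of `reduceByKey` for a single key.

```python
from functools import reduce

def make_list(value):
    """Hypothetical stand-in for makeList: wrap non-list values in a list."""
    return value if isinstance(value, list) else [value]

# All (Tone, average_score) tuples that share one tag
values_for_tag = [("Anger", 0.0), ("Negative", 0.0), ("Openness", 38.0)]

# Same reduction function as in Step 7, applied to one key's values
merged = reduce(lambda x, y: make_list(x) + make_list(y), values_for_tag)
print(merged)
```

Under this assumption a tag with a single tuple would stay a bare tuple, because a reduction over one element never calls the function.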



#Step 8 : Sort the (Tone,average_score) tuples alphabetically by Tone
tagsRDD = tagsRDD.mapValues( lambda x : sorted(x) )

tagsRDD after sorting the values:
u'#HajjStampede' [(u'Agreeableness', 3.67), (u'Cheerfulness', 100.0), …, (u'Tentative', 0.0)]
u'#Supermoon' [(u'Agreeableness', 20.33), (u'Confident', 0.0), ..., (u'Openness', 91.0)]
…
u'#KiligKapamilya' [(u'Agreeableness', 10.83), (u'Anger', 0.0), (u'Conscientiousness', 48.33), ...]



#Step 9 : Format the data as expected by the plotting code in the next cell.
#map the Values to a tuple as follows: ([list of tone], [list of average score])
tagsRDD = tagsRDD.mapValues( lambda x : ([elt[0] for elt in x],[elt[1] for elt in x]) )

tagsRDD after mapValues (value is a tuple of 2 arrays: tones, scores):
u'#HajjStampede' ([u'Agreeableness', u'Cheerfulness', … u'Tentative'], [3.67, 100.0, … 0.0])
u'#Supermoon' ([u'Agreeableness', u'Confident', ..., u'Openness'], [20.33, 0.0, … 91.0])
u'#bloodmoon' ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, … 38.0])
…
u'#KiligKapamilya' ([u'Agreeableness', u'Anger', u'Conscientiousness', ...], [10.83, 0.0, 48.33, ...])



#Step 10 : Use custom sort function to sort the entries by order of appearance in top10tags
def customCompare( key ):
    for (k,v) in top10tags:
        if k == key:
            return v
    return 0
tagsRDD = tagsRDD.sortByKey(ascending=False, numPartitions=None, keyfunc = customCompare)

tagsRDD after sortByKey:
u'#Superbloodmon' ([u'Agreeableness', u'Cheerfulness', … u'Tentative'], [33.97, 19.38, … 12.85])
u'#BBWLA' ([u'Agreeableness', u'Confident', ..., u'Openness'], [38.33, 12.34, … 21.43])
u'#ALDUBThisMustBeLove' ([u'Anger', u'Negative', …, u'Openness'], [0.0, 0.0, … 62.0])
…
u'#Newmusic' ([u'Agreeableness', u'Anger', u'Conscientiousness', ...], [0.0, 0.0, 68.33, ...])


Real-Time Web app Dashboard

‣Pie chart showing top Hashtags distribution

‣Bar chart showing distribution of tone scores for each of top HashTags


Create a Receiver that subscribes to Kafka topics

Store new record into DStream

Get batch of new records

MessageHub on Bluemix requires Kafka 0.9


Create Kafka DStream

Implicit conversion that synthetically adds a method to StreamingContext


Enrich Tweets with Watson Scores

Get Tone scores

Map to new EnrichedTweet Object


Streaming analytics

Prepare for Map/Reduce

Map tag-tone to corresponding score

Compute Count + Average for each score

Map each tag to count + List of scores averages

Reduce


Maintain State between micro-batch RDDs

Maintain State between micro-batches by recomputing count and List of averages
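One way to recompute a count and a list of averages when a new micro-batch arrives is a count-weighted merge of the old and new averages. The plain Python 3 sketch below shows that arithmetic only; it is not the Scala code from the sample, and `update_state` and the state layout (tag mapped to a count and a dict of tone averages) are assumptions made for illustration.

```python
def update_state(state, batch):
    """Merge a batch into the state; both map tag -> (count, {tone: average})."""
    merged = dict(state)
    for tag, (batch_count, batch_avgs) in batch.items():
        if tag not in merged:
            # First time this tag is seen: take the batch values as they are
            merged[tag] = (batch_count, dict(batch_avgs))
            continue
        old_count, old_avgs = merged[tag]
        total = old_count + batch_count
        new_avgs = {}
        for tone in set(old_avgs) | set(batch_avgs):
            # Turn each average back into a sum, add the sums, divide by the new count
            old_sum = old_avgs.get(tone, 0.0) * old_count
            batch_sum = batch_avgs.get(tone, 0.0) * batch_count
            new_avgs[tone] = (old_sum + batch_sum) / total
        merged[tag] = (total, new_avgs)
    return merged

state = {"#a": (2, {"Anger": 10.0})}
batch = {"#a": (3, {"Anger": 20.0}), "#b": (1, {"Anger": 5.0})}
state = update_state(state, batch)
print(state)
```

Weighting by count matters: simply averaging the two averages would give 15.0 for `#a` here, while the true mean over all five tweets is 16.0.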


Produce Streaming analytics topic data

Can't call the Kafka producer from the streaming analytic because it is not serializable

Post message to queue

Process message queue from separate Thread
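The queue-plus-thread pattern described here can be sketched with the Python 3 standard library. This is an illustration of the pattern only: `send_to_kafka` is a hypothetical placeholder that records messages in a list instead of calling a real producer, and the message payloads are made up. The topic names are the ones listed for the dashboard.

```python
import queue
import threading

message_queue = queue.Queue()
sent = []  # stand-in for messages delivered to the broker

def send_to_kafka(topic, message):
    # Hypothetical placeholder for a real producer call
    sent.append((topic, message))

def drain_queue():
    # Runs on its own thread and owns the (non-serializable) producer
    while True:
        item = message_queue.get()
        if item is None:   # sentinel value: stop the worker
            break
        send_to_kafka(*item)

worker = threading.Thread(target=drain_queue)
worker.start()

# The analytics code only enqueues plain data; it never touches the producer
message_queue.put(("topHashTags", '{"#a": 3}'))
message_queue.put(("topHashTags.toneScores", '{"#a": {"Anger": 16.0}}'))
message_queue.put(None)
worker.join()
print(sent)
```

Because only plain tuples cross the queue, the code that computes the analytics never has to capture the producer object.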


Real-time web app dashboard

‣Technology used:
- Mozaik (https://github.com/plouc/mozaik)
- ReactJS
- WebSocket
- D3JS/C3JS

‣Consume Topics generated by Spark Streaming analytics


Real-Time Dashboard

Topics:
•topHashTags
•topHashTags.toneScores


Access MessageHub API through message-hub-rest node module


React Components for Mozaik framework


Demo!


Thank You
