Introduction to Twitter Storm

DESCRIPTION
Talk held at FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany.
Agenda:
- Why Twitter Storm?
- What is Twitter Storm?
- What to do with Twitter Storm?

TRANSCRIPT
Sankt Augustin, 24.-25.08.2013
Introduction to Twitter Storm
uweseiler
About me
Big Data Nerd
Travelpirate
Photography Enthusiast
Hadoop Trainer
MongoDB Author
About us
is a bunch of…
Big Data Nerds Agile Ninjas Continuous Delivery Gurus
Enterprise Java Specialists Performance Geeks
Join us!
Agenda
• Why Twitter Storm?
• What is Twitter Storm?
• What to do with Twitter Storm?
The 3 V’s of Big Data
Volume, Velocity, Variety
Velocity
Why Twitter Storm?
Batch vs. Real-Time processing
• Batch processing – Gathering of data and processing it as a group at one time
• Real-time processing – Processing of data that takes place as the information is being entered
Lambda architecture
Bridging the gap…
• A batch workflow is too slow
• Views are out of date
[Diagram: timeline — older data has been absorbed into batch views; just a few hours of data, up to now, are not yet absorbed]
Storm vs. Hadoop
Storm:
• Real-time processing
• Topologies run forever
• No SPOF
• Stateless nodes
Hadoop:
• Batch processing
• Jobs run to completion
• NameNode is a SPOF
• Stateful nodes
Both:
• Scalable
• Guarantees no data loss
• Open Source
Stream Processing
Stream processing is a technical paradigm for processing large volumes of unbounded sequences of tuples in real time
[Diagram: Source → Stream Processing]
• Algorithmic trading
• Sensor data monitoring
• Continuous analytics
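The paradigm can be sketched in plain Java, without Storm (the queue, sample values, and class name are illustrative, not from the talk): a consumer processes each tuple the moment it arrives and keeps a running result, instead of gathering a batch first.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch of stream processing: a producer stands in for an
// endless source (sensor readings, tweets, trades); the consumer
// updates a running aggregate per tuple instead of batching.
public class RunningAverage {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Double> sensorReadings = new ArrayBlockingQueue<>(1024);

        // Producer thread: emits a (here finite) stream of readings.
        Thread source = new Thread(() -> {
            double[] samples = {21.0, 22.0, 23.0, 24.0};
            for (double s : samples) {
                try { sensorReadings.put(s); } catch (InterruptedException ignored) {}
            }
        });
        source.start();

        // Consumer: processes each tuple as it is entered.
        double sum = 0;
        for (int count = 1; count <= 4; count++) {
            sum += sensorReadings.take();
            System.out.println("running average after " + count + " readings: " + (sum / count));
        }
        source.join();
    }
}
```

With the four sample values the final running average is 22.5; in a real stream the loop would simply never terminate.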
Example: Stream of tweets
https://github.com/colinsurprenant/tweitgeist
Agenda
• Why Twitter Storm?
• What is Twitter Storm?
• What to do with Twitter Storm?
Welcome, Twitter Storm!
• Created by Nathan Marz @ BackType
– Analyze tweets, links, users on Twitter
• Open sourced on 19th September, 2011
– Eclipse Public License 1.0
– Storm v0.5.2
• Latest updates
– Current stable release v0.8.2, released on 11th January, 2013
– Major core improvements planned for v0.9.0
– Storm will be an Apache project [soon..]
Storm under the hood
• Java & Clojure
• Apache Thrift
– Cross-language bridge, RPC, framework to build services
• ZeroMQ
– Asynchronous message transport layer
• Kryo
– Serialization framework
• Jetty
– Embedded web server
Conceptual view
[Diagram: a topology — spouts feeding streams of tuples into bolts]
• Spout: Source of streams
• Bolt: Consumer of streams, processes tuples, possibly emits new tuples
• Tuple: List of name-value pairs
• Stream: Unbounded sequence of tuples
• Topology: Network of spouts & bolts as the nodes and streams as the edges
Physical view
[Diagram: cluster layout — Nimbus and a ZooKeeper ensemble coordinating worker nodes, each running a supervisor and worker processes with executors and tasks]
• Nimbus: Master daemon process, responsible for distributing code, assigning tasks, and monitoring failures
• ZooKeeper: Stores the operational cluster state
• Supervisor: Worker daemon process listening for work assigned to its node
• Worker process: Java process executing a subset of a topology
• Executor: Java thread spawned by a worker, runs one or more tasks of the same component
• Task: Component (spout/bolt) instance, performs the actual data processing
A simple example: WordCount
shakespeare.txt → FileReaderSpout → (line) → WordSplitBolt → (word) → WordCountBolt → sorted list:
of: 18126
to: 18763
i: 19540
and: 26099
the: 27730
FileReaderSpout I
package de.codecentric.storm.wordcount.spouts;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
public class FileReaderSpout extends BaseRichSpout {
private SpoutOutputCollector collector;
private FileReader fileReader;
private boolean completed = false;
public void ack(Object msgId) {
System.out.println("OK:" + msgId);
}
public void fail(Object msgId) {
System.out.println("FAIL:" + msgId);
}
FileReaderSpout II
/**
* Declare the output field "line"
*/
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("line"));
}
/**
* We will read the file and get the collector object
*/
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
try {
this.fileReader = new FileReader(conf.get("wordsFile").toString());
} catch (FileNotFoundException e) {
throw new RuntimeException("Error reading file ["
+ conf.get("wordsFile") + "]");
}
this.collector = collector;
}
public void close() {
}
FileReaderSpout III
/**
* The only thing this method does is emit each file line
*/
public void nextTuple() {
/**
* nextTuple() is called forever, so once the file has been read
* completely we just wait and then return
*/
if (completed) {
try {
Thread.sleep(100);
} catch (InterruptedException e) {
// Ignore and let nextTuple() be called again
}
return;
}
String str;
// Open the reader
BufferedReader reader = new BufferedReader(fileReader);
try {
// Read all lines
while ((str = reader.readLine()) != null) {
/**
* Emit each line as a value
*/
this.collector.emit(new Values(str), str);
}
} catch (Exception e) {
throw new RuntimeException("Error reading tuple", e);
} finally {
completed = true;
}
}
}
WordSplitBolt I
package de.codecentric.storm.wordcount.bolts;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
public class WordSplitBolt extends BaseBasicBolt {
public void cleanup() {}
/**
* The bolt will only emit the field "word"
*/
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
WordSplitBolt II
/**
* The bolt will receive the line from the
* words file and process it to split it into words
*/
public void execute(Tuple input, BasicOutputCollector collector) {
String sentence = input.getString(0);
String[] words = sentence.split(" ");
for(String word : words){
word = word.trim();
if(!word.isEmpty()){
word = word.toLowerCase();
collector.emit(new Values(word));
}
}
}
}
WordCountBolt I
package de.codecentric.storm.wordcount.bolts;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
public class WordCountBolt extends BaseBasicBolt {
private static final long serialVersionUID = 1L;
Integer id;
String name;
Map<String, Integer> counters;
WordCountBolt II
/**
* On create
*/
@Override
public void prepare(Map stormConf, TopologyContext context) {
this.counters = new HashMap<String, Integer>();
this.name = context.getThisComponentId();
this.id = context.getThisTaskId();
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
@Override
public void execute(Tuple input, BasicOutputCollector collector) {
String str = input.getString(0);
/**
* If the word doesn't exist in the map we create an entry,
* otherwise we increment its count by 1
*/
if (!counters.containsKey(str)) {
counters.put(str, 1);
} else {
Integer c = counters.get(str) + 1;
counters.put(str, c);
}
}
WordCountBolt III
/**
* At the end of the topology run (when the cluster is shut down) we
* print the word counters
*/
@Override
public void cleanup() {
// Sort map
SortedSet<Map.Entry<String, Integer>> sortedCounts = entriesSortedByValues(counters);
System.out.println("-- Word Counter [" + name + "-" + id + "] --");
for (Map.Entry<String, Integer> entry : sortedCounts) {
System.out.println(entry.getKey() + ": " + entry.getValue());
}
}
…
}
WordCountTopology
public class WordCountTopology {
public static void main(String[] args) throws InterruptedException {
// Topology definition
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word-reader",new FileReaderSpout());
builder.setBolt("word-normalizer", new WordSplitBolt())
.shuffleGrouping("word-reader");
builder.setBolt("word-counter", new WordCountBolt(),1)
.fieldsGrouping("word-normalizer", new Fields("word"));
// Configuration
Config conf = new Config();
conf.put("wordsFile", args[0]);
conf.setDebug(false);
// Run Topology
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count-topology", conf, builder.createTopology());
// You don't do this on a regular topology
Utils.sleep(10000);
cluster.killTopology("word-count-topology");
cluster.shutdown();
}
}
Stream Grouping
• Each spout or bolt might be running n instances in parallel
• Groupings are used to decide which task in the subscribing bolt a tuple is sent to
• Possible Groupings:
Grouping – Feature
Shuffle – Random grouping
Fields – Grouped by value, such that equal values result in the same task
All – Replicates to all tasks
Global – Makes all tuples go to one task
None – Makes the bolt run in the same thread as the bolt/spout it subscribes to
Direct – The producer (the task that emits) controls which consumer will receive
Local – If the target bolt has one or more tasks in the same worker process, tuples are shuffled to just those in-process tasks
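As a conceptual sketch of why fields grouping keeps equal values on the same task: one can picture the target task being chosen by hashing the grouped field value modulo the task count. This is a simplified model for illustration (the helper below is hypothetical, not Storm's exact implementation):

```java
// Simplified model of fields grouping: hash the grouped field value,
// take it modulo the number of tasks. Equal values always land on the
// same task, which is what makes a fields-grouped WordCountBolt
// correct when it runs with more than one task.
public class FieldsGroupingSketch {
    static int targetTask(String fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        // Same word -> same task, every time:
        System.out.println("storm -> task " + targetTask("storm", numTasks));
        System.out.println("storm -> task " + targetTask("storm", numTasks));
        System.out.println("hadoop -> task " + targetTask("hadoop", numTasks));
    }
}
```

A shuffle grouping, by contrast, would pick the task at random, spreading equal words across tasks.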
Key features of Twitter Storm
Storm is
• Fast & scalable
• Fault-tolerant
• Guaranteeing message processing
• Easy to set up & operate
• Free & open source
Extremely performant
Parallelism
Number of worker nodes = 2
Number of worker slots per node = 4
Number of topology workers = 4
FileReaderSpout: parallelism hint = 2, number of tasks not specified = same as parallelism hint = 2
WordSplitBolt: parallelism hint = 4, number of tasks = 8
WordCountBolt: parallelism hint = 6, number of tasks not specified = 6
Number of component instances = 2 + 8 + 6 = 16
Number of executor threads = 2 + 4 + 6 = 12
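The totals above can be checked with a little arithmetic, assuming (as on the slide) that the number of tasks defaults to the parallelism hint when it is not specified:

```java
// Verifies the parallelism arithmetic from the slide: tasks map to
// component instances, parallelism hints map to executor threads.
public class ParallelismMath {
    public static void main(String[] args) {
        int spoutHint = 2, spoutTasks = 2;   // not specified -> same as hint
        int splitHint = 4, splitTasks = 8;   // explicitly set to 8
        int countHint = 6, countTasks = 6;   // not specified -> same as hint

        int instances = spoutTasks + splitTasks + countTasks;
        int executors = spoutHint + splitHint + countHint;

        System.out.println("component instances: " + instances); // 16
        System.out.println("executor threads: " + executors);    // 12
    }
}
```

Note that WordSplitBolt's 8 tasks are multiplexed onto its 4 executors, two tasks per thread.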
Message passing
[Diagram: message passing inside and between worker processes]
• Each worker has a receive thread feeding a receiver queue, per-executor queues, an internal transfer queue, and a transfer queue drained by a transfer thread that sends to other workers
• Inter-process communication is mediated by ZeroMQ; transfer to other workers is done with Kryo serialization
• Local communication is mediated by the LMAX Disruptor; in-process transfer is done with no serialization
Key features of Twitter Storm
Storm is
• Fast & scalable
• Fault-tolerant
• Guaranteeing message processing
• Easy to set up & operate
• Free & open source
Fault tolerance
Scenario: Cluster works normally
• Nimbus monitors the cluster state via ZooKeeper
• Supervisors synchronize assignments and send heartbeats via ZooKeeper
• Workers send executor heartbeats; the supervisor reads worker heartbeats from the local filesystem
Fault tolerance
Scenario: Nimbus goes down
Processing will still continue, but topology lifecycle operations and the reassignment facility are lost.
Fault tolerance
Scenario: Worker node goes down
Nimbus will reassign the tasks to other machines and the processing will continue.
Fault tolerance
Scenario: Supervisor goes down
Processing will still continue, but assignments are never synchronized.
Fault tolerance
Scenario: Worker process goes down
The supervisor will restart the worker process and the processing will continue.
Key features of Twitter Storm
Storm is
• Fast & scalable
• Fault-tolerant
• Guaranteeing message processing
• Easy to set up & operate
• Free & open source
Reliability API
public class FileReaderSpout extends BaseRichSpout {
public void nextTuple() {
…;
// Emitting the tuple with a message ID
UUID msgId = getMsgID();
collector.emit(new Values(line), msgId);
}
public void ack(Object msgId) {
// Do something with the acked message id
}
public void fail(Object msgId) {
// Do something with the failed message id
}
}
public class WordSplitBolt extends BaseRichBolt {
private OutputCollector collector;
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.collector = collector;
}
public void execute(Tuple input) {
for (String s : input.getString(0).split("\\s")) {
// Anchoring the incoming tuple to the outgoing tuple
collector.emit(input, new Values(s));
}
// Sending the ack
collector.ack(input);
}
}
[Diagram: tuple tree — the line "This is a line" is split into its word tuples, each anchored to the incoming tuple]
• Emitting the tuple with a message ID
• Anchoring the incoming tuple to the outgoing tuples
• Sending the ack
ACKing Framework
[Diagram: ACKer implicit bolt tracking the tuple tree of FileReaderSpout → WordSplitBolt → WordCountBolt via ACKer init, ACKer ack, and ACKer fail messages]
• Emitted tuple A: XOR tuple A id with the ack val
• Emitted tuple B: XOR tuple B id with the ack val
• Emitted tuple C: XOR tuple C id with the ack val
• Acked tuple A: XOR tuple A id with the ack val
• Acked tuple B: XOR tuple B id with the ack val
• Acked tuple C: XOR tuple C id with the ack val
The ACKer keeps one entry per spout tuple: Spout Tuple ID, Spout Task ID, ACK val (64 bit).
Once the ACK val has become 0, the ACKer implicit bolt knows that the tuple tree has been completed.
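The XOR bookkeeping can be demonstrated in a few lines of plain Java: each tuple id is XORed into the 64-bit ack val once when the tuple is emitted and once when it is acked, and since x ^ x = 0, the ack val returns to 0 exactly when every emitted tuple has been acked, in any order.

```java
import java.util.Random;

// Demonstrates the ack-val trick: XOR every tuple id in on emit and
// again on ack. The 64-bit value becomes 0 precisely when the tuple
// tree is complete, using constant memory per spout tuple.
public class AckValDemo {
    public static void main(String[] args) {
        Random random = new Random();
        long[] tupleIds = {random.nextLong(), random.nextLong(), random.nextLong()};

        long ackVal = 0L;
        for (long id : tupleIds) ackVal ^= id;  // emitted A, B, C
        System.out.println("after emits, ack val is zero: " + (ackVal == 0L));

        for (long id : tupleIds) ackVal ^= id;  // acked A, B, C
        System.out.println("after acks, ack val is zero: " + (ackVal == 0L));
    }
}
```

The scheme is probabilistic: a tree could in principle be declared complete early if random ids happened to XOR to 0, but with 64-bit ids that chance is negligible.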
Key features of Twitter Storm
Storm is
• Fast & scalable
• Fault-tolerant
• Guaranteeing message processing
• Easy to set up & operate
• Free & open source
Cluster Setup
• Set up a ZooKeeper cluster
• Install dependencies on the Nimbus and worker machines
– ZeroMQ 2.1.7 and JZMQ
– Java 6 and Python 2.6.6
– unzip
• Download and extract a Storm release to the Nimbus and worker machines
• Fill in the mandatory configuration in storm.yaml
• Launch the daemons under supervision using the storm scripts
• Start a topology:
– storm jar <path_topology_jar> <main_class> <arg1> … <argN>
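The mandatory storm.yaml configuration boils down to a handful of keys; a minimal sketch might look like this (hostnames and the local directory are placeholders):

```yaml
# Minimal storm.yaml sketch -- hostnames/paths are placeholders
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
nimbus.host: "nimbus.example.com"
storm.local.dir: "/var/storm"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```

The number of ports under supervisor.slots.ports determines how many worker slots each node offers (four per node in the parallelism example earlier).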
Cluster Summary
Topology Summary
Component Summary
Key features of Twitter Storm
Storm is
• Fast & scalable
• Fault-tolerant
• Guaranteeing message processing
• Easy to set up & operate
• Free & open source
Basic resources
• Storm is available under the Eclipse Public License 1.0 at
– http://storm-project.net/
– https://github.com/nathanmarz/storm
• Get help on
– http://groups.google.com/group/storm-user
– the #storm-user freenode room
• Follow @stormprocessor and @nathanmarz
Many contributions
• Community repository for modules to use Storm at
– https://github.com/nathanmarz/storm-contrib
– including integration with Redis, Kafka, MongoDB, HBase, JMS, Amazon SQS, …
• Good articles for understanding Storm internals
– http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
– http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
• Good slides for understanding real-life examples
– http://www.slideshare.net/DanLynn1/storm-as-deep-into-realtime-data-processing-as-you-can-get-in-30-minutes
– http://www.slideshare.net/KrishnaGade2/storm-at-twitter
Coming next…
• Current release: 0.8.2
• Work in progress (newest): 0.9.0-wip21
– SLF4J and Logback
– Pluggable tuple serialization and Blowfish encryption
– Pluggable inter-process messaging and Netty implementation
– Some bug fixes
– And more
• Storm on YARN
Agenda
• Why Twitter Storm?
• What is Twitter Storm?
• What to do with Twitter Storm?
One example: Webshop
• Web tracking component
• No defined page impression
• Identifying page impressions using Varnish logs of the click-stream data
• A page consists of different fragments
– Body
– Article description
– Recommendation box, …
• Session data is also of interest
One example: Webshop
• Custom solution using J2EE and MongoDB
• Export into comScore DAx and the Enterprise DWH
• The solution currently works but does not scale
• What about performance?
Topology Architecture