Real-time Big Data Analytics with Storm
June 2013
Ron Bodkin Founder & CEO, Think Big
No SQL Roadshow| 2
Accelerating Your Time to Value
Strategy and Roadmap
IMAGINE Training
and Education
ILLUMINATE Hands-On
Data Science and Data Engineering
IMPLEMENT
Leading Provider of Data Science and Engineering Services
No SQL Roadshow| 3
Agenda
� What is Storm? � When is Storm appropriate? � API Options � Use Cases
No SQL Roadshow| 4
� DAG Processing of never ending streams of data - Open Source: https://github.com/nathanmarz/storm/wiki - Used at Twitter plus > 24 other companies - Reliable - At Least Once semantics - Think MapReduce for data streams - Java / Clojure based - Bolts in Java and ‘Shell Bolts’ - Not a queue, but usually reads from a queue.
� Related: - S4, CEP
� Compromises - Static topologies & cluster sizing – Avoid messy dynamic rebalancing - Nimbus – SPOF
� Strong Community Support, emerging commercial support
Storm Overview
No SQL Roadshow| 5
� Cluster - Nimbus - Zookeeper - Supervisor, Workers
� Topology � Streams � Tuple � Spout � Bolt � Stream Groupings
- Shuffle, Field � Trident � DRPC
Storm Concepts
No SQL Roadshow| 6
What is Real-Time?
� Low latency - Query response - Data refresh - End-to-end response
� … nanoseconds, milliseconds, seconds, or minutes depending on your problem
No SQL Roadshow| 7
� Better end-user experience - Ex: View an ad, see the counter move.
� Operational intelligence - Low latency analysis - Operational intelligence
� Event response - Content half life measured at 3 hours (H Mason: http://bit.ly/nu7IDw) - Path to additional real-time capabilities - Scalable analysis - Example: Trend analysis to recommend ‘hot’ articles.
Why Real-time?
No SQL Roadshow| 8
� Scalable way to manage queue readers and logic pipeline � Much better than roll your own � Reliable (Message guarantees, fault tolerant) � Multi-node scaling (1MM messages / 10 nodes) � It works � Open source community � See also: https://github.com/nathanmarz/storm/wiki/Rationale
Why Storm?
No SQL Roadshow| 9
� Raw Java API (tuples and raw computation) � Trident – Java DSL (SQL-like primitives) � Python adapter � Closure DSL � RedStorm – Ruby � Wukong – Ruby � Trident-ML � Storm-Esper
Programming Options
No SQL Roadshow| 10
Use Cases
� Scale analysis pipeline � Lively stats � Recommendations � Better predictions � Realtime analytics � Online machine learning � Distributed RPC
No SQL Roadshow| 11
Overall Architecture
NoSQL
Memcache (Tuple fail tracking)
Queue
Hadoop
Ad Serving
LB
Edge
Edge
Impression
S3S3
S3DFS
Archive Logs
ManagementServer
LB
Edge
Edge
RelationalStore
Ad ManagementAd Selling
Storm- Queue Management- Simple Bot Filtering- Real-time Bucketization- Performance Counters- Event Logging
View Ad
CleansingModel TrainingRecommendations
Events
Monitoring & Alerting (Metrics, Alarms, Notifications)
Model Parameters
getPrediction
Performance CountersImpression Buckets
No SQL Roadshow| 12
Integration Patterns
No SQL Roadshow| 13
Analytics Architecture
Storm
Web Server
Time SeriesBucketBoltSimple Bot
Annotator
DFS Adapter
Impression Spout
Time SeriesBuckets (Batch)
Time SeriesBuckets
(Realtime)
ImpressionPrediction
Predictive Model
Parameters
ImpressionsImpressionsImpressions
Hadoop
ImpressionBucketization
Predictive Model
Training
NoSQLBolt
Time
No SQL Roadshow| 14
Analyze Massive Historical
Data Set
Analyze Recent
Past
Realtime Prediction
Solution Approach
Historical Data Set = S3 Analyze = Hadoop + Pig + R
Recent Past = Storm + NoSQL Analyze = R + Web Service
No SQL Roadshow| 15
Example: Real-Time Recommendation Updates
Web Servers
Read & Update
• Requests read single user profile
• May update that profile record • Classic key/value store
problem • Can split transient state into
read/write NoSQL with batch view NoSQL
Batch Log Feed
Batch Model Feed
Hadoop Cluster Batch Export
Classic BI Access Ad Hoc Queries
MPP Database
NoSQL Database
No SQL Roadshow| 16
Real-Time Recommendation Updates w/ Storm
Web Servers
• More complex recommendation logic
• Parallelize analysis across topology
• Processing benefits from distributed logic of streaming
Batch Log Feed
Batch Model Feed
Hadoop Cluster Batch Export
Classic BI Access Ad Hoc Queries
Storm
MPP Database
NoSQL Database
No SQL Roadshow| 17
Social Updates (real-time news feed)
Web Servers
• Pages now integrate many profile records
• Fan-out-on-read or Fan-out-on-write
• Processing benefits from distributed logic of streaming
Batch Log Feed
Batch Model Feed
Hadoop Cluster Batch Export
Classic BI Access Ad Hoc Queries
Storm
MPP Database
NoSQL Database
No SQL Roadshow| 18
Finance: Trade Compliance Monitoring
Order Mgmt System
• Storm reduces market data to usable subset
• Integrates high, medium and low frequency data
• Real-time Storm filter triggers deeper search via NoSQL
Batch Log Feed
Batch Model Feed
Hadoop Cluster
Batch Export
Classic BI Access
Ad Hoc Queries
Storm
MPP Database
NoSQL Database
Market Data Feed
No SQL Roadshow| 19
Operational Intelligence: Stats and Search
Web Servers
• Cache and microbatch updates to multidimensional stats
• Support distributed queries • Cache updates
Hadoop Cluster
Ad Hoc Queries
Storm Sys Logs
Batch System Log Feed
Batch Summary Views
Search & NoSQL DB
No SQL Roadshow| 20
� Easy to develop, hard to debug – start locally - Timeouts
� Storm infinite loop of failures - Use Memcached to count # tuple failures
� At Least Once processing - Hadoop based read-repair job
� Performance Counters not getting flushed - Tick Tuples
� Always ACK � Batching to S3
- Run a compaction & event de-duplication job in Hadoop
Lessons
No SQL Roadshow| 21
Conclusions
� There’s many kinds of real-time problems � Use of Hadoop and/or NoSQL can solve
- Low latency queries - Event response with localized intelligence - Operational intelligence
� Storm is valuable for - Ingesting data within seconds - Complex real-time distributed logic - Operational intelligence
� We’re Hiring!
21 6/12/13