Real-Time Analytics on High-Velocity Streaming Data, by Guangyu Wu
CeADAR
‣ Application development & proof of concept
‣ Business-value driven
‣ Market pull/need-driven
‣ Website: http://ceadar.ie/
(Diagram: University, CeADAR, Enterprise)
CeADAR
Visualisation & Analytic Interfaces
• ‘Beyond the desktop’
• Ease of interaction
• Changing user behaviour
• Passive analytics
Data Management for Analytics
• Reduce data management effort for analytics
• Data validation
• Relevance of events to relationships
• Data curation (determining useful data)
• Adaptive ETL (Extract, Transform, Load)
Advanced Analytics
• Causation challenge
• Live topic monitoring
• Social trending and contextualisation
• Continuous analytics
• Social identity fingerprinting
Overview
‣ Introduce different frameworks:
‣ Spark, Storm, Trident
‣ Continuous Clustering project
‣ Continuous Metrics project
‣ Stream Converge project
Spark
‣ Spark is a platform for distributed batch data processing.
‣ Spark includes a number of extensions: Spark Streaming, Spark SQL, MLlib, GraphX.
‣ Spark runs batch jobs predominantly in memory.
‣ Spark Streaming integrates stream processing with batch processing by treating a data stream as a sequence of small batches of data points (micro-batches).
‣ Spark Streaming maintains computation state across micro-batches.
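The micro-batch idea can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark Streaming's API: the function names, the integer millisecond timestamps and the 1,000 ms interval are all assumptions made for illustration. It slices an ordered stream into fixed-length windows, runs an ordinary batch computation on each window, and carries a running count forward as state.

```python
from collections import Counter

def micro_batches(events, interval_ms):
    """Group (timestamp_ms, value) events into fixed-length windows.

    Assumes events arrive in timestamp order. Yields (window_start, values).
    Windows that contain no events are skipped.
    """
    batch, window_start = [], None
    for ts, value in events:
        start = (ts // interval_ms) * interval_ms
        if window_start is None:
            window_start = start
        if start != window_start:
            yield window_start, batch
            batch, window_start = [], start
        batch.append(value)
    if batch:
        yield window_start, batch

def run(events, interval_ms):
    """Process each micro-batch as a small batch job and keep state between them."""
    state = Counter()  # state carried across micro-batches
    snapshots = []
    for window_start, batch in micro_batches(events, interval_ms):
        state.update(Counter(batch))  # the per-batch "job": count values
        snapshots.append((window_start, dict(state)))
    return snapshots

# Hypothetical input: (timestamp in ms, message key)
events = [(200, "a"), (700, "b"), (1100, "a"), (1900, "a"), (3400, "b")]
snapshots = run(events, interval_ms=1000)
```

Each snapshot shows the accumulated counts after one micro-batch. Results only become visible once a window closes, which is one reason micro-batching adds latency compared with per-message processing.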
Storm
‣ A Storm topology is composed of spouts and bolts.
‣ Storm operates over individual data points.
‣ Storm is designed purely for stream processing.
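The spout-and-bolt model can be sketched in a few lines of plain Python. This is a single-process illustration only and does not use Storm's API: the class names `SentenceSpout`, `SplitBolt`, `CountBolt` and the helper `run_topology` are hypothetical. A real Storm topology is a graph of components distributed across worker processes, whereas this sketch uses a simple linear chain.

```python
class SentenceSpout:
    """Hypothetical spout: the source of the stream, emitting one tuple at a time."""
    def __init__(self, sentences):
        self.sentences = sentences

    def emit(self):
        for sentence in self.sentences:
            yield (sentence,)

class SplitBolt:
    """Hypothetical bolt: consumes one tuple and emits one tuple per word."""
    def process(self, tup):
        for word in tup[0].split():
            yield (word.lower(),)

class CountBolt:
    """Hypothetical bolt: keeps a running count per word."""
    def __init__(self):
        self.counts = {}

    def process(self, tup):
        word = tup[0]
        self.counts[word] = self.counts.get(word, 0) + 1
        yield (word, self.counts[word])

def run_topology(spout, bolts):
    """Push every tuple through the chain of bolts as soon as it is emitted."""
    emitted = []
    for tup in spout.emit():
        pending = [tup]
        for bolt in bolts:
            pending = [out for t in pending for out in bolt.process(t)]
        emitted.extend(pending)
    return emitted

counter = CountBolt()
results = run_topology(SentenceSpout(["Win a prize", "win now"]),
                       [SplitBolt(), counter])
```

Every data point flows through the whole chain individually, with no batching step in between.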
Trident
‣ Trident is a high-level programming abstraction built on top of Storm.
‣ It provides a number of useful functions such as aggregations and filters.
‣ An application can be designed and implemented using these high-level abstractions, and Trident converts the logic into a standard Storm topology under the hood.
‣ Trident works over micro-batches of data.
‣ Trident also has built-in support for maintaining processing state and state query.
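The idea of declaring a pipeline with high-level operations and having it compiled into lower-level stages can be sketched as follows. This is plain Python and not Trident's API: `BatchStream`, `persistent_count` and `build` are hypothetical names chosen for illustration. Each fluent call only records a stage; `build` turns the recorded stages into one function applied to every micro-batch, and the state dictionary can be queried at any time.

```python
class BatchStream:
    """Hypothetical fluent builder: each call records a stage instead of running it."""
    def __init__(self):
        self.stages = []

    def filter(self, predicate):
        # Keep only the items of a batch that satisfy the predicate.
        self.stages.append(lambda batch: [x for x in batch if predicate(x)])
        return self

    def persistent_count(self, state):
        # Aggregate counts per key into a state store that outlives each batch.
        def stage(batch):
            for key in batch:
                state[key] = state.get(key, 0) + 1
            return batch
        self.stages.append(stage)
        return self

    def build(self):
        # "Compile" the declared stages into a single per-batch function.
        stages = list(self.stages)
        def process(batch):
            for stage in stages:
                batch = stage(batch)
            return batch
        return process

state = {}  # stand-in for a state store; a plain dict here
process = (BatchStream()
           .filter(lambda sender: sender != "unknown")
           .persistent_count(state)
           .build())

for batch in [["a", "unknown", "b"], ["a", "a"]]:
    process(batch)

count_for_a = state.get("a", 0)  # state query
```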
Methodology
‣ Large static batches of messages: Hadoop and off-line batch processing in Spark
‣ Single messages: Storm
‣ Micro-batches of messages (discretised streams): Spark Streaming, Trident
Continuous Clustering
‣ Use case: real-time SMS spam detection in mobile networks.
‣ Clustering SMS messages based on their content is a good way to identify spam.
‣ Many similar spam messages are sent out over a short period of time.
Continuous Clustering
‣ Problems with traditional clustering algorithms: they…
‣ work off-line over historical data
‣ require multiple passes over the data
‣ are not incrementally updatable
‣ are hard to scale to ‘big’ data
‣ CeADAR solution: we developed a novel single-pass, scalable data stream clustering algorithm implemented on Storm.
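The slides do not describe the CeADAR algorithm itself. To show what single-pass, incrementally updatable clustering can look like, here is a generic leader-clustering sketch, which is a well-known technique swapped in for illustration and not the CeADAR method. The whitespace tokenisation, the Jaccard similarity measure and the 0.75 threshold are all assumptions. Each message is seen exactly once: it joins the most similar existing cluster if the similarity reaches the threshold, otherwise it starts a new cluster.

```python
def tokens(message):
    """Represent a message as a set of lower-cased words (an assumed representation)."""
    return frozenset(message.lower().split())

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

class LeaderClusterer:
    """Generic leader clustering: one pass over the stream, updated per message."""
    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.leaders = []  # token set of the first message in each cluster
        self.sizes = []    # number of messages assigned to each cluster

    def add(self, message):
        """Assign a message to a cluster and return the cluster index."""
        toks = tokens(message)
        best, best_sim = None, 0.0
        for i, leader in enumerate(self.leaders):
            sim = jaccard(toks, leader)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.sizes[best] += 1
            return best
        self.leaders.append(toks)
        self.sizes.append(1)
        return len(self.leaders) - 1

clusterer = LeaderClusterer(threshold=0.75)
stream = [
    "You have won a free prize call now",
    "You have won a free prize call today",
    "Are we still meeting for lunch",
    "you have won a free prize call now",
]
labels = [clusterer.add(m) for m in stream]
```

The linear scan over all leaders is the simplest possible lookup. It grows with the number of clusters, so a scalable version would need some form of indexing to find candidate clusters quickly.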
Deployment
‣ Our compute cluster is composed of 4 machines.
‣ Each machine:
‣ Intel Xeon CPU E5-2630 0 @ 2.30GHz with 24 cores
‣ 64 GB memory
‣ 1 TB disk
‣ Spark, Storm, Hadoop, Kafka, Redis
Continuous Clustering
‣ US tier 1 mobile operator
‣ ~500 messages/second average
‣ ~1,300 messages/second peak
‣ Near-exact matching: 35,913
‣ Matching threshold 75%: 8,160
Continuous Metrics
‣ Evaluate and compare Storm, Storm Trident and Spark Streaming on the task of computing a set of statistical metrics in real-time over a continuous stream of data.
‣ Evaluate and compare
‣ Throughput: the volume and velocity of data that can be processed on different configurations and hardware.
‣ Latency: the time delay between a new data point being received and the updated metrics being computed.
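The slides do not list which statistical metrics were computed. As an assumed example of metrics that can be updated per data point, the sketch below maintains count, mean, sample variance, minimum and maximum using Welford's online algorithm. The class name `RunningStats` and the sample values are hypothetical. Each update touches only a constant amount of state, so the metrics are available immediately after every new data point.

```python
import math

class RunningStats:
    """Running count, mean, variance, min and max via Welford's online algorithm."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, x):
        """Fold one new data point into the running metrics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self):
        """Sample variance (divides by n - 1); 0.0 until two points are seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
```

In a per-message system the update would run once per data point. In a micro-batch system the same update could be applied to every point in a batch before the metrics are published.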
Continuous Metrics
‣ High-level results overview
‣ Spark Streaming achieves the highest throughput, with Storm at the other end with the lowest throughput.
‣ However, Storm achieves the best latency by a considerable margin. Spark Streaming and Trident both exhibit considerably higher latency, which is due at least in part to their micro-batch data processing approach.
‣ The evaluation produced many other insights, learnings and recommendations relating to these real-time platforms.
Stream Converge
‣ Current project: process and combine heterogeneous data streams from diverse sources using Spark Streaming.
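The slides give no detail on how the streams are combined, and the project uses Spark Streaming rather than plain Python. As a minimal sketch of the general idea only, the example below normalises two differently formatted sources into one common record shape and merges them by timestamp. The field names, the two input formats and the helper names are all hypothetical, and each source is assumed to be already ordered by time.

```python
import heapq
import json

def from_csv_lines(lines):
    """Hypothetical source 1: 'timestamp,source,value' text lines."""
    for line in lines:
        ts, source, value = line.split(",")
        yield {"ts": int(ts), "source": source, "value": float(value)}

def from_json_lines(lines):
    """Hypothetical source 2: JSON objects with different field names."""
    for line in lines:
        rec = json.loads(line)
        yield {"ts": rec["time"], "source": rec["id"], "value": float(rec["reading"])}

def converge(*streams):
    """Lazily merge time-ordered streams into one time-ordered stream."""
    return heapq.merge(*streams, key=lambda rec: rec["ts"])

csv_lines = ["1,csv-a,0.5", "4,csv-a,0.7"]
json_lines = [
    '{"time": 2, "id": "json-b", "reading": 1.5}',
    '{"time": 3, "id": "json-b", "reading": "2.5"}',
]
merged = list(converge(from_csv_lines(csv_lines), from_json_lines(json_lines)))
```

Normalising first and merging second keeps the format-specific parsing at the edges, so downstream processing only has to handle one record shape.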