real time analytics on high velocity streaming data by guangyu wu

Real-time Analytics on High Velocity Streaming Data

Guangyu Wu @CeADAR

CeADAR

‣ Application development & proof of concept

‣ Business-value driven

‣ Market pull/need-driven

‣ Website: http://ceadar.ie/

University CeADAR Enterprise


CeADAR

Visualisation & Analytic Interfaces

• ‘Beyond the desktop’

• Ease of interaction

• Changing user behaviour

• Passive analytics

Data Management for Analytics

• Reduce data management effort for analytics

• Data validation

• Relevance of events to relationships

• Data curation (determining useful data)

• Adaptive ETL (Extract, Transform, Load)

Advanced Analytics

• Causation challenge

• Live topic monitoring

• Social trending and contextualisation

• Continuous analytics

• Social identity fingerprinting

Overview

‣ Introduce different frameworks:

‣ Spark, Storm, Trident

‣ Continuous Clustering project

‣ Continuous Metrics project

‣ Stream Converge project

Spark

‣ Spark is a platform for distributed batch data processing.

‣ Spark includes a number of extensions: Spark Streaming, Spark SQL, MLlib, GraphX.

‣ Spark runs batch jobs predominantly in memory.

‣ Spark Streaming integrates stream processing with batch processing by treating a data stream as a sequence of small batches of data points, or micro-batches.

‣ Spark Streaming maintains computation state across micro-batches.
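The micro-batch idea can be sketched in plain Python, without Spark. This is only an illustration: the names `micro_batches` and `run_word_count` are hypothetical, and batches are cut by item count here for simplicity, whereas Spark Streaming cuts them by a time interval.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut an iterable stream into consecutive lists of up to batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_word_count(stream, batch_size):
    """Run an ordinary batch job (Counter) on each micro-batch and fold
    the result into state that persists across batches."""
    state = Counter()                      # running totals carried between batches
    snapshots = []
    for batch in micro_batches(stream, batch_size):
        state.update(Counter(batch))       # merge this batch's result into the state
        snapshots.append(dict(state))
    return snapshots


snapshots = run_word_count(["a", "b", "a", "c", "a", "b"], batch_size=2)
```

Each entry in `snapshots` is the state after one micro-batch, so results only change at batch boundaries.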

Storm

‣ A Storm topology is composed of spouts (stream sources) and bolts (processing steps).

‣ Storm operates over individual data points.

‣ Storm is designed purely for stream processing.
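The spout-and-bolt structure can be mimicked with Python generators. This is an analogy only and does not use the Storm API; all function names are hypothetical. Each data point flows through the whole chain individually, which is the per-message model described above.

```python
def sentence_spout(sentences):
    """Spout: the source of the stream, emitting one tuple at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(tuples):
    """Bolt: consumes each incoming tuple and emits zero or more new tuples."""
    for sentence in tuples:
        for word in sentence.split():
            yield word


def count_bolt(tuples):
    """Bolt: updates a running count and emits the new total for each tuple."""
    counts = {}
    for word in tuples:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


def run_topology(sentences):
    """Wire spout -> split bolt -> count bolt and collect the emitted tuples."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))


emitted = run_topology(["free prize now", "free prize"])
```

Because generators are lazy, every word reaches the counting step before the next sentence is read, so an updated count is available after each single message.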

Trident

‣ Trident is a high-level programming abstraction built on top of Storm.

‣ It provides a number of useful functions such as aggregations and filters.

‣ An application can be designed and implemented using these high-level abstractions, and Trident converts the logic into a standard Storm topology under the hood.

‣ Trident works over micro-batches of data.

‣ Trident also has built-in support for maintaining processing state and state query.
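A rough plain-Python picture of filtering, aggregating and querying state over micro-batches follows. It does not use the Trident API; the class `MicroBatchState` and its methods are hypothetical stand-ins, and the event values are made up for illustration.

```python
class MicroBatchState:
    """Hypothetical stand-in for persistent, queryable processing state."""

    def __init__(self):
        self._totals = {}

    def apply_batch(self, batch, keep, key):
        """Filter one micro-batch, aggregate it, then merge into the state."""
        partial = {}
        for item in batch:
            if keep(item):                          # filter step
                k = key(item)
                partial[k] = partial.get(k, 0) + 1  # per-batch aggregation
        for k, n in partial.items():                # one state update per key
            self._totals[k] = self._totals.get(k, 0) + n

    def query(self, k):
        """State query: read the current aggregate for one key."""
        return self._totals.get(k, 0)


state = MicroBatchState()
batches = [
    [("alice", 120), ("bob", 5), ("alice", 300)],
    [("bob", 250), ("alice", 1)],
]
for batch in batches:
    # Keep events with a value of at least 100 and count them per user.
    state.apply_batch(batch, keep=lambda e: e[1] >= 100, key=lambda e: e[0])
```

Aggregating within the batch first means the state is touched once per key per batch rather than once per message.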

Methodology

‣ Large static batches of messages: Hadoop and off-line batch processing in Spark

‣ Single messages: Storm

‣ Micro-batches of messages (discretised streams): Spark Streaming, Trident

Continuous Clustering

‣ Use case: real-time SMS spam detection in mobile networks.

‣ Clustering SMS messages based on their content is a good way to identify spam.

‣ Many similar spam messages are sent out over a short period of time.

Continuous Clustering

‣ Problems with traditional clustering algorithms: they

‣ work off-line over historical data

‣ require multiple passes over the data

‣ are not incrementally updatable

‣ are hard to scale to ‘big’ data

‣ CeADAR solution: we developed a novel single-pass, scalable data stream clustering algorithm implemented on Storm.
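To make "single pass" concrete, here is a minimal sketch of a generic leader-style clusterer. It is not the CeADAR algorithm, whose details are not given here. The Jaccard similarity on word sets, the 0.75 threshold and all names are assumptions chosen for illustration.

```python
def tokens(message):
    """Lower-cased set of words in a message."""
    return set(message.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class SinglePassClusterer:
    """Leader clustering: each message is seen once and never revisited."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.leaders = []   # token set of the first message in each cluster
        self.sizes = []     # number of messages assigned to each cluster

    def add(self, message):
        """Assign the message to the best matching cluster, or start a new one."""
        toks = tokens(message)
        best, best_sim = None, 0.0
        for i, leader in enumerate(self.leaders):
            sim = jaccard(toks, leader)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.sizes[best] += 1
            return best
        self.leaders.append(toks)
        self.sizes.append(1)
        return len(self.leaders) - 1


clusterer = SinglePassClusterer(threshold=0.75)
labels = [clusterer.add(m) for m in [
    "win a free prize now call 12345",
    "win a free prize now call 67890",
    "are we still meeting for lunch",
    "win a free prize today call 12345",
]]
```

The three near-identical messages fall into one cluster and the ordinary message starts its own. The linear scan over leaders is the obvious bottleneck in this sketch; a scalable version would need some form of indexing.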

Continuous Clustering

https://ceadar.ucd.ie/demo/continuousclustering/

Deployment

‣ Our compute cluster is composed of 4 machines.

‣ Each machine:

‣ Intel Xeon CPU E5-2630 0 @ 2.30GHz with 24 cores

‣ 64 GB memory

‣ 1 TB disk

‣ Spark, Storm, Hadoop, Kafka, Redis

Continuous Clustering

‣ US tier 1 mobile operator

‣ ~500 messages/second average

‣ ~1,300 messages/second peak

‣ Near-exact matching: 35,913

‣ Matching threshold 75%: 8,160

Continuous Metrics

‣ Evaluate and compare Storm, Storm Trident and Spark Streaming on the task of computing a set of statistical metrics in real time over a continuous stream of data.

‣ Evaluate and compare

‣ Throughput: the volume and velocity of data that can be processed on different configurations and hardware.

‣ Latency: the time delay between a new data point being received and the updated metrics being computed.
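The two measures can be written down directly from their definitions. The sketch below uses hypothetical timestamps (in seconds) purely for illustration; they are not measurements from the evaluation.

```python
def throughput(num_points, elapsed_seconds):
    """Data points processed per second over a measurement interval."""
    return num_points / elapsed_seconds


def latencies(received_at, updated_at):
    """Per-point delay between receipt and the metrics update that includes it."""
    return [done - start for start, done in zip(received_at, updated_at)]


# Hypothetical timestamps in seconds; illustrative values, not measurements.
received_at = [0.0, 0.5, 1.0, 1.5]
updated_at = [0.25, 0.75, 1.5, 2.0]

delays = latencies(received_at, updated_at)
mean_latency = sum(delays) / len(delays)
rate = throughput(len(received_at), updated_at[-1] - received_at[0])
```

Throughput describes the system as a whole, while latency is a per-point quantity, so it is usually summarised by a mean or a percentile.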


Sliding Windows

‣ By items

‣ By time
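The two window types can be sketched with a `deque`, here computing a running mean as the example metric. The class names are hypothetical, and the time-based window assumes timestamps arrive in order.

```python
from collections import deque


class CountWindow:
    """Sliding window holding the most recent `size` items."""

    def __init__(self, size):
        self.items = deque(maxlen=size)   # oldest item drops out automatically

    def add(self, value):
        self.items.append(value)
        return sum(self.items) / len(self.items)


class TimeWindow:
    """Sliding window holding items from the last `span` seconds."""

    def __init__(self, span):
        self.span = span
        self.items = deque()              # (timestamp, value), oldest first

    def add(self, timestamp, value):
        self.items.append((timestamp, value))
        # Evict everything that is `span` seconds old or older.
        while self.items[0][0] <= timestamp - self.span:
            self.items.popleft()
        return sum(v for _, v in self.items) / len(self.items)


by_items = CountWindow(size=3)
item_means = [by_items.add(v) for v in [1, 2, 3, 4]]

by_time = TimeWindow(span=10)
time_means = [by_time.add(t, v) for t, v in [(0, 10), (5, 20), (12, 30)]]
```

A window by items always holds a fixed number of points, whereas a window by time holds however many points arrived within the span, so its size varies with the data rate.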

Continuous Metrics

‣ High-level results overview

‣ Spark Streaming achieves the highest throughput and Storm the lowest.

‣ However, Storm achieves the best latency by a considerable margin. Spark Streaming and Trident both exhibit considerably higher latency, due at least in part to their micro-batch processing approach.

‣ The evaluation produced many other insights, learnings and recommendations relating to these real-time platforms.

Stream Converge

‣ Current project: process and combine heterogeneous data streams from diverse sources using Spark Streaming.

Stream Converge

‣ Challenges:

‣ managing data streams with different frequencies.

‣ linking together events across different streams via complex key relationships.

‣ handling out-of-order arrival of data.

‣ ……
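One common way to handle out-of-order arrival is to buffer events briefly and release them once a watermark has passed. The sketch below is a generic illustration of that idea, not the Stream Converge design; the `ReorderBuffer` class and the `allowed_lateness` parameter are hypothetical.

```python
import heapq


class ReorderBuffer:
    """Holds events briefly so they can be released in event-time order."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.heap = []            # min-heap of (event_time, seq, payload)
        self.max_seen = None      # largest event time observed so far
        self.seq = 0              # tie-breaker so payloads are never compared

    def push(self, event_time, payload):
        """Add one event and return the events that are now safe to emit."""
        heapq.heappush(self.heap, (event_time, self.seq, payload))
        self.seq += 1
        if self.max_seen is None or event_time > self.max_seen:
            self.max_seen = event_time
        watermark = self.max_seen - self.allowed_lateness
        ready = []
        while self.heap and self.heap[0][0] <= watermark:
            t, _, p = heapq.heappop(self.heap)
            ready.append((t, p))
        return ready

    def flush(self):
        """Emit whatever is left, in event-time order, at end of stream."""
        ready = []
        while self.heap:
            t, _, p = heapq.heappop(self.heap)
            ready.append((t, p))
        return ready


buffer = ReorderBuffer(allowed_lateness=2)
ordered = []
for event_time, payload in [(1, "a"), (3, "c"), (2, "b"), (6, "e"), (5, "d")]:
    ordered.extend(buffer.push(event_time, payload))
ordered.extend(buffer.flush())
```

The trade-off is between added latency and ordering: a larger `allowed_lateness` tolerates more disorder but delays every event, and an event that arrives later than the allowance is still emitted out of order.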

