Kanthaka - High Volume CDR Analyzer

Big Data CDR Analyzer

080201N – M.K.P.R. Jayawardhana

080254D – P.K.A.M. Kumara

080331L – W.D.A.I. Paranawithana

080357V – T.D.K. Perera

Project Supervisors: Mr. Thilina Anjitha (hSenid), Dr. Shahani Markus Weerawarana

Overview

• Background

• Current Situation

• Scope and Assumptions

• Kanthaka – Big Data CDR Analyzer System

• Technology Comparison

▫ MapReduce

▫ NoSQL Databases

• Architecture

• Project Plan

• Risks and Possible Remedies

• References

Background

Mobile Promotions

Current Situation

• Promotions are based only on subscribers' network usage

• Only the active call switch is used to trigger promotions

• No way of analyzing and processing high volumes of CDR records

• No efficient CDR analysis method

• No access to historical data

• Complex rules are not supported

Kanthaka to the rescue

• Selects eligible users for promotions based on both commercial organizations and network usage.

E.g. a 20% discount for pizza lovers aged 16-40 who have called Pizza Hut more than 5 times in a month

• High volume CDR analysis.

• Near-real-time selection of eligible users for promotions.

• A CDR Analyzer system which

▫ can process 30 million records per day

▫ can produce results within 10-15 seconds

▫ provides a GUI to define dynamic rules

▫ can be used to offer real-time sales promotions for mobile subscribers
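The pizza promotion rule above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not the Kanthaka implementation: the `Cdr` fields, the `eligible_subscribers` function, the age lookup table, and the phone number are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Cdr:
    """One call detail record; the field names are hypothetical."""
    caller: str      # subscriber who generated the event
    callee: str      # number that was called
    event_type: str  # e.g. "VOICE", "SMS"
    month: str       # month of the event, e.g. "2012-06"


def eligible_subscribers(cdrs, ages, target, month, min_calls, age_range):
    """Return subscribers who called `target` more than `min_calls` times
    in `month` and whose age lies within `age_range` (inclusive)."""
    calls = Counter(
        r.caller
        for r in cdrs
        if r.event_type == "VOICE" and r.callee == target and r.month == month
    )
    low, high = age_range
    return {
        sub
        for sub, n in calls.items()
        if n > min_calls and sub in ages and low <= ages[sub] <= high
    }


# Toy data: A calls the pizza line 6 times, B only twice,
# C calls 7 times but is outside the age group.
PIZZA = "0112729729"  # made-up number
cdrs = (
    [Cdr("A", PIZZA, "VOICE", "2012-06")] * 6
    + [Cdr("B", PIZZA, "VOICE", "2012-06")] * 2
    + [Cdr("C", PIZZA, "VOICE", "2012-06")] * 7
)
ages = {"A": 25, "B": 30, "C": 52}
winners = eligible_subscribers(cdrs, ages, PIZZA, "2012-06", 5, (16, 40))
print(winners)
```

Only subscriber A satisfies both the call-count and the age condition. A full scan like this is the naive approach; the rest of the deck is about finding a store that makes such a check fast at 30 million records per day.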

Scope and Assumptions

Scope

Real system operation: 30 M records, multiple rules, offers the promotion

Operation expected of Kanthaka: 30 M records, a single rule, selects eligible users for promotion only

Assumptions

• CDR records are available only in CSV format.

• Event types include SMS, voice call, MMS, USSD, top-up, GPRS, and LBS.

• CDRs are received by the system asynchronously, in batches.

• Only 6 of the many CDR attributes are considered during processing.
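A minimal sketch of how one CSV batch might be reduced to the attributes of interest, under the assumptions above. The slides do not name the six attributes, so the column names in `KEPT`, the event type codes, and the sample rows are hypothetical.

```python
import csv
import io

# Event types listed in the assumptions; the exact codes are hypothetical.
EVENT_TYPES = {"SMS", "VOICE", "MMS", "USSD", "TOPUP", "GPRS", "LBS"}

# Hypothetical choice of the six attributes kept for processing.
KEPT = ("caller", "callee", "event_type", "timestamp", "duration", "cell_id")


def parse_batch(csv_text):
    """Parse one CSV batch and keep only the KEPT attributes of valid rows."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("event_type") not in EVENT_TYPES:
            continue  # skip rows with an unknown or missing event type
        records.append({name: row[name] for name in KEPT})
    return records


batch = (
    "caller,callee,event_type,timestamp,duration,cell_id,imei,charge\n"
    "0771234567,0112729729,VOICE,2012-06-01T10:15:00,65,C101,3567,4.50\n"
    "0771234567,0719876543,SMS,2012-06-01T10:20:00,0,C101,3567,0.25\n"
    "0771234567,0719876543,FAX,2012-06-01T10:25:00,0,C101,3567,0.00\n"
)
records = parse_batch(batch)
print(len(records), records[0]["event_type"])
```

Dropping the unused columns at parse time keeps the amount of data written to the store small, which matters when memory is the main constraint.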

Technology Comparison

A lot of data + higher speed --> scale-out system
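To put the load in numbers, the 30 million records per day target can be converted into an average ingest rate. The calculation assumes records arrive evenly over the day, which real traffic will not do, so peak rates will be higher.

```python
RECORDS_PER_DAY = 30_000_000    # target stated for the system
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

average_rate = RECORDS_PER_DAY / SECONDS_PER_DAY
print(f"average ingest rate: {average_rate:.0f} records/s")
```

This gives roughly 347 records per second on average, sustained around the clock, while rule results are still expected within 10-15 seconds.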

MapReduce

Hadoop MapReduce

• Can handle a lot of data

• Latency is too high for cases where results are expected in near real time

Word count on a 100 KB file: start time = 01:04:44, end time = 01:05:12, total time = 28 s
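For reference, the word count job has the following shape when written as map, shuffle, and reduce steps. This is plain Python that mimics the programming model, not the Hadoop API; the function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts emitted for one word."""
    return key, sum(values)


lines = ["call sms call", "sms call data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

The computation itself is trivial for a 100 KB input, so the 28 s measured above most likely reflects the fixed cost of starting and scheduling a job rather than the counting work. That fixed cost is what makes the model a poor fit for a 10-15 second response target.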

DB Technology Comparison

• RDBMS

▫ Provides ACID properties

▫ Uses sharding to scale out

▫ Management overhead is huge when scaling out

▫ Performance degrades under higher data load

▫ Less partition tolerant

DB Technology Comparison (contd.)

• NoSQL

▫ Many options available (Cassandra, HBase, MongoDB, Hive)

▫ Promises easy scale-out (many big users, e.g. Facebook, Twitter)

▫ Provides BASE properties, in line with the CAP theorem

▫ Hard to fit the system into a limited data model

▫ Partition tolerant

▫ More memory --> higher performance

DB Technology Comparison (contd.)

• NewSQL

▫ Provides ACID properties

▫ Familiar relational data model

▫ Options available (ScaleDB, VoltDB)

▫ Runs entirely in memory, hence needs a lot of memory

▫ Promises speed

▫ Persistence achieved by replaying logs
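The idea of persistence by log replay can be sketched generically as follows. This is a toy model, not the mechanism of any specific NewSQL product: the `InMemoryStore` class is hypothetical and a Python list stands in for a durable log on disk.

```python
class InMemoryStore:
    """Toy key-value store that lives in memory and logs every write."""

    def __init__(self, log=None):
        self.data = {}
        self.log = []  # stand-in for a durable command log
        for key, value in (log or []):
            self.put(key, value)  # replay: re-apply each logged write in order

    def put(self, key, value):
        self.log.append((key, value))  # record the write first ...
        self.data[key] = value         # ... then apply it in memory


store = InMemoryStore()
store.put("A", 1)
store.put("B", 2)
store.put("A", 3)

# Simulate a restart: the in-memory data is lost, only the log survives.
recovered = InMemoryStore(log=store.log)
print(recovered.data)
```

Because writes are replayed in their original order, the recovered store ends with the latest value for each key. The cost is recovery time, which grows with the length of the log.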

Given its persistence, less restrictive hardware requirements, and proven performance, NoSQL is the best option to try out.

• Cassandra – a key-value column-family store (used at Facebook, Twitter, eBay)

• HBase – a key-value column-family store (used at Facebook)

• MongoDB – a document store (used at Adobe)

• Hive – a data warehouse built on HDFS

YCSB Benchmarks

• With more big users, active mailing lists, and the most promising features (secondary indexes, counters), Cassandra is the best option to try out.
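One way counters could serve this workload is to increment a per-subscriber count as each CDR arrives, so that a rule check becomes a single read instead of a scan. The sketch below uses an ordinary Python dictionary as a stand-in for a counter column family. No Cassandra API is used, and the key layout and function names are assumptions.

```python
from collections import defaultdict

# Stand-in for a counter column family: (subscriber, callee, month) -> count.
call_counts = defaultdict(int)


def ingest(caller, callee, month):
    """Increment the counter for this subscriber/destination/month on arrival."""
    call_counts[(caller, callee, month)] += 1


def exceeds(caller, callee, month, threshold):
    """With counters maintained at ingest, the rule check is one lookup."""
    return call_counts.get((caller, callee, month), 0) > threshold


for _ in range(6):
    ingest("A", "PIZZA", "2012-06")
ingest("B", "PIZZA", "2012-06")

print(exceeds("A", "PIZZA", "2012-06", 5), exceeds("B", "PIZZA", "2012-06", 5))
```

The trade-off is that the counters to maintain must be known when the data is ingested, which is a consideration for rules defined dynamically through the GUI.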

Technology selection

Technologies left behind

• Complex Event Processing (CEP) engines

▫ No persistence

• Rules engine

▫ More layers, more latency

• Hadoop

• NoSQL DBs - HBase, MongoDB, Hive

Technologies selected

• NoSQL DB - Cassandra

Architecture

Project Plan

Milestone                                Target date   Status
First chapters of final report           -             Done
ERU abstracts                            -             Accepted
ERU paper                                31/07/2012    Due
Architecture                             06/06/2012    Done
Setting up the Cassandra cluster         06/06/2012    Done
GUI for rule definition                  15/06/2012    Ongoing
Bulk data load to Cassandra              15/06/2012    Ongoing
System Requirement Specification         20/06/2012    Due
Query data from database periodically    26/06/2012    Due
Initial Design Document                  27/06/2012    Due
Algorithm for pre-processing             10/07/2012    Due
Testing                                  10/07/2012    Due
Final report                             10/08/2012    Due

Risks and Possible Remedies

• NoSQL databases

▫ Risk: high performance requires more memory

▫ Remedy: use an external cluster with decent memory

• In the long run

▫ Risk: more data degrades performance

▫ Remedy: archiving

• Handling concurrency issues

▫ Risk: locking the database lowers speed

▫ Remedy: use a shadow copy

• NoSQL fails to meet the requirements

▫ Options: NewSQL - VoltDB (runs entirely in memory); CEP (needs extra measures to ensure persistence)

• Handling sudden peaks

▫ Remedy: have an auto-balancing mechanism ready
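The shadow copy remedy for the concurrency risk could look roughly like this: readers always query the live copy without locking, while a writer applies a batch to a private copy and then swaps it in. The `ShadowedTable` class is hypothetical, and the sketch relies on reference assignment being atomic in CPython.

```python
import threading


class ShadowedTable:
    """Readers use the live copy; a writer prepares a shadow copy and swaps it in."""

    def __init__(self, rows):
        self._live = dict(rows)
        self._swap_lock = threading.Lock()  # serialises writers only

    def read(self, key):
        return self._live.get(key)  # no lock is taken on the read path

    def bulk_update(self, changes):
        with self._swap_lock:
            shadow = dict(self._live)  # copy the current data
            shadow.update(changes)     # apply the whole batch to the copy
            self._live = shadow        # publish by replacing the reference


table = ShadowedTable({"A": 1})
table.bulk_update({"A": 2, "B": 5})
print(table.read("A"), table.read("B"))
```

Readers never wait for a writer and never see a half-applied batch. The price is holding two copies of the data during an update, which adds to the memory risk listed first.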

Final Deliverables

• Big Data CDR Analyzer system

• Research Paper

• Final Report

References

• http://www.slideshare.net/gvdinesh/cap-and-base-8169489

• B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud serving systems with YCSB,” in Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), 2010, pp. 143–154.

Visit us at Kanthaka

http://kanthaka.net63.net/

Thank You!