Kanthaka - High Volume CDR Analyzer
DESCRIPTION
'Kanthaka' is an attempt to bring the benefits of Big Data technologies to the telecom industry. The objective of the system is to analyze CDRs (Call Detail Records) and give results in near real time. It is being carried out as a final-year project for my B.Sc. of Engineering (Hons) degree at the University of Moratuwa, in a team with three colleagues, under the supervision of a senior lecturer and an industry expert. The presentation covers the background, the findings of the literature review and the proposed architecture of the system as it currently stands. Any feedback on improvements that can be made is warmly welcome!
TRANSCRIPT
Big Data CDR Analyzer
080201N – M.K.P.R. Jayawardhana
080254D – P.K.A.M. Kumara
080331L – W.D.A.I. Paranawithana
080357V – T.D.K. Perera
Project Supervisors: Mr. Thilina Anjitha (hSenid), Dr. Shahani Markus Weerawarana
Overview
• Background
• Current Situation
• Scope and Assumptions
• Kanthaka – Big Data CDR Analyzer System
• Technology Comparison
▫ MapReduce
▫ NoSQL Databases
• Architecture
• Project Plan
• Risks and Possible Remedies
• References
Background
Mobile Promotions
Current Situation
• Promotions are based only on subscribers' network usage
• Only the active call switch is used to trigger promotions
• No way of analyzing and processing high volumes of CDRs
• No efficient method for analyzing CDRs
• No access to historical data
• Complex rules not supported
Kanthaka to the rescue
• Selecting eligible users for both commercial-organization-based and network-usage-based promotions.
E.g. giving a 20% discount to pizza lovers aged 16-40 who have called Pizza Hut more than 5 times a month.
• High volume CDR analysis.
• Near real time selection of eligible users for promotions.
• CDR Analyzer system which
▫ can process 30 million records per day
▫ can produce results within 10-15 seconds
▫ provides a GUI to define dynamic rules
▫ can be used to offer real-time sales promotions for mobile subscribers
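The pizza example above combines a profile condition (age) with a usage condition (call count). A minimal sketch of how such a rule could be evaluated, assuming hypothetical record keys (`caller`, `callee`, `event_type`), a hypothetical `ages` lookup for subscriber profiles, and a made-up destination number; none of these names come from the project itself.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical promotion rule: an age range plus a minimum call count."""
    destination: str  # number that must have been called
    min_calls: int    # eligible when calls to destination exceed this value
    min_age: int
    max_age: int


def eligible_subscribers(cdrs, ages, rule):
    """Return the subscribers who satisfy the rule for one month of CDRs.

    cdrs: iterable of dicts with assumed keys 'caller', 'callee', 'event_type'
    ages: dict mapping subscriber number -> age (assumed external profile data)
    """
    calls = Counter(
        r["caller"]
        for r in cdrs
        if r["event_type"] == "VOICE" and r["callee"] == rule.destination
    )
    return sorted(
        s for s, n in calls.items()
        if n > rule.min_calls and rule.min_age <= ages.get(s, -1) <= rule.max_age
    )


PIZZA = "0112729729"  # made-up number standing in for the pizza outlet
rule = Rule(destination=PIZZA, min_calls=5, min_age=16, max_age=40)
month = (
    [{"caller": "0771000001", "callee": PIZZA, "event_type": "VOICE"}] * 6
    + [{"caller": "0771000002", "callee": PIZZA, "event_type": "VOICE"}] * 3
    + [{"caller": "0771000003", "callee": PIZZA, "event_type": "VOICE"}] * 7
)
ages = {"0771000001": 25, "0771000002": 30, "0771000003": 55}
print(eligible_subscribers(month, ages, rule))
```

Only the first subscriber qualifies here: the second has too few calls and the third is outside the age range.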
Scope and Assumptions
Scope
Real system operation: 30 M records --> multiple rules --> offer promotion
Operation expected by Kanthaka: 30 M records --> single rule --> select eligible subscribers for promotion only
Assumptions
• CDRs are supplied only in CSV format.
• Event types include SMS, voice call, MMS, USSD, top-up, GPRS and LBS.
• CDRs can arrive at the system asynchronously, in batches.
• Only 6 of the many attributes in a CDR are considered during processing.
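The assumptions above can be sketched as a small batch parser. The slides do not name the six attributes, so the column names in `KEPT`, the event-type spellings and the sample rows below are illustrative assumptions only.

```python
import csv
import io

# Assumed attribute names; the slides only say that 6 attributes are kept.
KEPT = ("timestamp", "caller", "callee", "event_type", "duration", "charge")
# Assumed spellings for the event types listed in the assumptions.
EVENT_TYPES = {"SMS", "VOICE", "MMS", "USSD", "TOPUP", "GPRS", "LBS"}


def parse_batch(csv_text):
    """Parse one CSV batch, keep the six attributes, skip unknown event types."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        if row.get("event_type") not in EVENT_TYPES:
            continue
        records.append({name: row[name] for name in KEPT})
    return records


batch = (
    "timestamp,caller,callee,event_type,duration,charge,imei,cell_id\n"
    "2012-06-01T10:00:00,0771000001,0112729729,VOICE,62,3.50,3567,C12\n"
    "2012-06-01T10:00:05,0771000002,0771000001,SMS,0,0.20,3568,C07\n"
    "2012-06-01T10:00:09,0771000003,0771000001,FAX,0,0.00,3569,C01\n"
)
records = parse_batch(batch)
print(len(records), sorted(records[0]))
```

Dropping the unused columns at parse time keeps each record small before it is loaded into the database.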
Technology Comparison
Lots of data + higher speed
--> Scale-out system
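To put the volume in perspective, the target of 30 million records per day can be converted into an average arrival rate. This is simple arithmetic on the stated figure; real traffic is bursty, so peak rates would be higher than this average.

```python
RECORDS_PER_DAY = 30_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Average number of CDRs arriving each second if the load were perfectly even.
avg_rate = RECORDS_PER_DAY / SECONDS_PER_DAY
print(f"{avg_rate:.0f} records/sec on average")
```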
MapReduce
Hadoop MapReduce
• Can handle a lot of data
• Latency is too high for cases where results are expected in near real time
Word count on a 100 KB file: start time = 01:04:44, end time = 01:05:12, total time = 28 sec
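The benchmark job above is the classic word count. Its map and reduce phases can be sketched in plain Python as below; this only illustrates the programming model and is not Hadoop's API. The function names and sample lines are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(pairs):
    """Sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


lines = ["big data cdr analyzer", "big data in telecom"]
counts = reduce_phase(chain.from_iterable(map_phase(line) for line in lines))
print(counts["big"], counts["telecom"])
```

In-process, this finishes almost instantly on such a small input; the 28 seconds measured above therefore reflect mostly the fixed overhead of running a Hadoop job, which is why it was judged unsuitable for near-real-time results.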
DB Technology Comparison
• RDBMS
▫ Provides ACID properties
▫ Uses sharding to scale out
▫ Management overhead is huge when scaling
▫ Performance degrades with higher data load
▫ Less partition tolerant
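The sharding overhead mentioned above can be illustrated with a minimal hash-based routing sketch. The `shard_for` function and the subscriber numbers are hypothetical; the point is only that, with naive modulo routing, adding a shard changes the placement of many keys, and that data then has to be moved.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard using a stable hash (SHA-256 only spreads the keys)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


keys = [f"077{n:07d}" for n in range(1000)]  # made-up subscriber numbers
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}

# Keys whose shard changes must be migrated when the fifth shard is added.
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys change shard when going from 4 to 5 shards")
```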
DB Technology Comparison Ctd.
• NoSQL
▫ Lots of available options (Cassandra, HBase, MongoDB, Hive)
▫ Promises easy scaling (many big users, e.g. Facebook, Twitter)
▫ Provides BASE properties under the CAP theorem
▫ Hard to fit the system into the limited data model
▫ Partition tolerant
▫ More memory --> Higher performance
DB Technology Comparison Ctd.
• NewSQL
▫ Provides ACID properties
▫ Familiar relational data model
▫ Options available (ScaleDB, VoltDB)
▫ Runs entirely in memory, hence needs a lot of memory
▫ Promises speed
▫ Persistence is achieved by replaying logs
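The idea of achieving persistence by replaying logs can be sketched with a toy in-memory store. This is an illustration of the general technique only, not how any particular NewSQL product implements it; the class name and command format are assumptions, and a Python list stands in for a log that would normally live on disk.

```python
class InMemoryStore:
    """Toy key-value store whose state can be rebuilt from a command log."""

    def __init__(self, log=None):
        self.data = {}
        self.log = []
        # Recovery: re-apply every logged command in its original order.
        for command in log or []:
            self.apply(command)

    def apply(self, command):
        """Apply one (op, key, value) command and record it in the log."""
        op, key, value = command
        if op == "put":
            self.data[key] = value
        elif op == "delete":
            self.data.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
        self.log.append(command)


store = InMemoryStore()
store.apply(("put", "0771000001", 6))
store.apply(("put", "0771000002", 3))
store.apply(("delete", "0771000002", None))

# Simulate a restart: the in-memory data is lost, only the log survives.
recovered = InMemoryStore(log=list(store.log))
print(recovered.data == store.data)
```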
Given its persistence, less restrictive hardware requirements and proven performance, NoSQL is the best option to try out.
• Cassandra – a key-value column-family store (used at Facebook, Twitter, eBay)
• HBase – a key-value column-family store (used at Facebook)
• MongoDB – a document store (used at Adobe)
• Hive – an HDFS-based database
YCSB Benchmarks
• With more big users, active mailing lists and the most promising features (secondary indexes, counters), Cassandra is the best option to try out.
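To show why counters and index lookups are attractive for this workload, here is a toy column-family model in plain Python. It only mirrors the idea: it is not the Cassandra API, and Cassandra's own counter columns and secondary indexes work differently in detail. The class, the row-key layout (`subscriber:month`) and the column names are all assumptions.

```python
from collections import defaultdict


class ToyColumnFamily:
    """Rows of named counter columns, plus a lookup index by column name."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(int))  # row key -> column -> count
        self.index = defaultdict(set)                      # column -> row keys using it

    def increment(self, row_key, column, by=1):
        """Increase a counter column and remember which row holds it."""
        self.rows[row_key][column] += by
        self.index[column].add(row_key)

    def rows_with(self, column, more_than):
        """Return row keys whose counter for `column` exceeds `more_than`."""
        return sorted(
            key for key in self.index[column]
            if self.rows[key][column] > more_than
        )


cf = ToyColumnFamily()
for _ in range(6):
    cf.increment("0771000001:2012-06", "calls:0112729729")
for _ in range(3):
    cf.increment("0771000002:2012-06", "calls:0112729729")
print(cf.rows_with("calls:0112729729", more_than=5))
```

Keeping a running counter per subscriber means a rule check becomes a lookup rather than a scan over the raw records.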
Technology selection
Technologies left behind
• Complex Event Processing (CEP) engines
▫ No persistence
• Rules engine
▫ More layers, more latency
• Hadoop
• NoSQL DB - HBase, MongoDB, Hive
Technologies selected
• NoSQL DB - Cassandra
Architecture
Project Plan
Milestone | Target date | Status
First chapters of final report | - | Done
ERU abstracts | - | Accepted
ERU paper | 31/07/2012 | Due
Architecture | 06/06/2012 | Done
Setting up the Cassandra cluster | 06/06/2012 | Done
GUI for rule definition | 15/06/2012 | Ongoing
Bulk data load to Cassandra | 15/06/2012 | Ongoing
System Requirement Specification | 20/06/2012 | Due
Query data from database periodically | 26/06/2012 | Due
Initial Design Document | 27/06/2012 | Due
Algorithm for pre-processing | 10/07/2012 | Due
Testing | 10/07/2012 | Due
Final report | 10/08/2012 | Due
Risks and Possible Remedies
• NoSQL databases
▫ Risk: high performance requires more memory
▫ Remedy: use an external cluster with decent memory
• In the long run
▫ Risk: performance degrades as data grows
▫ Remedy: archiving
• Handling concurrency issues
▫ Risk: locking the database lowers speed
▫ Remedy: use a shadow copy
• NoSQL fails to meet the requirements
▫ Option: NewSQL - VoltDB (runs entirely in memory)
▫ Option: CEP (needs extra measures to ensure persistence)
• Handling sudden peaks
▫ Remedy: have an auto-balancing mechanism ready
Final Deliverables
• Big Data CDR Analyzer system
• Research Paper
• Final Report
References
• http://www.slideshare.net/gvdinesh/cap-and-base-8169489
• B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, “Benchmarking cloud serving systems with YCSB,” 2010, pp. 143–154.
Visit us at Kanthaka
Thank You!