Solving Performance Problems on Hadoop
TRANSCRIPT
Solving performance problems on Hadoop
Moving analytic workloads into production
Tyler Mitchell, Sr. Software Engineer, Actian Center of Excellence
Actian’s Lineage
Ingres – 1970s
Pervasive – 1982
Versant – 1988
Vectorwise – 2003
ParAccel – 2006
Actian
Actian at a Glance
400+ Employees; 10,000+ Customers
8 Countries; 7 US Cities; HQ Palo Alto
3 Businesses: Data Management, Data Integration, Big Data Analytics
Key industries: Banking, Insurance, Telecom and Media
Accidental Hadoop Tourist – Brief History
[Diagram: Data-to-Business pipeline: Data Capture → Data Management & Integration → Analytics (Query & Analyze) → Solutions. Problem solved.]
Accidental Hadoop Tourist – Brief History
[Diagram: the same Data-to-Business pipeline, with the Analytics and Solutions stages replaced by question marks]
Accidental Hadoop Tourist – Brief History
[Diagram: the same Data-to-Business pipeline, with question marks after both the Analytics and Solutions stages]
Modern, best-in-class analytic database technology provides:
Measurable business impact: monetize Big Data to grow revenue, reduce cost, mitigate risk, and enable new business
• The ability to make data-driven business decisions using a massively scalable platform
• A decisive reduction in the cost of high-performance analytics at scale
• Performance that can meet all SLAs
• Full leverage of existing SQL skills while deploying a modern analytic infrastructure
Grow Revenue
Reduce Cost
Mitigate Risk
Create New Business
Business Solution Architecture Challenges
Wide Range of Use Cases
• Financial Services: Advanced credit risk analytics across billions of data points
• Internet-Scale Application: Predictive analytics across hundreds of millions of customers
• Media: Data science and discovery across trillions of IoT events
• Dept of Defense: Cyber-security network intrusion models every second
• Credit Card Processing: Fraud detection every millisecond
3 Essential Big Data Concepts
0. Take nothing for granted
1. Partitioning vs. data skew
2. Data types matter
3. Maximize memory / minimize bottlenecks
4. Take nothing for granted
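Concept 1, partitioning vs. data skew, is easy to see in a toy sketch. The key distribution below is hypothetical and not from the talk; it simply shows how hash partitioning piles rows onto one worker when a single key dominates:

```python
from collections import Counter

def partition_counts(keys, n_parts):
    """Count how many rows each hash partition would receive (illustrative only)."""
    counts = Counter()
    for k in keys:
        counts[hash(k) % n_parts] += 1
    return counts

# Skewed workload: one "hot" key accounts for 90% of the rows.
keys = ["hot_customer"] * 9000 + [f"cust_{i}" for i in range(1000)]
counts = partition_counts(keys, 4)

# The partition holding the hot key receives the vast majority of rows,
# so one worker does almost all the work while the others sit idle.
print(sorted(counts.values(), reverse=True))
```

A more uniform partitioning key (or salting the hot key) spreads the same rows evenly across workers.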
6 Game Changing Database Innovations
1. Use the CPU! – Vector processing
2. Minimize bottlenecks – Exploiting chip cache
3. Got columnar?
4. Smarter compression
5. Smarter indexing
6. Multi-core matters
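Innovation 1, vector processing, can be illustrated outside any database. This NumPy sketch assumes nothing about VectorH's internals; it only shows that operating on whole columns in tight loops beats interpreting one row at a time:

```python
import time
import numpy as np

n = 1_000_000
prices = np.random.rand(n)          # one column of a hypothetical fact table
qty = np.random.randint(1, 10, n)   # another column

# Tuple-at-a-time style: interpret one row per iteration.
t0 = time.perf_counter()
total_scalar = 0.0
for i in range(n):
    total_scalar += prices[i] * qty[i]
t_scalar = time.perf_counter() - t0

# Vectorized style: one call streams both columns through cache-friendly loops.
t0 = time.perf_counter()
total_vector = float(np.dot(prices, qty))
t_vector = time.perf_counter() - t0

print(f"scalar: {t_scalar:.3f}s  vectorized: {t_vector:.3f}s")
```

The same arithmetic runs orders of magnitude faster when the inner loop works on a column at a time instead of a value at a time.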
Customer 360: Understanding Experience, Driving Revenue
Telecom Challenge
A vast and growing repository of proprietary click data, customer records, service call records, smartphone and device data, GPS location, webserver, telephone, and network usage.
Queries took minutes or hours, and sometimes never returned at all.
Critical business analysis on a consolidated customer 360 data lake was grinding to a halt. The ability to gain deeper market insights, visualization, and the desired data management and operational optimization was at risk.
Customer 360: Initial Architecture
Development System
• 300+ node cluster
• Hive access
• SQL-based BI / data science
• Pre-processed, as performance was unacceptable
• Views taking days to return snapshot views
Customer 360: Technical Improvements
Production Prototype
• 30-node cluster (10% of the Hive cluster)
• Actian Vector on Hadoop solution
• SQL-based BI / data science
• No materialized view building required
• Join on demand faster than aggregate tables in Hive
• Reduced storage requirements
• 91 TB: two years of data, 1,100 columns when joined
Customer 360: Understanding Experience, Driving Revenue
Results
• Customer 360 across prior data silos
• Leveraged for customer retention strategies
• Predict and take proactive, tailored responses
• Enables next-gen data-driven troubleshooting, impact analysis, and root cause analysis
Impact
• Accelerated operations intelligence
• Improved customer experience
• Reduced customer churn
Financial Risk: Upgrading Legacy to Meet SLA
Challenge
A legacy single-purpose risk application took 3 hours to generate the end-of-day risk report and failed to meet changing SLAs for reporting risk.
In deciding to replace the risk application, the bank opted to build a multi-purpose risk application addressing multiple business requirements.
Financial Risk: Upgrading Legacy to Meet SLA
Legacy System
• Single-server architecture: MS SSAS, Oracle, ~30 applications
• Pre-processing of desired measures exploding data volumes
• Cube and analysis engines being maxed out as they exceed the 1.5 TB range
• Unable to scale to the desired range of > 200 GB/day of new data
• Impala attempt failed
• Highly invested in apps built on Analysis Services
Financial Risk: Upgrading Legacy to Meet SLA
New Possibilities
• Clustered solution: Hadoop, 5- and 10-node
• No pre-processing cubes; SSAS partly kept
• Tested solutions from 1 TB to 20 TB at a time
• Produced interactive queries across large datasets
• Focused query results in 2 s or less
• Processing all data in the database: 6 s – 80 s
• 2x nodes ~ 200% speed improvement
Financial Risk: Upgrading Legacy to Meet SLA
Results
• Increased data analyzed by 100x: 2–200B rows / 1–20 TB
• Report run in 28 seconds vs. 3 hours
• Use of application for:
  • Intra-day reporting (surveillance)
  • End-of-day reporting (compliance)
  • Overnight float investment options
  • Annual CCAR analysis
[Chart: Actual vs. Goal]
Technical Benchmarks: VectorH - SQL on Hadoop
TPC-H SF1000 *
VectorH vs. other platforms: faster by how much?
Tuned platforms, identical hardware **
* Not an official TPC result ** 10 nodes, each 2 x Intel 3.0GHz E5-2690v2 CPUs, 256GB RAM, 24x600GB HDD, 10Gb Ethernet, Hadoop 2.6.0
Actian VectorH Delivers More Efficient File Format
Better compression & functionality
Vector advantages:
• skip blocks via MinMax indexes
• sophisticated query processing
• efficient block format, esp. 64-bit int
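A rough sketch of how MinMax indexes let a scan skip blocks (the block layout here is illustrative, not VectorH's actual format): each block records the min and max of a column, so a range predicate can discard whole blocks without reading them.

```python
# Each block stores per-column min/max alongside its rows (illustrative layout).
blocks = [
    {"min": 1,   "max": 100, "rows": list(range(1, 101))},
    {"min": 101, "max": 200, "rows": list(range(101, 201))},
    {"min": 201, "max": 300, "rows": list(range(201, 301))},
]

def scan_range(blocks, lo, hi):
    """Return rows in [lo, hi], counting how many blocks were actually read."""
    blocks_read, out = 0, []
    for b in blocks:
        if b["max"] < lo or b["min"] > hi:
            continue  # MinMax proves no row in this block can match: skip it
        blocks_read += 1
        out.extend(r for r in b["rows"] if lo <= r <= hi)
    return out, blocks_read

rows, blocks_read = scan_range(blocks, 150, 160)
print(len(rows), blocks_read)  # 11 matching rows, only 1 of 3 blocks read
```

The narrower the per-block value ranges (e.g. data loaded in sorted order), the more blocks a selective predicate can skip.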
Summary
Conscientious data handling and next-gen engineering take SQL on Hadoop to new levels.
All Hadoop users can move from development into production while delivering compelling business results.
Delivering the Results With Better Engineering
VectorH v5 – Spark integration, external table support, and more
Thank you!
[email protected] - @1tylermitchell
Blogs at Actian.com - MakeDataUseful.com
Visit us in booth 503