
Solving performance problems on Hadoop
Moving analytic workloads into production

Tyler Mitchell, Sr. Software Engineer, Actian Center of Excellence

Topics
• How we got (stuck) here
• Performance best practices
• Sample business cases
• Benchmarking results


Actian’s Lineage

• Ingres – 1970s
• Pervasive – 1982
• Versant – 1988
• Vectorwise – 2003
• ParAccel – 2006
Actian

Actian at a Glance

• 10,000+ customers
• 400+ employees
• 8 countries; 7 US cities; HQ Palo Alto
• 3 businesses: Data Management, Data Integration, Big Data Analytics
• Banking, insurance, telecom and media


How We Got (Stuck) Here

Accidental Hadoop Tourist – Brief History

[Diagram: Business/Data pipeline – Data Capture → Data Management & Integration → Analytics (Query & Analyze) → Solutions → Problem Solved]

Accidental Hadoop Tourist – Brief History

[Diagram: the same pipeline, but with question marks in place of the final outcome]

Accidental Hadoop Tourist – Brief History

[Diagram: the same pipeline, with question marks following both the Analytics and Solutions stages]

Modern, best-in-class analytic database technology provides:

Measurable business impact: monetize Big Data to grow revenue, reduce cost, mitigate risk, and enable new business.
• The ability to make data-driven business decisions using a massively scalable platform
• Decisive reduction in the cost of high-performance analytics at scale
• Performance that can meet all SLAs
• Full leverage of existing SQL skills while deploying a modern analytic infrastructure

Grow Revenue | Reduce Cost | Mitigate Risk | Create New Business

Business Solution Architecture Challenges

Wide Range of Use Cases
• Financial Services: Advanced credit risk analytics across billions of data points
• Internet-Scale Application: Predictive analytics across hundreds of millions of customers
• Media: Data science and discovery across trillions of IoT events
• Dept of Defense: Cyber-security – network intrusion models every second
• Credit Card Processing: Fraud detection every millisecond

Performance Best Practices

3 Essential Big Data Concepts

0. Take nothing for granted
1. Partitioning vs. data skew
2. Data types matter
3. Maximize memory / minimize bottlenecks
4. Take nothing for granted
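To make points 1–3 concrete, here is a minimal Python sketch (hypothetical column names and toy data, not Actian code): it checks a candidate partition key for skew before loading, and compares the footprint of an integer key against its string form.

```python
# Minimal, illustrative sketch only -- hypothetical column names and toy data.
from collections import Counter
import sys

rows = [
    {"customer_id": 42, "region": "EMEA"},
    {"customer_id": 42, "region": "EMEA"},
    {"customer_id": 7,  "region": "APAC"},
    {"customer_id": 42, "region": "EMEA"},
]

# 1. Partitioning vs. data skew: if one key value dominates, the node that owns
#    that partition does most of the work while the other nodes sit idle.
counts = Counter(r["customer_id"] for r in rows)
top_share = counts.most_common(1)[0][1] / len(rows)
print(f"largest partition would hold {top_share:.0%} of the rows")

# 2. Data types matter: a fixed-width integer key is smaller and cheaper to
#    hash, compare, and compress than the equivalent string.
print(sys.getsizeof(1234567), "bytes as int vs",
      sys.getsizeof("1234567"), "bytes as str")
```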

6 Game Changing Database Innovations

1. Use the CPU! – Vector processing
2. Minimize bottlenecks – Exploiting chip cache
3. Got columnar?
4. Smarter compression
5. Smarter indexing
6. Multi-core matters
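As a rough illustration of items 1–3 (vectorized execution over contiguous columns instead of row-at-a-time interpretation), here is a small Python/NumPy sketch; it is a conceptual analogy, not VectorH internals.

```python
# Conceptual analogy only: column-at-a-time (vectorized) work vs. row-at-a-time.
import time
import numpy as np

n = 2_000_000
price = np.random.rand(n)                  # one column, stored contiguously
qty = np.random.randint(1, 10, size=n)     # another column

# Row-at-a-time: one interpreted operation per row, poor cache and CPU usage.
t0 = time.perf_counter()
total_rows = 0.0
for i in range(n):
    total_rows += price[i] * qty[i]
t1 = time.perf_counter()

# Vectorized: one tight loop over whole column blocks (cache- and SIMD-friendly).
total_vec = float(np.dot(price, qty))
t2 = time.perf_counter()

print(f"row-at-a-time: {t1 - t0:.2f}s   vectorized: {t2 - t1:.4f}s")
print("answers agree:", np.isclose(total_rows, total_vec))
```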


Actian VectorH Innovations


Big Data Business Use Cases

Customer 360: Understanding Experience, Driving Revenue

Telecom Challenge
A vast and growing repository of proprietary click data, customer records, service call records, smart phone and device data, GPS location, webserver, telephone, and network usage data. Queries took minutes or hours, and sometimes never returned at all. Critical business analysis on a consolidated customer 360 data lake was grinding to a halt, and the ability to gain deeper market insights, visualization, and the desired data management and operational optimization was at risk.

Customer 360: Initial Architecture

Development System
• 300+ node cluster
• HIVE access
• SQL-based BI / data science
• Pre-processed, as performance was unacceptable
• Views taking days to return snapshot views

Customer 360: Technical Improvements

Production Prototype
• 30 node cluster (10% of the Hive cluster)
• Actian Vector on Hadoop solution
• SQL-based BI / data science
• No materialized view building required
• Join on demand faster than aggregate tables in Hive
• Reduced storage requirements
• 91 TB – two years of data, 1,100 columns when joined
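A toy pandas sketch (hypothetical tables, not the customer's schema) of the trade-off behind "no materialized views required": if joins over the raw tables come back fast enough, the pre-built aggregate and its refresh pipeline can be dropped.

```python
# Toy example with hypothetical tables -- illustrates the trade-off, nothing more.
import pandas as pd

calls = pd.DataFrame({"cust_id": [1, 1, 2], "minutes": [12, 7, 30]})
custs = pd.DataFrame({"cust_id": [1, 2], "segment": ["gold", "silver"]})

# Pre-aggregation approach: build (and keep refreshing) a summary table.
summary = calls.groupby("cust_id", as_index=False)["minutes"].sum()

# Join-on-demand approach: answer the question straight from the raw tables.
answer = (calls.merge(custs, on="cust_id")
               .groupby("segment", as_index=False)["minutes"].sum())
print(answer)
```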

Customer 360: Understanding Experience, Driving Revenue

Results
• Customer 360 across prior data silos
• Leveraged for customer retention strategies
• Predict and take proactive, tailored responses
• Enables next-gen, data-driven troubleshooting, impact analysis and root cause analysis

Impact
• Accelerated operations intelligence
• Improved customer experience
• Reduced customer churn

Financial Risk: Upgrading Legacy to Meet SLA

Challenge
The legacy single-purpose risk application took 3 hours to generate the end-of-day risk report and failed to meet changing SLAs for reporting risk. In deciding to replace the risk application, the bank opted to build a multi-purpose risk application addressing multiple business requirements.

Financial Risk: Upgrading Legacy to Meet SLA

Legacy System
• Single-server architecture, MS SSAS, Oracle – ~30 applications
• Pre-processing of desired measures exploding data volumes
• Cube and analysis engines maxed out as they exceed the 1.5 TB range
• Unable to scale to the desired rate of >200 GB/day of new data
• Impala attempt failed
• Highly invested in apps built on Analysis Services

Financial Risk: Upgrading Legacy to Meet SLA

New Possibilities
• Clustered solution – Hadoop, 5 and 10 nodes
• No pre-processing of cubes; SSAS partly kept
• Tested solutions 1 TB -> 20 TB at a time
• Produced interactive queries across large datasets
• Focused query results in 2 s or less
• Processing all data in the database: 6 s – 80 s
• 2x nodes ~ 200% speed improvement

Financial Risk: Upgrading Legacy to Meet SLA

Results
• Increased data analyzed by 100X: 2–200B rows / 1–20 TB
• Report run in 28 seconds vs. 3 hours
• Use of application for:
  • Intra-day reporting (surveillance)
  • End-of-day reporting (compliance)
  • Overnight float investment options
  • Annual CCAR analysis


Delivering the Results With Better Engineering


Technical Benchmarks

Technical Benchmarks – Single Machine

Technical Benchmarks: VectorH - SQL on Hadoop

TPC-H SF1000 *
• VectorH vs. other platforms – faster by how much?
• Tuned platforms
• Identical hardware **

* Not an official TPC result
** 10 nodes, each with 2 x Intel 3.0 GHz E5-2690v2 CPUs, 256 GB RAM, 24 x 600 GB HDD, 10 Gb Ethernet, Hadoop 2.6.0

Actian VectorH Delivers More Efficient File Format

Better compression & functionality

Vector advantages:
• Skip blocks via MinMax indexes
• Sophisticated query processing
• Efficient block format, especially for 64-bit integers
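A minimal Python sketch of the MinMax idea (conceptual only, not the actual VectorH block format): each storage block records the min and max of a column, so a selective predicate can skip whole blocks without reading them.

```python
# Conceptual sketch of MinMax block skipping -- not the actual VectorH format.
blocks = [
    {"min": 0,    "max": 999,  "values": list(range(0, 1000))},
    {"min": 1000, "max": 1999, "values": list(range(1000, 2000))},
    {"min": 2000, "max": 2999, "values": list(range(2000, 3000))},
]

def scan(blocks, lo, hi):
    """Return matching values, reading only blocks whose [min, max] overlaps [lo, hi]."""
    hits, blocks_read = [], 0
    for b in blocks:
        if b["max"] < lo or b["min"] > hi:
            continue                      # skip the whole block without reading it
        blocks_read += 1
        hits.extend(v for v in b["values"] if lo <= v <= hi)
    return hits, blocks_read

hits, blocks_read = scan(blocks, 1500, 1600)
print(f"read {blocks_read} of {len(blocks)} blocks, found {len(hits)} matching rows")
```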


Summary

Conscientious data handling and next-gen engineering take SQL in Hadoop to new levels.

All Hadoop users can move from development into production while delivering compelling business results.


Delivering the Results With Better Engineering

VectorH v5 – Spark integration, external table support, and more


SIGMOD 2016 Paper

Thank you!
[email protected] – @1tylermitchell
Blogs at Actian.com and MakeDataUseful.com

Visit us in booth 503