
Mark Rittman, Oracle ACE Director

NEW WORLD HADOOP ARCHITECTURES (& WHAT PROBLEMS THEY REALLY SOLVE) FOR DBAS
UKOUG DATABASE SIG MEETING

London, February 2017

•Oracle ACE Director, Independent Analyst
•Past ODTUG Exec Board Member + Oracle Scene Editor
•Author of two books on Oracle BI
•Co-founder & CTO of Rittman Mead
•15+ Years in Oracle BI, DW, ETL + now Big Data
•Host of the Drill to Detail Podcast (www.drilltodetail.com)
•Based in Brighton & work in London, UK

About The Presenter


BACK IN FEBRUARY


“Hi Mark, In things I have seen and read quite often people start with a high-level overview of a product (e.g. Hadoop, Kafka), then describe the technical concepts (using all the appropriate terminology) …”

“but I am usually left missing something. I think it’s around the area of what problems these technologies are solving and how they are doing it? Without that context I’m finding it all very academic”

“Many people say traditional systems will still be needed. Are these new technologies solving completely different problems to those handled by traditional IT? Is there an overlap?”

•Started back in 1996 on an Oracle DW project for a bank
•Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL and shell scripts
•Data warehouses provided a unified view of the business
•Single place to store key data and metrics
•Joined-up view of the business
•Aggregates and conformed dimensions
•ETL routines to load, cleanse and conform data

•BI tools for simple, guided access to information
•Tabular data access using SQL-generating tools
•Drill paths, hierarchies, facts, attributes
•Fast access to pre-computed aggregates
•Packaged BI for fast-start ERP analytics

20 Years in Old-school BI & Data Warehousing


Data Warehousing and BI at “Peak Oracle”


Oracle Data Management Platform as of Today


What Happened?


Let’s Go Back to 2003…

•Google needed to store and query its vast volume of server log files
•And wanted to do so using cheap, commodity hardware
•Google File System and MapReduce were designed together for this use

Google File System and MapReduce


•GFS optimised for the particular task at hand - computing PageRank for sites
•Streaming reads for PageRank calcs, block writes for crawler whole-site dumps
•Master node only holds metadata
•Stops client/master I/O being a bottleneck; master also acts as traffic controller for clients
•Simple design, optimised for a specific Google need
•MapReduce focused on simple computations within an abstraction framework
•Select & filter (MAP) and reduce (aggregate) functions, easy to distribute across a cluster (sketched below)
•MapReduce abstracted cluster compute, HDFS abstracted cluster storage
•Projects that inspired Apache Hadoop + HDFS

Google File System + MapReduce Key Innovations

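To make the MAP / reduce split concrete, here is a minimal single-process Python sketch of the programming model: map emits key/value pairs, the framework groups them by key (the "shuffle"), and reduce aggregates each group. Real Hadoop distributes these same two functions across a cluster; every name below is illustrative.

```python
# Minimal single-process sketch of the MapReduce pattern: map emits
# (key, value) pairs, the framework groups them by key ("shuffle"),
# and reduce aggregates each group. Hadoop distributes these same two
# functions across a cluster; this only illustrates the programming model.
from collections import defaultdict

def map_fn(line):
    """MAP: select & filter - emit (token, 1) for each token in a log line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """REDUCE: aggregate all values emitted for one key."""
    return (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:             # map phase
        for key, value in map_fn(record):
            groups[key].append(value)  # shuffle: group by key
    return [reduce_fn(k, vs) for k, vs in groups.items()]  # reduce phase

log_lines = ["GET /index.html", "GET /about.html", "POST /index.html"]
print(run_mapreduce(log_lines, map_fn, reduce_fn))
# [('get', 2), ('/index.html', 2), ('/about.html', 1), ('post', 1)]
```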

How Traditional RDBMS Data Warehousing Scaled-Up


Shared-Everything Architectures (i.e. Oracle RAC, Exadata)

Shared-Nothing Architectures (e.g. Teradata, Netezza)

Problem #1 That Hadoop / NoSQL Solved :

Scaling Affordably

“Oracle scales infinitely and is free. Period”

•Enterprise high-end RDBMSs such as Oracle can scale
•Clustering for single-instance DBs can scale to >PB
•Exadata scales further by offloading queries to storage
•Sharded databases (e.g. Netezza) can scale further
•But cost (and complexity) become limiting factors
•Costs of $1m/node are not uncommon

Cost and Complexity around Scaling DW Clusters


•A way of storing (non-relational) data cheaply, on easily expanded clusters
•Gave us a way of scaling beyond TB-size without paying $$$
•First use-cases were offline storage and an active archive of data

Hadoop’s Original Appeal to Data Warehouse Owners



Hadoop Ecosystem Expanded Beyond MapReduce


•Core Hadoop, MapReduce and HDFS
•HBase and other NoSQL Databases
•Apache Hive and SQL-on-Hadoop
•Storm, Spark and Stream Processing
•Apache YARN and Hadoop 2.0

•Solution to the problem of storing semi-structured data at-scale
•Built on Google File System
•Scale for capacity, e.g. the webtable: 100,000,000,000 pages × 10 versions per page × 20 KB/version = 20 PB of data
•Scale for throughput: hundreds of millions of users, tens of thousands to millions of queries/sec
•At low latency, with high reliability

Google BigTable, HBase and NoSQL Databases


•Optimised for a particular task - fast lookups of timestamp-versioned web data
•Data stored in a multidimensional map keyed on row, column + timestamp (sketched below)
•Master + data tablets stored on GFS cluster nodes
•Simple key/value lookup, with the client doing interpretation
•Innovation - focus on a single job with different needs from OLTP
•Formed the inspiration for Apache HBase

How BigTable Scaled Beyond Traditional RDBMSs

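A rough Python sketch of that data model - a sparse multidimensional map keyed on (row key, column, timestamp), with values left as uninterpreted bytes for the client to decode. This is the model only, not the HBase API; all names are invented.

```python
# Rough sketch of the BigTable/HBase data model: a sparse map keyed on
# (row key, column, timestamp). Values are uninterpreted bytes - the
# client decides what they mean. Not the real HBase API, just the model.
from collections import defaultdict

class TinyBigTable:
    def __init__(self):
        # {row_key: {column: {timestamp: value}}}
        self.cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self.cells[row][column][timestamp] = value

    def get(self, row, column, n_versions=1):
        """Simple key/value lookup returning the newest n versions of a cell."""
        versions = self.cells[row][column]
        return sorted(versions.items(), reverse=True)[:n_versions]

webtable = TinyBigTable()
# Row keys are reversed domains so related pages sort together
webtable.put("com.cnn.www", "contents:html", 1, b"<html>v1</html>")
webtable.put("com.cnn.www", "contents:html", 2, b"<html>v2</html>")
print(webtable.get("com.cnn.www", "contents:html"))  # [(2, b'<html>v2</html>')]
```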

•Originally developed at Facebook, now foundational within Hadoop
•SQL-like language that compiles to MapReduce, Spark, HBase
•Solved the problem of enabling non-programmers to access big data
•And made Hadoop data transformation and aggregation code more productive
•JDBC and ODBC drivers for tool integration (see the sketch below)

Hive - Hadoop Discovers Set-Based Processing

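The deck mentions JDBC and ODBC drivers; as a rough illustration from Python, here is a hedged sketch using the PyHive client instead - the host, port and web_logs table are invented, and PyHive is just one of several ways to connect. Hive compiles the set-based SQL below into MapReduce or Spark jobs behind the scenes.

```python
# Hedged sketch: querying Hive from Python with the PyHive client
# (pip install "pyhive[hive]"); host, port and table name are assumptions.
# Hive compiles this set-based SQL into MapReduce/Spark jobs for us -
# no Java coding required.
from pyhive import hive

conn = hive.Connection(host="hadoop-master", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("""
    SELECT status_code, COUNT(*) AS hits
    FROM   web_logs                  -- hypothetical table
    GROUP  BY status_code
    ORDER  BY hits DESC
""")
for status_code, hits in cursor.fetchall():
    print(status_code, hits)
```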

•Hive is extensible to help with accessing and integrating new data sets
•SerDes : Serializer-Deserializers that interpret semi-structured sources (example below)
•UDFs + Hive Streaming : User-defined functions and streaming input
•File Formats : make use of compressed and/or optimised file storage
•Storage Handlers : use storage other than HDFS (e.g. MongoDB)

Apache Hive as SQL Access Engine For Everything

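As one example of a SerDe in action, a hedged sketch that projects a schema over raw JSON files at read time, leaving the files untouched in HDFS. The JsonSerDe class is the one shipped with Hive's HCatalog; the table name, columns and HDFS path are invented, and the connection details repeat the assumptions above.

```python
# Sketch: an external Hive table over raw JSON files, using a SerDe so
# the schema is applied at read time and the files stay as-is in HDFS.
# Table name, columns and path are invented for illustration.
from pyhive import hive

cursor = hive.Connection(host="hadoop-master", port=10000).cursor()
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS chat_transcripts (
        customer_id STRING,
        started_at  STRING,
        messages    ARRAY<STRING>
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION '/data/raw/chat_transcripts'
""")
```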

•Hadoop as low-cost ETL pre-processing engine - “ETL-offload”
•NoSQL database for landing real-time data at high speed/low latency
•Incoming data then aggregated and stored in the RDBMS DW

Common Hadoop/NoSQL Use-Case


[Diagram: Hadoop - online, scalable, flexible, cost-effective - lands and pre-processes data, which is then aggregated (Σ) into the Data Warehouse and Marts feeding Business Intelligence tools]

Jump Ahead to 2012…

•Driven by pace of business, and user demands for more agility and control
•Traditional IT-governed data loading not always appropriate
•Not all data needed to be modelled right away
•Not all data suited to storage in tabular form
•New ways of analyzing data beyond SQL:
•Graph analysis
•Machine learning

Data Warehousing and ETL Needed Some Agility


Problem #2 That Hadoop / NoSQL Solved :

Making Data Warehousing Agile

•Storing data in the format it arrived in, and then applying a schema at query time (sketched below)
•Suits data that may be analysed in different ways by different tools
•In addition, some datatypes may have schema embedded in the file format
•Key benefit - fast-arriving data of unknown value can get to users earlier
•Made possible by HDFS storage and tools such as Apache Hive + SerDes, Apache Drill, and self-describing file formats

Advent of Schema-on-Read, and Data Lakes

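A toy illustration of schema-on-read in plain Python: the raw JSON lines land untouched, and each consumer projects only the fields it needs at query time. The file contents and field names are invented.

```python
# Toy schema-on-read: raw events land as JSON lines, untouched. Each
# consumer applies its own "schema" (field projection) at read time,
# so fast-arriving data of unknown value is queryable immediately.
import json

raw_events = [
    '{"user": "alice", "page": "/home", "ms": 13}',
    '{"user": "bob",   "page": "/pricing", "ms": 7, "referrer": "google"}',
]  # note: no fixed schema - records can carry extra fields

def read_with_schema(lines, fields):
    """Project just the fields this analysis needs, at query time."""
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(f) for f in fields)

# Two consumers, two schemas, same raw data:
print(list(read_with_schema(raw_events, ["user", "page"])))
print(list(read_with_schema(raw_events, ["page", "referrer"])))
```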

•Data now landed in Hadoop clusters, NoSQL databases and Cloud Storage
•A flexible data platform combining cheap storage, flexible schema support + compute
•Solves the problem of how to store new types of data + choose the best time/way to process it
•Hadoop/NoSQL increasingly used for all store/transform/query tasks

Meet the New Data Warehouse : The “Data Lake”


[Diagram: Data lake reference architecture on the Hadoop platform. Data transfer: file-based, stream-based and ETL-based integration bring in data streams, operational data (transactions, customer master data) and unstructured data (voice + chat transcripts). Data reservoir: raw customer data stored in the original format (usually files, e.g. SS7, ASN.1, JSON) alongside mapped customer data - datasets produced by mapping and transforming the raw data. Data factory plus discovery & development labs: a safe and secure environment holding datasets, samples, models and programs. Data access: business intelligence tools, with machine-learning models and segments feeding marketing/sales applications]

Hadoop 2.0 and YARN (“Yet Another Resource Negotiator”)

Key Innovation : Separating how data is stored from how it is processed

•Hadoop started by being synonymous with MapReduce, and Java coding
•But YARN (Yet Another Resource Negotiator) broke this dependency
•Hadoop now just handles resource management
•Multiple different query engines can run against data in-place:
•General-purpose (e.g. MapReduce)
•Graph processing
•Machine Learning
•Real-Time Processing

Hadoop 2.0 - Enabling Multiple Query Engines


Technologies Emerged to Bridge Old/New World


FAST FORWARD TO NOW…


•New generation of big data platform services from Google, Amazon, Oracle
•Combines three key innovations from earlier technologies:
•Organisation of data into tables and columns (from RDBMS DWs)
•Massively-scalable and distributed storage and query (from Big Data)
•Elastically-scalable Platform-as-a-Service (from Cloud)

Elastically-Scalable Data Warehouse-as-a-Service


… Which Is What I’m Working On Right Now


Example Architecture : Google BigQuery



•On-premise Hadoop, even with simple resilient clustering, will hit limits
•Clusters can reach 5,000+ nodes, and need to scale up for demand peaks etc.
•Scale limits are encountered way beyond those for DWs…
•…but the future is elastically-scaled query- and compute-as-a-service

What Problem Did Analytics-as-a-Service Solve?


Oracle Big Data Cloud Compute Edition - Free $300 developer credit at: https://cloud.oracle.com/en_US/tryit

•And things come full-circle … analytics typically requires tabular data
•Google BigQuery is based on the Dremel massively-parallel query engine
•But stores data in columnar format and provides a SQL interface (sketched below)
•Solves the problem of providing DW-like functionality at scale, as-a-service
•This is the future … ;-)

BigQuery : Big Data Meets Data Warehousing

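As a rough illustration of that SQL interface, a short sketch using Google's official Python client for BigQuery against a public sample dataset; it assumes GCP credentials and a default project are already configured, and is not taken from the deck itself.

```python
# Sketch: SQL-on-demand with Google BigQuery via the official Python
# client (pip install google-cloud-bigquery). Assumes GCP credentials
# and a default project are already configured in the environment.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():  # runs on Google's Dremel-based engine
    print(row.word, row.total)
```

No cluster to size or manage: the same query scales elastically because storage and compute are provisioned as a service.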
