New World Hadoop Architectures (& What Problems They Really Solve) for Oracle DBAs
TRANSCRIPT
Mark Rittman, Oracle ACE Director
UKOUG Database SIG Meeting
London, February 2017
•Oracle ACE Director, Independent Analyst
•Past ODTUG Exec Board Member + Oracle Scene Editor
•Author of two books on Oracle BI
•Co-founder & CTO of Rittman Mead
•15+ Years in Oracle BI, DW, ETL + now Big Data
•Host of the Drill to Detail Podcast (www.drilltodetail.com)
•Based in Brighton & work in London, UK
About The Presenter
“Hi Mark, In things I have seen and read quite often people start with a high-level overview of a product (e.g. Hadoop, Kafka), then describe the technical concepts (using all the appropriate terminology) …”
“but I am usually left missing something. I think it's around the area of what problems these technologies are solving and how they are doing it? Without that context I'm finding it all very academic”
“Many people say traditional systems will still be needed. Are these new technologies solving completely different problems to those handled by traditional IT? Is there an overlap?”
•Started back in 1996 on a bank Oracle DW project
•Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL and shell scripts
•Data warehouses provided a unified view of the business
•Single place to store key data and metrics
•Joined-up view of the business
•Aggregates and conformed dimensions
•ETL routines to load, cleanse and conform data
•BI tools for simple, guided access to information
•Tabular data access using SQL-generating tools
•Drill paths, hierarchies, facts, attributes
•Fast access to pre-computed aggregates
•Packaged BI for fast-start ERP analytics
20 Years in Old-school BI & Data Warehousing
•Google needed to store and query their vast amounts of server log files
•And wanted to do so using cheap, commodity hardware
•Google File System and MapReduce were designed together for this use
Google File System and MapReduce
•GFS optimised for the particular task at hand - computing PageRank for sites
•Streaming reads for PageRank calcs, block writes for crawler whole-site dumps
•Master node only holds metadata
•Stops client/master I/O being a bottleneck; also acts as traffic controller for clients
•Simple design, optimised for specific Google needs
•MapReduce provided a simple abstraction framework for computations
•Select & filter (map) and reduce (aggregate) functions, easy to distribute across a cluster
•MapReduce abstracted cluster compute, GFS abstracted cluster storage
•Projects that inspired Apache Hadoop + HDFS
Google File System + MapReduce Key Innovations
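The map and reduce phases above can be sketched in a few lines of Python. This is a minimal single-process simulation of the model, not Hadoop's actual API; the function names and the word-count job are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # MAP: select & filter - emit a (key, value) pair per word
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # REDUCE: aggregate every value seen for one key
    return (key, sum(values))

def run_job(lines):
    # The framework's "shuffle" groups map output by key, so each
    # key arrives at exactly one reducer
    pairs = sorted(kv for line in lines for kv in map_phase(line))
    return [reduce_phase(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

print(run_job(["the quick brown fox", "the lazy dog"]))
# → [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

Because the map and reduce functions are side-effect-free and keyed, the framework can run thousands of them in parallel across a cluster, which is the innovation the slide describes.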
How Traditional RDBMS Data Warehousing Scaled-Up
Shared-Everything Architectures (i.e. Oracle RAC, Exadata)
Shared-Nothing Architectures (e.g. Teradata, Netezza)
•Enterprise high-end RDBMSs such as Oracle can scale
•Clustered single-database systems can scale to >PB
•Exadata scales further by offloading queries to storage
•Sharded databases (e.g. Netezza) can scale further still
•But cost (and complexity) become limiting factors
•Costs of $1m/node are not uncommon
Cost and Complexity around Scaling DW Clusters
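The shared-nothing approach mentioned above can be sketched simply: each node owns a hash bucket of the rows, so storage and CPU scale together by adding nodes. This is an illustrative toy, not any vendor's implementation; all names are made up.

```python
NUM_NODES = 4

def node_for(key):
    # Hash-distribute rows across nodes; no shared disk or memory
    return hash(key) % NUM_NODES

def insert(nodes, key, row):
    nodes[node_for(key)][key] = row

def scatter_gather_count(nodes):
    # A query fans out to every node, each node scans only its own
    # slice, and the coordinator combines the partial results
    return sum(len(shard) for shard in nodes)

nodes = [dict() for _ in range(NUM_NODES)]
for i in range(1000):
    insert(nodes, f"cust{i}", {"id": i})
print(scatter_gather_count(nodes))  # → 1000
```

The cost problem on the slide follows directly: every increment of capacity in this model means buying and operating another full (expensive) node.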
•A way of storing (non-relational) data cheaply, on easily expandable clusters
•Gave us a way of scaling beyond TB-size without paying $$$
•First use-cases were offline storage and active archive of data
Hadoop’s Original Appeal to Data Warehouse Owners
Hadoop Ecosystem Expanded Beyond MapReduce
•Core Hadoop, MapReduce and HDFS
•HBase and other NoSQL databases
•Apache Hive and SQL-on-Hadoop
•Storm, Spark and stream processing
•Apache YARN and Hadoop 2.0
•Solution to the problem of storing semi-structured data at-scale
•Built on Google File System
•Scale for capacity, e.g. the webtable:
•100,000,000,000 pages,
•10 versions per page,
•20 KB / version = 20 PB of data
•Scale for throughput
•Hundreds of millions of users
•Tens of thousands to millions of queries/sec
•At low latency with high reliability
Google BigTable, HBase and NoSQL Databases
•Optimised for a particular task - fast lookups of timestamp-versioned web data
•Data stored in multidimensional map keyed on row, column + timestamp
•Master + data tablets stored on GFS cluster nodes
•Simple key/value lookup with client doing interpretation
•Innovation - focus on single job with different needs to OLTP
•Formed inspiration for Apache HBase
How BigTable Scaled Beyond Traditional RDBMSs
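The data model described above - a sparse multidimensional map keyed on row, column and timestamp, with the client interpreting the stored value - can be sketched as follows. This is a toy illustration of the model only, not the HBase or BigTable API.

```python
class TabletSketch:
    def __init__(self):
        # (row, column) -> list of (timestamp, value) versions
        self.cells = {}

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), []).append((timestamp, value))

    def get(self, row, column):
        # Simple key/value lookup optimised for one job:
        # return the newest version of a cell
        versions = self.cells.get((row, column), [])
        return max(versions)[1] if versions else None

t = TabletSketch()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 2, "<html>v2</html>")
print(t.get("com.cnn.www", "contents:"))  # → <html>v2</html>
```

Note what is missing compared with an RDBMS: no joins, no transactions, no SQL - which is exactly the trade-off that let BigTable scale where OLTP databases could not.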
•Originally developed at Facebook, now foundational within Hadoop
•SQL-like language that compiles to MapReduce, Spark, HBase
•Solved the problem of enabling non-programmers to access big data
•And made Hadoop data transformation and aggregation code more productive
•JDBC and ODBC drivers for tool integration
Hive - Hadoop Discovers Set-Based Processing
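The set-based processing idea can be made concrete: a HiveQL statement such as `SELECT page, COUNT(*) FROM logs GROUP BY page` compiles down to a map over rows emitting keys, followed by a reduce that aggregates per key. The sketch below shows that equivalence in plain Python; the table and column names are made up.

```python
from collections import Counter

logs = [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]

# map: project the grouping key from each row
keys = (row["page"] for row in logs)

# reduce: aggregate counts per key
result = Counter(keys)
print(result)  # → Counter({'/home': 2, '/pricing': 1})
```

The productivity gain on the slide comes from writing the one-line SQL instead of hand-coding the map and reduce jobs it stands for.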
•Hive is extensible to help with accessing and integrating new data sets
•SerDes : Serializer-Deserializers that interpret semi-structured sources
•UDFs + Hive Streaming : user-defined functions and streaming input
•File Formats : make use of compressed and/or optimised file storage
•Storage Handlers : use storage other than HDFS (e.g. MongoDB)
Apache Hive as SQL Access Engine For Everything
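What a SerDe does at read time can be sketched as below: turn one semi-structured record (here an Apache-style access log line) into a row of named columns. The regex and column names are illustrative, not Hive's actual RegexSerDe configuration.

```python
import re

LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})'
)

def deserialize(line):
    # One raw log line in, one row of named columns out;
    # applied lazily at query time, not at load time
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None

row = deserialize('10.0.0.1 - - [01/Feb/2017:10:00:00 +0000] "GET / HTTP/1.1" 200')
print(row["host"], row["status"])  # → 10.0.0.1 200
```

Because the interpretation lives in the SerDe rather than the storage layer, the same raw files can be exposed to SQL tools without reloading or restructuring them.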
•Hadoop as low-cost ETL pre-processing engine - “ETL-offload”
•NoSQL database for landing real-time data at high speed/low latency
•Incoming data then aggregated and stored in RDBMS DW
Common Hadoop/NoSQL Use-Case
[Diagram: Hadoop landing layer feeding aggregates (Σ) into the Data Warehouse and Marts, accessed by Business Intelligence tools; the Hadoop layer is characterised as online, scalable, flexible and cost-effective]
•Driven by pace of business, and user demands for more agility and control
•Traditional IT-governed data loading not always appropriate
•Not all data needed to be modelled right-away
•Not all data suited storing in tabular form
•New ways of analyzing data beyond SQL
•Graph analysis
•Machine learning
Data Warehousing and ETL Needed Some Agility
•Storing data in the format it arrived in, then applying a schema at query time
•Suits data that may be analysed in different ways by different tools
•In addition, some datatypes may have a schema embedded in the file format
•Key benefit - fast-arriving data of unknown value can get to users earlier
•Made possible by tools such as Apache Hive + SerDes, Apache Drill, self-describing file formats and HDFS storage
Advent of Schema-on-Read, and Data Lakes
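Schema-on-read can be sketched in a few lines: raw events are landed exactly as they arrive (here, JSON lines), and each consumer projects its own schema at query time. The field names and data are illustrative.

```python
import json

raw_landing_zone = [
    '{"user": "alice", "event": "click", "ms": 120}',
    '{"user": "bob", "event": "view"}',                       # missing field - still landed
    '{"user": "carol", "event": "click", "ms": 85, "x": 1}',  # extra field - still landed
]

def query(raw_lines, schema):
    # Projection happens at read time: unknown fields are ignored,
    # missing ones become None instead of failing the load
    for line in raw_lines:
        rec = json.loads(line)
        yield {col: rec.get(col) for col in schema}

rows = list(query(raw_landing_zone, ["user", "ms"]))
print(rows[1])  # → {'user': 'bob', 'ms': None}
```

Contrast with schema-on-write: in a traditional DW, the second and third records would have to be cleansed or rejected before loading, delaying exactly the fast-arriving data the slide describes.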
•Data now landed in Hadoop clusters, NoSQL databases and cloud storage
•Flexible data storage platform with cheap storage, flexible schema support + compute
•Solves the problem of how to store new types of data + choose the best time/way to process it
•Hadoop/NoSQL increasingly used for all store/transform/query tasks
Meet the New Data Warehouse : The “Data Lake”
[Diagram: Data Reservoir architecture. Data streams arrive via file-based, stream-based and ETL-based integration into a Hadoop platform split into a Data Factory and a Data Reservoir. Raw customer data is stored in its original format (usually files), such as SS7, ASN.1 or JSON; mapped customer data sets are produced by mapping and transforming the raw data. Sources include operational data (transactions, customer master data) and unstructured data (voice + chat transcripts). A safe and secure Discovery & Development Labs environment works on data sets and samples, producing models and programs; machine-learning models and segments feed marketing/sales applications via data transfer and data access layers, alongside business intelligence tools.]
Hadoop 2.0 and YARN (“Yet Another Resource Negotiator”)
Key Innovation : Separating how data is stored, from how it is processed
•Hadoop started by being synonymous with MapReduce, and Java coding
•But YARN (Yet Another Resource Negotiator) broke this dependency
•Hadoop now just handles resource management
•Multiple different query engines can run against data in-place
•General-purpose (e.g. MapReduce)
•Graph processing
•Machine learning
•Real-time processing
Hadoop 2.0 - Enabling Multiple Query Engines
•New generation of big data platform services from Google, Amazon, Oracle
•Combines three key innovations from earlier technologies:
•Organising of data into tables and columns (from RDBMS DWs)
•Massively-scalable and distributed storage and query (from Big Data)
•Elastically-scalable Platform-as-a-Service (from Cloud)
Elastically-Scalable Data Warehouse-as-a-Service
•On-premise Hadoop, even with simple resilient clustering, will hit limits
•Clusters can reach 5000+ nodes, and need to scale-up for demand peaks
•Scale limits are encountered, though way beyond those for DWs…
•…but the future is elastically-scaled query and compute-as-a-service
What Problem Did Analytics-as-a-Service Solve?
Oracle Big Data Cloud Compute Edition: free $300 developer credit at https://cloud.oracle.com/en_US/tryit
•And things come full-circle … analytics typically requires tabular data
•Google BigQuery based on the Dremel massively-parallel query engine
•But stores data in columnar format and provides a SQL interface
•Solves the problem of providing DW-like functionality at scale, as-a-service
•This is the future … ;-)
BigQuery : Big Data Meets Data Warehousing
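Why columnar storage suits this kind of DW-style analytics can be sketched in a few lines: a query that touches one column reads only that column's values, not whole rows. The data and field names below are made up for illustration.

```python
rows = [
    {"region": "UK", "revenue": 100, "notes": "..."},
    {"region": "US", "revenue": 250, "notes": "..."},
    {"region": "UK", "revenue": 175, "notes": "..."},
]

# Column-oriented layout: store each column contiguously, so a scan of
# one column never drags the other fields through I/O
columns = {name: [r[name] for r in rows] for name in rows[0]}

# SELECT SUM(revenue): read one compact vector, not three full records
total = sum(columns["revenue"])
print(total)  # → 525
```

At BigQuery scale the same idea means scanning gigabytes instead of terabytes for a typical aggregate query, which is what makes DW-like functionality viable as-a-service.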