
© Hortonworks Inc. 2011 – 2015. All Rights Reserved

Apache Hive Present and Future

Yifeng Jiang, Solutions Engineer, Hortonworks, Inc.
July 23, 2015


About Me

蒋　燚峰 (Yifeng Jiang)

• Solutions Engineer, Hortonworks inc.

• HBase book author

• Hobbies: hiking, watching movies


Agenda

• Apache Hive Present

• How Hive Achieved 100x Performance

• Sub-second Response


Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP

Customer Momentum

• 430+ customers (as of March 31, 2015)

• 105 customers added in Q1 2015

Hortonworks Data Platform

• Completely open multi-tenant platform for any app & any data.

• A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.

Partner for Customer Success

• Open source community leadership focused on enterprise needs

• Unrivaled world class support

• Founded in 2011

• Original 24 architects, developers, operators of Hadoop from Yahoo!

• 600+ Employees

• 1100+ Ecosystem Partners

Apache Project   Committers   PMC Members
Hadoop           27           21
Pig              5            5
Hive             18           6
Tez              16           15
HBase            6            4
Phoenix          4            4
Accumulo         2            2
Storm            3            2
Slider           11           11
Falcon           5            3
Flume            1            1
Sqoop            1            1
Ambari           36           28
Oozie            3            2
Zookeeper        2            1
Knox             13           3
Ranger           11           n/a
TOTAL            164          109


Hortonworks Data Platform (HDP) 2.2 Stack

Hive: SQL on Hadoop


Apache Hive Present
Transaction, Security, Performance


Apache Hive: SQL on Hadoop

• OSS data warehouse built on top of Hadoop

• Apache Hive first released in 2009

• Initial goal was to write MapReduce jobs in SQL

– Most queries ran from minutes to hours

– Primarily used for batch processing


Hive – Single tool for all SQL use cases

Data sources:

• OLTP, ERP, CRM Systems

• Unstructured documents, emails

• Clickstream

• Server logs

• Sentiment, Web Data

• Sensor, Machine Data

• Geolocation

Hive - SQL use cases:

• Interactive Analytics

• Batch Reports / Deep Analytics

• ETL / ELT


Hive Scales to Any Workload

Hive at Facebook

• 100+ PB of data under management

• 15+ TB of data loaded daily

• 60,000+ Hive queries per day

• More than 1,000 users per day


Transactions
Insert, Update and Delete SQL Statements


Transaction Use Cases

Reporting with Analytics (YES)

• Reporting on data with occasional updates

• Corrections to the fact tables, evolving dimension tables

• Low concurrency updates, low TPS

Operational (OLTP) Database (NO)

• Small transactions, each doing single line inserts

• High concurrency: hundreds to thousands of connections

[Diagram: an OLTP system replicated into Hive, with analytics modifications made in Hive (YES); high-concurrency OLTP running directly on Hive (NO)]


Deep Dive: Transaction

Transaction Support in Hive with ACID semantics

• Hive native support for INSERT, UPDATE, DELETE.

• Split into phases:

– Phase 1: Hive Streaming Ingest (append) [Done]

– Phase 2: INSERT / UPDATE / DELETE Support [Done]

– Phase 3: BEGIN / COMMIT / ROLLBACK Txn [Next]
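
To make the INSERT / UPDATE / DELETE support concrete, here is a minimal sketch of an ACID table in HiveQL. The table customer_dim, its columns, and the bucket count are hypothetical and chosen only for illustration; the session settings are the ones commonly required for transactions in Hive 0.14-era releases and may differ in other versions.

```sql
-- Session settings commonly needed for Hive transactions (assumption: Hive 0.14-era).
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical dimension table. A transactional table is bucketed,
-- stored as ORC, and flagged with the "transactional" table property.
CREATE TABLE customer_dim (
  customer_id INT,
  name        STRING,
  city        STRING
)
CLUSTERED BY (customer_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional"="true");

-- Low-volume modifications, matching the "occasional updates" use case.
INSERT INTO TABLE customer_dim VALUES (1, 'Alice', 'Shanghai'), (2, 'Bob', 'Tokyo');
UPDATE customer_dim SET city = 'Beijing' WHERE customer_id = 1;
DELETE FROM customer_dim WHERE customer_id = 2;
```

Each statement runs as its own transaction; multi-statement BEGIN / COMMIT / ROLLBACK belongs to the phase marked [Next].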

1. Original File: task reads the latest read-optimized ORCFile

2. Edits Made: task reads the ORCFile and merges the delta file with the edits

3. Edits Merged: task reads the updated read-optimized ORCFile

Hive ACID Compactor periodically merges the delta files in the background.


Hive Compaction


[Diagram: multiple delta files are merged into the read-optimized ORCFile by minor / major compaction]

• Minor compaction: 10% local

• Major compaction: 10% global
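
Compaction normally runs in the background, but it can also be requested by hand. The sketch below assumes a hypothetical transactional table named customer_dim; the automatic compactor is driven by metastore-side configuration that is not shown here.

```sql
-- Merge the accumulated delta files of the hypothetical table into fewer deltas.
ALTER TABLE customer_dim COMPACT 'minor';

-- Rewrite the base file together with its deltas into a new base file.
ALTER TABLE customer_dim COMPACT 'major';

-- List queued, running, and recently finished compactions.
SHOW COMPACTIONS;
```

The ALTER TABLE statements only enqueue a request; the work itself is done asynchronously, so SHOW COMPACTIONS is the way to follow progress.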


Security
Hive User’s perspective


Ranger: Central Security Administration

Apache Ranger

• Security dashboard

• Centralizes administration of security policy

• Ensures consistent coverage across the entire Hadoop stack


Setup Authorization Policy (Hive)


file level access control, flexible definition

Control permissions


How Hive Achieved 100x Performance
ORC, Tez, CBO, Vectorization


Need for Speed: The Stinger Initiative

Stinger: An Open Roadmap to improve Apache Hive’s performance 100x.

Launched: February 2013; Delivered: April 2014.

Delivered in 100% Apache Open Source.

SQL Engine (Vectorized SQL Engine) + Columnar Storage (ORCFile) + Distributed Execution (Apache Tez) = 100X


TPC-DS Benchmark at 30 Terabyte Scale

Sample of 50 queries from TPC-DS at 30 terabyte scale.

Average 52x Query Speedup, Maximum 160x Query Speedup.

Total benchmark time decreased from 7.8 days to 9.3 hours.

Cost-Based Optimizer added in Hive 0.14 gave an additional 2.5x speedup.


ORC File Format
Columnar Storage for Hive


ORCFile – Columnar Storage for Hive

• Columns stored separately

• Knows types

– Uses type-specific encoders

– Stores statistics (min, max, sum, count)

• Has light-weight index

– Skip over blocks of rows that don’t matter
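
The min/max statistics and light-weight index pay off when a filter can be pushed down to the ORC reader. The following is a sketch only: the table addresses_orc and its columns are hypothetical, and whether blocks are actually skipped depends on how the data is laid out in the files.

```sql
-- Let Hive push filter predicates down to the storage layer so the ORC
-- reader can use its index and column statistics.
SET hive.optimize.index.filter=true;

-- Hypothetical ORC table. Blocks of rows whose zip min/max range cannot
-- contain 94086 are candidates for being skipped; only the name and city
-- columns need to be read for the result.
SELECT name, city
FROM addresses_orc
WHERE zip = 94086;
```

Skipping works best when the filtered column is roughly sorted within the files, so that each block covers a narrow value range.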


ORCFile – Columnar Storage for Hive

Large block size ideal for map/reduce.

Columnar format enables high compression and high performance.


ORCFile – Create Table

• Defined at table or partition level

• Configurable compression codec

create table Addresses (
  name string,
  street string,
  city string,
  state string,
  zip int
) stored as orc tblproperties ("orc.compress"="ZLIB");


ORCFile – Convert Text to ORC

• Always use ORC

• One SQL to convert text to ORC

-- Create Text & ORC tables
-- (the text table is declared comma-delimited so the CSV columns parse correctly)
CREATE TABLE test_details_txt (visit_id INT, store_id SMALLINT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;
CREATE TABLE test_details_orc (visit_id INT, store_id SMALLINT) STORED AS ORC;

-- Load into Text table
LOAD DATA LOCAL INPATH '/home/user/test_details.csv' INTO TABLE test_details_txt;

-- Copy to ORC table
INSERT OVERWRITE TABLE test_details_orc SELECT * FROM test_details_txt;
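
When the ORC table does not exist yet, the create and copy steps can be folded into a single CREATE TABLE ... AS SELECT. This is a sketch; the target name test_details_orc_ctas is hypothetical, and the compression setting is optional.

```sql
-- Create the ORC table and fill it from the text table in one statement.
-- Column names and types are taken from the SELECT.
CREATE TABLE test_details_orc_ctas
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB")
AS
SELECT * FROM test_details_txt;
```

The explicit CREATE plus INSERT OVERWRITE form is still the better fit when the target needs partitioning or must be refreshed repeatedly.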


Tez Engine
Beyond MapReduce


MR vs. Tez Example

Hive - MR: multiple jobs, separated by I/O synchronization barriers

• Job 1 (Join a & b)

• Job 2 (Group by of a Join b)

• Job 3 (Group by of c)

• Job 4 (Join of S & R)

Hive - Tez: single job

• Join a & b

• Group by of a Join b

• Group by of c

• Join of S & R
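
A query with the shape used in this example might look like the sketch below. The tables a, b, c and the columns k and v are hypothetical; S and R are the two intermediate results named in the diagram.

```sql
SET hive.execution.engine=tez;

-- S: group by of (a join b); R: group by of c; final step: join of S and R.
SELECT s.k, s.cnt, r.avg_v
FROM (
  SELECT a.k AS k, COUNT(*) AS cnt
  FROM a JOIN b ON a.k = b.k
  GROUP BY a.k
) s
JOIN (
  SELECT c.k AS k, AVG(c.v) AS avg_v
  FROM c
  GROUP BY c.k
) r ON s.k = r.k;
```

On MapReduce each join or group-by stage tends to become its own job with intermediate results written to HDFS in between, while on Tez the stages can run as vertices of a single DAG.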


Tez – Introduction

• Distributed execution framework for data-processing applications

– Targets applications (frameworks), not end users

– Hive on Tez, Pig on Tez, Cascading on Tez, …

• Lessons learned from MapReduce

– Significant performance improvement

– Batch, interactive

– Petabyte scale

• Runs on YARN

– Utilizes cluster resources


Tez – Switch from MapReduce

• One command to switch from MapReduce to Tez

set hive.execution.engine=tez;

SELECT * FROM my_table;

• Set Tez as default engine on Hadoop 2

$ vi hive-site.xml

<property><name>hive.execution.engine</name><value>tez</value></property>


Cost Based Optimizer
Making the SQL smarter


Cost Based Optimizer in Hive

Cost-Based Optimizer (CBO) creates an optimized execution plan using Hive table statistics

Why cost-based optimization?

• Ease of use – e.g., adjusts join order automatically

• Reduces the need for SQL tuning

• An optimized plan leads to better cluster utilization


Performance Improvement – Query 17

Scale = 30 TB, input records ~186M

CBO   Elapsed Time (sec)   Elapsed Time   Intermediate Data (GB)   Output and Intermediate Records
OFF   10,683               ~3 hrs         5,017                    135,647,792,123
ON    1,284                ~20 mins       275                      8,543,232,360


CBO – Enable CBO

• Enable CBO before submitting query

set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

• Refresh statistics

ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;
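
Column statistics complement the basic table statistics, so it is usually worth collecting both and then checking that they were stored. A sketch, reusing the my_table placeholder from above; the exact output of DESCRIBE FORMATTED and EXPLAIN varies by Hive version.

```sql
-- Basic table-level statistics (row count, data size).
ANALYZE TABLE my_table COMPUTE STATISTICS;

-- Column-level statistics used by the optimizer.
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;

-- The table parameters in this output should now include the gathered statistics.
DESCRIBE FORMATTED my_table;

-- Inspect the plan the optimizer produced for a query.
EXPLAIN SELECT COUNT(*) FROM my_table;
```

With hive.compute.query.using.stats enabled, simple aggregates such as the COUNT(*) above may be answered from the stored statistics without scanning the data.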


Vectorized Query Execution
Process 1024 Rows at a Time


Vectorization – Vectorized SQL Engine

• Feature:

– Processes a block of 1024 rows instead of one row at a time

– Leverages modern hardware architecture

• Benefit:

– Up to 3x faster for large queries

– Reduces CPU time, makes better use of cluster resources


Vectorization – Enable Vectorization

• Enable vectorized SQL engine

set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;

• Supports ORC only

• A few data types and features are not supported
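
Because vectorization silently falls back to row-at-a-time processing for unsupported types and features, it helps to check the plan. A sketch, assuming the test_details_orc table from the ORC conversion example; the wording of the EXPLAIN output is version dependent.

```sql
SET hive.vectorized.execution.enabled=true;

-- In the Hive versions this deck covers, a vectorized stage is typically
-- marked in the plan with a line such as "Execution mode: vectorized".
EXPLAIN
SELECT store_id, COUNT(*) AS visits
FROM test_details_orc
GROUP BY store_id;
```

If the marker is missing, the usual suspects are a non-ORC input table or a column type or function that the vectorized engine does not yet handle.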


Hive on Tez: Conclusion

Hive on Tez delivers fast batch and interactive SQL today.

But users need more speed!

• Scale: Proven at petabyte scale.

• SQL: The most comprehensive open-source SQL on Hadoop.

• Speed: More than 90 Hortonworks customers use Hive-on-Tez today for fast SQL.

Hortonworks Customer Support metrics as of Feb/2015


Sub-second Query Response
Solving Hive’s Top Performance Challenges


Next Stop: Stinger.next and Sub-Second SQL

The emergence of LLAP and Hive-on-Spark brings sub-second response within reach.


Apache Hive: Modern Architecture

• Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom, Weblog)

• Engine: SQL Engines (Row Engine, Vector Engine)

• SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)

• Cache: Block Cache (Linux Cache)

• Distributed Execution: Hadoop 1 (MapReduce), Hadoop 2 (Tez)

Legend: Historical, Current, In Development


Apache Hive: Modern Architecture (with components in development)

• Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom, Weblog)

• Engine: SQL Engines (Row Engine, Vector Engine)

• SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)

• Cache: Block Cache (Linux Cache)

• Distributed Execution: Hadoop 1 (MapReduce), Hadoop 2 (Tez)

• Added to the diagram: Vector Cache, LLAP, Persistent Server

Legend: Historical, Current, In Development


LLAP: What

LLAP = Live Long And Process

• LLAP process runs on multiple nodes, accelerating Tez tasks

• LLAP process runs a read task (query fragment) for a query

[Diagram: a Hive query is spread across several nodes, each running an LLAP process; within a node, the LLAP process executes the query fragment using the LLAP in-memory columnar cache on top of HDFS]


LLAP: Why?

• LLAP is a node-resident daemon process

– Low latency by reducing setup cost

• LLAP has an in-memory columnar data cache

– Hot data sits in memory, not HDFS

– Stores data in columnar format for vectorized processing

• Uses YARN for resource management

– Utilizes cluster resources

[Diagram: an LLAP process on a node running a query fragment for a query, with the LLAP in-memory columnar cache on top of HDFS]


Hive Sub-second Response

Hive Metadata (Fast, Scalable Metadata Catalog) + Persistent Server (LLAP) + SQL Engine (Vectorized Hash Join) + Choice of Execution Engines (Tez) = Sub-Second


Key Takeaways
Hive Present and Future


Hive Present and Future

• Hive is the de facto standard of SQL on Hadoop

• One tool, batch and interactive processing

• One tool, all big data SQL use cases: ETL, reporting, BI and analytics

• Hive keeps evolving

• SQL:2011 Analytics support

• Enhance transactions

• Sub-second query response


Try Hive Today

• Try the latest Hive features today

• Hive on Tez

• ORC file format

• CBO

• Vectorization

• Just a few lines of configuration/SQL change

• Stay tuned for Hive evolution


Thank you

Yifeng Jiang, Solutions Engineer, Hortonworks, @uprush
