column and hadoop

Columnar Database and Hadoop, 江志伟 (Alex Jiang), 2012-12-1

Upload: alex-gemini

Post on 20-Jun-2015


DESCRIPTION

My planned talk at HBTC, China's largest big data technology conference, covering columnar databases and related Hadoop topics.

TRANSCRIPT

Page 1: Column and hadoop

Columnar Database and Hadoop

江志伟 (Alex Jiang), 2012-12-1

Page 2: Column and hadoop

1. Column Advantage

2. Storage and Process

3. Hadoop Related

Agenda

Page 3: Column and hadoop

2001 PAX

Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, …

C-Store: A Column Oriented DBMS

D. J. Abadi et al.: Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, pages 671–682, 2006.

D. J. Abadi et al.: Materialization Strategies in a Column-Oriented DBMS. In ICDE, pages 466–475, 2007.

History

Page 4: Column and hadoop

PAX

Columnar storage

(Columnar) compression

PPD vs Index or MV

SerDe

File Format

Page 5: Column and hadoop

PAX

(Picture From oracle blog)

Page 6: Column and hadoop

Columnar Store vs Row Store

● IO-1 (basic column store): Every storage block contains data from only ONE column.

● IO-2: Aggressive compression.

● IO-3: No record-ids.

● CPU-4: A column executor.

● CPU-5: Executor runs on compressed data.

● CPU-6: Executor can process columns that are key sequence or entry sequence.
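
As a rough illustration of IO-1, the sketch below pivots a few rows into one list per column and then sums a single column. The schema (item_id, city, price) and the function names are assumptions made for this example; a real column store works on disk blocks, not Python lists.

```python
from typing import Dict, List, Tuple

# Hypothetical schema for illustration: (item_id, city, price)
Row = Tuple[int, str, float]


def to_columns(rows: List[Row]) -> Dict[str, list]:
    """Pivot row-oriented records into one list per column."""
    item_ids, cities, prices = [], [], []
    for item_id, city, price in rows:
        item_ids.append(item_id)
        cities.append(city)
        prices.append(price)
    return {"item_id": item_ids, "city": cities, "price": prices}


def sum_prices_row_store(rows: List[Row]) -> float:
    # A row layout has to walk every full record to reach one field.
    return sum(row[2] for row in rows)


def sum_prices_column_store(columns: Dict[str, list]) -> float:
    # A column layout reads only the "price" column.
    return sum(columns["price"])


rows = [(1, "Beijing", 1.5), (2, "Shanghai", 2.0), (3, "Beijing", 3.25)]
columns = to_columns(rows)
```

Both functions return the same total; the difference is how much data each one has to touch to get there.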

Page 7: Column and hadoop

Columnar Store advantage

● Compression

RLE, Bitmap ..

● PPD (predicate pushdown)

reduce IO

● Late Materialization

less memory and CPU overhead

● Block Iteration (Vectorization)

less CPU overhead

● Invisible Join

– block as join key

Page 8: Column and hadoop

Compression

● Run-length Encoding

● ENCODING DELTAVAL

● Bit Vector Encoding

● BLOCK_DICT

data skew

compound

● High Selectivity: gender, age

● Mid Selectivity: city, category

● Low Selectivity: item_id, user_id

price, quantity, comment
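
To make two of the encodings above concrete, here is a minimal sketch of run-length encoding and a simple dictionary encoding over in-memory lists. The function names are made up for this example and do not correspond to any particular database's implementation.

```python
from itertools import groupby
from typing import Dict, Hashable, List, Tuple


def rle_encode(values: List[Hashable]) -> List[Tuple[Hashable, int]]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs: List[Tuple[Hashable, int]]) -> List[Hashable]:
    """Expand (value, run_length) pairs back into the original sequence."""
    out: List[Hashable] = []
    for value, count in runs:
        out.extend([value] * count)
    return out


def dict_encode(values: List[Hashable]) -> Tuple[Dict[Hashable, int], List[int]]:
    """Replace each value with a small integer code from a dictionary."""
    dictionary: Dict[Hashable, int] = {}
    codes: List[int] = []
    for value in values:
        # Assign the next free code the first time a value is seen.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


gender = ["F", "F", "F", "M", "M"]
city = ["Beijing", "Shanghai", "Beijing"]
```

Run-length encoding pays off on sorted or few-valued columns such as gender, while dictionary encoding suits columns with a moderate number of distinct values such as city.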

Page 9: Column and hadoop

Column File Format

(Picture From Vertica Blog)

Page 10: Column and hadoop

PPD

Predicate Push Down

Continuous IO

Compound Predicate

Max-Min in each minor Block

PAX supports PPD, but not efficiently
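
The max-min idea can be sketched as follows: each block keeps its minimum and maximum, and a scan skips any block whose range cannot satisfy the predicate. The Block class and scan_greater_than function are hypothetical names for this illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    values: List[int]

    def __post_init__(self) -> None:
        # Per-block statistics, computed once when the block is written.
        self.min_val = min(self.values)
        self.max_val = max(self.values)


def scan_greater_than(blocks: List[Block], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        if block.max_val <= threshold:
            # No value in this block can match, so skip it without reading.
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if v > threshold)
    return matches, blocks_read


blocks = [Block([1, 2, 3]), Block([4, 5, 6]), Block([7, 8, 9])]
```

With sorted data, as in this example, most blocks are skipped; with unsorted data the min-max ranges overlap and far fewer blocks can be pruned.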

Page 11: Column and hadoop

PPD

(Picture from Vertica Blog)

Page 12: Column and hadoop

late materialization

Construct Row

Apply Filter + Projection

Only the columns needed for projection are read (also PPD)

Decoding Column First

Wait until processing

Different compression schemes behave differently
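
A minimal sketch of late materialization, assuming columns held as in-memory lists: the filter runs on one column and yields row positions, and tuples are built only for the surviving positions and only from the projected columns. The function name and sample data are assumptions for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def late_materialize(
    columns: Dict[str, list],
    filter_col: str,
    predicate: Callable[[object], bool],
    project_cols: Sequence[str],
) -> List[Tuple]:
    """Filter on one column first, then build rows only for matches."""
    # Step 1: scan just the filter column and keep matching positions.
    positions = [i for i, v in enumerate(columns[filter_col]) if predicate(v)]
    # Step 2: construct tuples from the projected columns at those positions.
    return [tuple(columns[c][i] for c in project_cols) for i in positions]


table = {
    "item_id": [10, 11, 12, 13],
    "city": ["Beijing", "Shanghai", "Beijing", "Hangzhou"],
    "price": [5.0, 20.0, 30.0, 8.0],
}
result = late_materialize(table, "price", lambda p: p > 10, ["item_id", "city"])
```

Early materialization would instead build every full row first and then filter, which costs more memory and CPU when few rows match.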

Page 13: Column and hadoop

Early Materialization

(Picture from William McKnight)

Page 14: Column and hadoop

Late Materialization

(Picture from William McKnight)

Page 15: Column and hadoop

Common Confusion IO

The more columns selected, the closer to a row store

IO <5%

record-ID

Row store free space at block tail

variable length field

IO Access Pattern means scalability

Hardware Trend

Compression rate

Page 16: Column and hadoop

Common Confusion SerDe

Row or PAX SerDe

cpu cache miss

no columnar compression

Block Iteration (construct tuple or row)

Java vs C/C++

C/C++ direct memory mapping

Java fastutil

Page 17: Column and hadoop

Reduce IO

Avoid Sort

Index join

Lookup

Pre-computation :

Join

Group by

Query Rewrite

Index and MV

Scalability

Storage cost

Complex design

Hard to maintain

High latency

Slows down loading

Loses detail

Page 18: Column and hadoop

Data Modeling

Fat table vs 3NF

Page 19: Column and hadoop

Hadoop Related

File Format

Trevni vs IBM CIF

Schema Evolution

Portable File Format

Bigger Block Size

IO Pattern

SerDe network influence

Page 20: Column and hadoop

Hadoop Related

Storage Cost

NameNode

Fewer blocks

Bigger block size

Cold data even bigger

No Intermediate Level

JobTracker

Each job has fewer map and reduce tasks

DataNode

Page 21: Column and hadoop

Hadoop Related

Real Data ingestion

HBase + Flume

Balanced Data

Write Avro file format first, then sort-merge

SerDe memory reduction

Tuple structure, not row

Batch Update+Delete+Insert

Page 22: Column and hadoop

Hadoop Related

MR Performance Boost

Block Shuffle (3 times faster)

Skewed data has less overhead

Fewer map tasks and bigger spills

Reduce-side combine

Light compression codec (Snappy, not LZO)

Combiner or in-memory combiner deprecated

Page 23: Column and hadoop

Hadoop Related

Easier Performance Tuning

mapred.min.split.size (deprecated)

mapred.child.java.opts

mapred.compress.map.output (deprecated)

io.sort.mb

io.sort.spill.percent (deprecated)

io.sort.factor

mapred.reduce.parallel.copies (deprecated)

Map and reduce counts are easier to estimate

Reduce algorithm will change
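
As a back-of-the-envelope aid for estimating the map count, the sketch below applies the max(min split, min(max split, block size)) rule that Hadoop's FileInputFormat uses to pick a split size. It is a simplification: it ignores the small slack Hadoop allows on the last split and assumes every file is splittable. The function names and the sample sizes are assumptions for illustration.

```python
import math
from typing import Iterable

MB = 1024 * 1024


def split_size(block_size: int, min_split: int = 1, max_split: int = 2**63 - 1) -> int:
    """Split size rule: max(min_split, min(max_split, block_size))."""
    return max(min_split, min(max_split, block_size))


def estimate_map_tasks(file_sizes: Iterable[int], block_size: int, min_split: int = 1) -> int:
    """Rough map count: one task per split, splits never span files."""
    size = split_size(block_size, min_split)
    return sum(math.ceil(n / size) for n in file_sizes if n > 0)


one_gb_file = [1024 * MB]
maps_small_blocks = estimate_map_tasks(one_gb_file, 64 * MB)
maps_big_blocks = estimate_map_tasks(one_gb_file, 256 * MB)
```

For the same 1 GB file, a 256 MB block size gives a quarter of the map tasks of a 64 MB block size, which is the effect the bigger-block recommendation relies on.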

Page 24: Column and hadoop

Hadoop Related

Easy Management

Fewer partitions or dynamic partitioning

Integrity constraints and referential integrity

Statistics make for a simpler query engine

Cold Data automatic merge

Trojan Layout vs Columnar Projections

Less design complexity

Map join vs Fat Table

Group by + Index

Page 25: Column and hadoop
Page 26: Column and hadoop

Reference

● http://www.dbms2.com/2011/02/06/columnar-compression-database-storage/

● http://cs-www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf

● http://www.infoq.com/news/2011/09/nosqlnow-columnar-databases/

● Dremel: Melnik, Gubarev, Long, Romer, Shivakumar, & Tolton, VLDB 2010

● Trevni: http://avro.apache.org/docs/current/trevni/spec.html

● http://www.vertica.com/2011/09/01/the-power-of-projections-part-1/

Page 27: Column and hadoop

Thank you!

Q & A

Alex Jiang

gemini5201314 at gmail dot com

http://www.gemini5201314.net