Apache HBase 0.98
DESCRIPTION
An introduction to new features in Apache HBase, with Chinese translation.
TRANSCRIPT
-
Apache HBase 0.98
Andrew Purtell Committer, Apache HBase, Apache Software Foundation
Big Data US Research And Development, Intel
Translation: Tianyou Li
-
Who Am I?
Committer and PMC Member, Apache HBase project
Member of the Big Data Research And Development Group at Intel
Release manager for Apache HBase 0.98
-
What is Apache HBase?
A high performance horizontally scalable datastore engine for Big Data, suitable as the store of record for
mission critical data
Apache Software Foundation community project
Open source
Free license
-
HBase and Big Data
1994-2006: Large Internet companies first encounter Big Data
(Today: 94% corporate data growth YoY)
-
HBase and Big Data
2006-today: The openness of the early leaders provides a blueprint for motivated and talented open
source communities
                                 Google      Apache, Yahoo, FB
Distributed filesystem           GFS         HDFS
Horizontally scalable database   BigTable    HBase
Parallel programming model       MapReduce   Hadoop
Distributed lock manager         Chubby      ZooKeeper
-
HBase and Big Data
Now: HBase is a foundation of Big Data use cases
-
HBase and Hadoop
[Diagram: HBase within the Hadoop stack. Components shown:]
Sqoop: RDB data collector
Flume: log data collector
ZooKeeper: coordination
YARN (MRv2): cluster resource manager / MapReduce
HDFS 2.0: Hadoop Distributed File System
Giraph: graph analysis framework
HBase Coprocessors: data execution engine
HBase: distributed database
The Java Virtual Machine
Hadoop Common JNI
Spark: iterative in-memory computation
Mahout: data mining
Pig: data manipulation
Hive: structured query
Oozie: data flow
Shark: structured query
R: statistics
-
The HBase Data Model
(Tablespaces)
Not a spreadsheet, think of a distributed sorted map
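To make the "distributed sorted map" idea concrete, here is a minimal single-process sketch in Python. It is a toy model, not HBase code: the class name `SortedTable` and its methods are hypothetical, and it only shows that rows are kept in sorted key order, that each cell can hold several timestamped versions, and that a scan is a walk over a sorted key range.

```python
import bisect


class SortedTable:
    """Toy model: row key -> column -> timestamp -> value, with sorted row keys."""

    def __init__(self):
        self._rows = {}          # row key -> {column: {timestamp: value}}
        self._sorted_keys = []   # row keys kept in lexicographic byte order

    def put(self, row, column, timestamp, value):
        if row not in self._rows:
            bisect.insort(self._sorted_keys, row)
            self._rows[row] = {}
        self._rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it does not exist."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order; start is inclusive, stop is exclusive."""
        lo = bisect.bisect_left(self._sorted_keys, start_row)
        hi = bisect.bisect_left(self._sorted_keys, stop_row)
        for key in self._sorted_keys[lo:hi]:
            yield key


table = SortedTable()
table.put(b"row3", b"cf:a", 1, b"x")
table.put(b"row1", b"cf:a", 1, b"old")
table.put(b"row1", b"cf:a", 2, b"new")
table.put(b"row2", b"cf:b", 1, b"y")
```

Rows come back in key order regardless of insertion order, which is why row key design matters so much more here than in a spreadsheet-style mental model.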
-
How HBase Achieves Scalability
[Diagram labels: Table A, Table B, Splits, Regions, Assignments, RegionServers]
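The splitting and assignment idea can be sketched as a lookup over sorted region start keys. This is only an illustration under assumed data: the split points, the server names, and the `locate` function are hypothetical, and a real client finds regions through HBase's own metadata rather than a hard-coded list.

```python
import bisect

# Hypothetical split points for one table. Each region covers the key range
# from its start key (inclusive) up to the next region's start key (exclusive).
REGION_START_KEYS = [b"", b"g", b"n", b"t"]

# Hypothetical assignment of each region to a RegionServer.
ASSIGNMENTS = ["rs1", "rs2", "rs1", "rs3"]


def locate(row_key):
    """Return (region index, server name) for the region containing row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, ASSIGNMENTS[index]
```

Because each region owns a contiguous key range, adding capacity amounts to splitting ranges and reassigning regions to more servers.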
-
HBase As Data Application Platform
HBase Coprocessors
In-process system extension framework
Observers (like triggers)
Endpoints (like stored procedures)
System integrators can deploy application code that runs where the data resides
-
HBase Differentiators
                        RDBMS                            HBase
Data layout             Row oriented                     Column oriented
Transactions            Multi-row ACID                   Multi-row within region only
Query language          Native SQL                       No native query language
Security                AuthN and AuthZ (ACL)            AuthN and AuthZ (ACL, visibility labels); new in 0.98
Indexes                 On arbitrary columns             Single row index only
Max data size           Terabytes                        Petabytes
R/W throughput limits   1000s of operations per second   Millions of operations per second
-
New In Apache HBase 0.98.0
New security features and improvements
Cell tags
HFile v3
Transparent server side encryption (HBASE-7544)
Per-cell ACLs (HBASE-7662)
Cell level visibility labels (HBASE-7663)
EXEC access permission checks for Endpoints (HBASE-6104)
-
New In Apache HBase 0.98.0
New features
Reverse scans (HBASE-4811)
MapReduce over snapshots (HBASE-8369)
Performance improvements
Improved WAL write threading model (HBASE-8755)
Stripe compactions (HBASE-7667)
REST streaming scans (HBASE-9343)
-
Cell Tags
All values written to HBase are stored in cells
Cells can now also carry one or more tags
Tags are metadata, considered distinct from the key and the value
We use tags to implement per-cell ACLs and visibility labels
-
HFile Version 3
New file format, supporting cell tags and block encryption
Enabled with a site configuration file change
hfile.format.version = 3
HFile v2 data is transparently migrated over time as new files are written by flushes and compactions
-
Transparent Encryption (HBASE-7544)
Built on a new cryptographic codec and key management framework inside HBase
Transparent encryption of HBase on disk data
Supports schema design that places sensitive information in only a subset of column families
-
Transparent Encryption (HBASE-7544)
-
Per-Cell ACLs (HBASE-7662)
Extends the existing HBase ACL model with support for persisting and checking per-cell ACL data in tags
Backwards compatible
We timestamp ACLs on a cell like any other HBase data for straightforward policy evolution
-
Visibility Labels (HBASE-7663)
Visibility expression support via new security coprocessor
Labels: arbitrary strings
Expressions: Labels joined in boolean expressions
Operators: &, |, !, ( )
secret
secret | topsecret
( secret | topsecret ) & !probationary
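As an illustration of how such expressions could be evaluated against a set of authorizations, here is a small recursive-descent sketch in Python. It is not the HBase implementation: the function name `is_visible` is hypothetical, and the operator precedence used here (`!` binds tightest, then `&`, then `|`) is an assumption made for the sketch.

```python
import re


def is_visible(expression, authorizations):
    """Evaluate a label expression against a set of authorization labels."""
    # Operators are single characters; any other run of non-space characters
    # is treated as a label.
    tokens = re.findall(r"[&|!()]|[^\s&|!()]+", expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "|":
            advance()
            right = parse_and()  # always parse the right side to consume it
            result = result or right
        return result

    def parse_and():
        result = parse_not()
        while peek() == "&":
            advance()
            right = parse_not()
            result = result and right
        return result

    def parse_not():
        token = advance()
        if token == "!":
            return not parse_not()
        if token == "(":
            result = parse_or()
            if advance() != ")":
                raise ValueError("missing closing parenthesis")
            return result
        if token is None or token in {"&", "|", ")"}:
            raise ValueError("expected a label")
        return token in authorizations

    visible = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return visible
```

With the three example expressions from the slide, a user holding only `secret` passes all three, while a user holding both `secret` and `probationary` fails the last one.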
-
Visibility Labels (HBASE-7663)
New client APIs and new shell commands for label management, similar to those of Apache Accumulo, for easy migration
Users specify visibility expressions on cells
Users ask for authorizations on Gets and Scans
The server decides which authorizations are valid
Scan results are filtered according to the user's valid authorizations
-
Endpoint EXEC Grants (HBASE-6104)
HBase ACLs grant a familiar set of privileges to users and groups:
(R)ead, (W)rite, E(X)ecute, (C)reate, (A)dmin
However, versions prior to 0.98.0 ignore X
Now access to coprocessor Endpoint invocations can be controlled on a global, per-table, or per-column family basis
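A toy model of the scoping described above: an Endpoint call is allowed if the user holds X at the global, table, or column family scope. The grant table layout and the `can_exec` function are hypothetical and only illustrate the scope hierarchy; they do not reproduce the actual access controller logic.

```python
def can_exec(grants, user, table, family=None):
    """Return True if the user holds X at global, table, or family scope."""
    # Scopes are keyed as (user, table, family); None means "any" at that level.
    scopes = [(user, None, None), (user, table, None)]
    if family is not None:
        scopes.append((user, table, family))
    return any("X" in grants.get(scope, "") for scope in scopes)


# Hypothetical grants using the R, W, X, C, A privilege letters.
grants = {
    ("alice", None, None): "RWXCA",     # global grant
    ("bob", "orders", None): "RX",      # per-table grant
    ("carol", "orders", "cf1"): "X",    # per-column family grant
}
```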
-
Reverse Scans (HBASE-4811)
A new scanner type that seeks to the end of a range and then steps backwards
No longer necessary to manually maintain reverse index tables for descending sorts
Exposed at the client with a new Scan option
Scan#setReversed(boolean reversed)
Performance is on par with normal (forward) scanning
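The "seek to the end of the range, then step backwards" behavior can be sketched over a sorted key list. This is a conceptual model in Python rather than the HBase client API; the function `reverse_scan` is hypothetical, and treating the start row as inclusive and the stop row as exclusive is an assumption of the sketch.

```python
import bisect


def reverse_scan(sorted_keys, start_row, stop_row):
    """Yield keys from start_row (inclusive) down to stop_row (exclusive)."""
    # Seek to the last key that is <= start_row, then walk backwards.
    index = bisect.bisect_right(sorted_keys, start_row) - 1
    while index >= 0 and sorted_keys[index] > stop_row:
        yield sorted_keys[index]
        index -= 1


keys = [b"a", b"c", b"e", b"g"]
```

Note that for a reversed scan the start row is the larger key, since iteration proceeds toward smaller keys.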
-
MapReduce Over Snapshots (HBASE-8369)
Adds MapReduce utilities supporting jobs over snapshots of table data
Clients can skip the HBase API and read HFiles directly on disk from a table snapshot
Can increase throughput ~5x by skipping many system layers
Not recommended from a security perspective
Built in access control is completely bypassed
-
Improved WAL Write Throughput (HBASE-8755)
Introduces a new threading model for WAL writes that reduces lock contention
Provides better write throughput under load: a ~15% improvement in write ops/sec at high write concurrency
-
Stripe Compactions (HBASE-7667)
Stripe compactions split the data inside the region by row key and create sub-ranges of data
Sub-ranges are compacted independently
Can reduce read latency variability and reduce compaction data volume (write amplification)
Some use cases can benefit, but the feature is complex to configure and tune; consult the documentation for details
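The following Python sketch illustrates only the core idea: files are partitioned into row key sub-ranges (stripes), and compacting one stripe rewrites only that stripe's data. It is a deliberately simplified model, not HBase's compaction policy; the boundaries, the `flush` and `compact` functions, and the way flushed data is placed directly into stripes are all assumptions made for illustration.

```python
import bisect
import heapq

# Hypothetical stripe boundaries giving three stripes:
# keys below b"h", keys from b"h" up to b"p", and keys from b"p" upward.
STRIPE_BOUNDARIES = [b"h", b"p"]


def stripe_index(row_key):
    """Return the index of the stripe whose key range contains row_key."""
    return bisect.bisect_right(STRIPE_BOUNDARIES, row_key)


def flush(stripes, rows):
    """Split one flushed batch of row keys into one sorted file per stripe."""
    per_stripe = {}
    for row in sorted(rows):
        per_stripe.setdefault(stripe_index(row), []).append(row)
    for index, stripe_rows in per_stripe.items():
        stripes[index].append(stripe_rows)


def compact(stripes, index):
    """Merge the files of one stripe into one file; return rows rewritten."""
    merged = list(heapq.merge(*stripes[index]))
    stripes[index] = [merged]
    return len(merged)


stripes = [[], [], []]
flush(stripes, [b"a", b"k", b"z"])
flush(stripes, [b"b", b"c", b"q"])
rewritten = compact(stripes, 0)
```

Only the three rows in the first stripe are rewritten; the files in the other stripes are left untouched, which is where the reduction in compaction data volume comes from.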
-
REST Streaming Scans (HBASE-9343)
Introduces a new scanning mode to the REST API for stateless scanning
The client manages paging and limits
Instead of batching results from the RegionServers into multiple HTTP transactions, the stateless scanner can stream all results back to the client over one HTTP connection
-
Upgrading to HBase 0.98.0
Direct upgrade possible from 0.94 to 0.98 using an offline data migration procedure
Upgrade from 0.96 to 0.98 is seamless
Wire compatibility
Mixed client-server and server-server operation with 0.96 possible as long as no 0.98-specific features are enabled
Binary API compatibility not guaranteed; some applications may need minor changes
-
Future of HBase 0.98.x Branch
Minor releases (0.98.1, 0.98.2, etc.) expected; these will contain:
Bug fixes
Performance improvements
Deprecations of some APIs for HBase 1.0
Tag compression in HFile
Performance improvements for encryption
-
End
Questions?