HBase Applications - Atlanta HUG - May 2014
DESCRIPTION
HBase is good at various workloads, ranging from sequential range scans to purely random access. These access patterns can be translated into application types, usually falling into two major groups: entities and events. This presentation discusses the underlying implications and how to approach those use-cases. Examples taken from Facebook show how this has been tackled in real life.
TRANSCRIPT
1
HBase Applications - Selected Use-Cases around a Common Theme. Atlanta HUG, May 2014. Lars George, Cloudera EMEA Chief Architect
2
About Me
• EMEA Chief Architect @ Cloudera • Consulting on Hadoop projects (everywhere)
• Apache Committer • HBase and Whirr
• O’Reilly Author • HBase – The Definitive Guide
• Now in Japanese!
• Contact • [email protected] • @larsgeorge
The Japanese edition is out now!
3
The Content...
• HBase - Strengths and weaknesses • Common use-cases and patterns • Focus on specific types of applications • Summary
4
HBase Strengths and Weaknesses
5
IOPS vs Throughput Mythbusters
It is all physics in the end: you cannot solve an I/O problem without reducing I/O in general. Parallelize access and read/write sequentially.
6
HBase: Strengths & Weaknesses
Strengths: • Random access to small(ish) key-value pairs • Rows and columns stored sorted lexicographically • Adds table and region concepts to group related KVs • Stores and reads data sequentially • Parallelizes across all clients
• Non-blocking I/O throughout
7
HBase: Strengths & Weaknesses
Weaknesses: • Not optimized (yet) for 100% of the possible throughput of the underlying storage layer
• And HDFS is not fully optimized either
• Single writer issue with WALs • Single server hot-spotting with non-distributed keys
8
Patterns
• There are recurring patterns in many common use-cases, similar to programming patterns.
• We need to extract these common patterns and make them repeatable.
• Similar to the “Gang of Four” (Gamma, Helm, Johnson, Vlissides), or the “Three Amigos” (Booch, Jacobson, Rumbaugh)
9
Common Patterns
10
HBase Dilemma
Although HBase can host many applications, they may require completely opposite features
[Diagram: Events vs. Entities, exemplified by Time Series vs. Message Store]
11
This talk (at this event)
• Message Store • Information exchange between entities • Sending/receiving information is an event
• Time-Series • Sequence of data points measured at successive points in time, spaced at uniform intervals
• Measuring a data point is an event
12
Using HBase Strengths
13
HBase “Indexes” (cont.)
• Use primary keys, aka the row keys, as sorted index • One sort direction only • Use “secondary index” to get reverse sorting (see the sketch below)
• Lookup table or same table
• Use secondary keys, aka the column qualifiers, as sorted index within main record
• Use prefixes within a column family or separate column families
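A quick illustration of the one-direction sort: a common workaround is to invert the timestamp inside the row key so the newest entries sort first. This is a minimal sketch, not taken from the talk; the inbox layout and userId field are hypothetical.

  import org.apache.hadoop.hbase.util.Bytes;

  public class InvertedKeySketch {
    // HBase sorts row keys in ascending order only. Storing an inverted
    // timestamp in the key makes newer entries sort first, so a plain scan
    // from the user prefix returns the newest messages without a reverse scan.
    static byte[] inboxRowKey(String userId, long epochMillis) {
      long inverted = Long.MAX_VALUE - epochMillis; // newer => smaller => sorts first
      return Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(inverted));
    }
  }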
14
Common Use-Cases
15
Use-Case I: Messages
16
HBase Message Store
Use-Case: • Store incoming messages in HBase, such as email, SMS, MMS, IM
• Constant updates of existing entities • e.g. email read, flagged, starred, moved, deleted
• Reading of top-N entries, sorted by time • Newest 20 messages, last 20 conversations
• Examples: • Facebook Messages
17
Problem Description
• Records are of varying size • Large ones hinder smaller ones
• Massive index issue • User can sort and filter by everything • At the same time, reading top-N should be fast • But what to do for automated accounts? 80/20 rule? • Only doable with heuristics
• Only create minimal indexes • Create additional ones when the user asks for them
• Cross-mailbox issues with Conversations • Similar to the timeline in Facebook
• Overall requirements for I/O
18
Interlude I: Compaction Details
Write Amplification in HBase
19
Compactions in HBase
• Must happen to keep data in check • Combine small flush files into larger ones • Remove old data (during major compactions)
• Two types: Minor and Major Compactions • Minor are triggered with API mutation calls • Major are time scheduled (or auto-promoted) • Both can be triggered manually if needed (see the sketch below)
• Add extra background I/O that grows over time • Write amplification!
• Have to be tuned for write-heavy systems
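As a side note, a sketch of triggering compactions manually through the client Admin API; the Connection-based API shown here is newer than parts of this talk, and the table name is illustrative:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;

  public class ManualCompaction {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Admin admin = conn.getAdmin()) {
        TableName table = TableName.valueOf("messages"); // illustrative table name
        admin.compact(table);      // queue a minor compaction for all its regions
        admin.majorCompact(table); // queue a major compaction (rewrites all files)
      }
    }
  }

The HBase shell offers the same through its compact and major_compact commands.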
20
Writes: Flushes and Compactions
[Chart: store file sizes (MB, 0-1000) over time, older to newer; a memstore flush creates HF1]
hbase.hregion.memstore.flush.size = 128MB
21
Writes: Flushes and Compactions
[Chart: a second flush adds HF2 alongside HF1]
22
Writes: Flushes and Compactions
[Chart: a third flush adds HF3 alongside HF2 and HF1]
hbase.hstore.compaction.min = 3 (named hbase.hstore.compactionThreshold in 0.90)
hbase.hstore.compaction.max = 10
23
Writes: Flushes and Compactions
[Chart: 1st compaction (major, auto-promoted) merges the three flush files into CF1]
24
Writes: Flushes and Compactions
[Chart: a new flush adds HF4 next to the compacted file CF1]
25
Writes: Flushes and Compactions
[Chart: another flush adds HF5 alongside HF4 and CF1]
26
Writes: Flushes and Compactions
[Chart: a further flush adds HF6 alongside HF5, HF4, and CF1]
27
Writes: Flushes and Compactions
[Chart: HF4-HF6 and CF1; compaction file selection begins]
hbase.hstore.compaction.ratio = 1.2
hbase.hstore.compaction.min.size = flush size
28
Writes: Flushes and Compactions
[Chart: HF4-HF6 are selected; CF1 falls outside the 120% ratio and is excluded]
hbase.hstore.compaction.ratio = 1.2 (i.e. 120%)
29
Writes: Flushes and Compactions
[Chart: 2nd compaction (major, auto-promoted) merges the selected files into CF2]
30
Writes: Flushes and Compactions
[Chart: a new flush adds HF7 alongside CF2]
31
Writes: Flushes and Compactions
[Chart: HF8 is flushed alongside HF7 and CF2]
32
Writes: Flushes and Compactions
[Chart: HF9 is flushed alongside HF8, HF7, and CF2]
33
Writes: Flushes and Compactions
[Chart: HF10 is flushed alongside HF9, HF8, HF7, and CF2]
34
Writes: Flushes and Compactions
[Chart: selection eliminates files from older to newer until the remaining set is within the ratio]
hbase.hstore.compaction.ratio = 1.2 (i.e. 120%)
35
Writes: Flushes and Compactions
[Chart: 3rd compaction merges the selected files into CF3 alongside CF2]
36
Fast Forward...
37
Writes: Flushes and Compactions
[Chart: after many more flush/compaction cycles, a mix of small and large store files has accumulated]
38
Additional Notes #1
There are a few more settings for compactions: • hbase.hstore.compaction.max = 10 Limits the maximum number of files per compaction
• hbase.hstore.compaction.max.size = Long.MAX_VALUE Excludes files larger than this setting (0.92+)
• hbase.hregion.majorcompaction = 1d Schedules major compactions
39
Additional Notes #2
• hbase.hstore.compaction.kv.max = 10 Limits internal scanner caching while reading the files to be compacted
• hbase.hstore.blockingStoreFiles = 7 Enforces an upper limit on store files so compactions can catch up; blocks user operations!
• hbase.hstore.blockingWaitTime = 90s Upper limit on how long user operations are blocked
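A sketch of how the knobs from the two notes above map onto configuration properties. These are server-side settings that belong in the region server's hbase-site.xml; the Java form below only illustrates the names, types, and default values quoted in the slides:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class CompactionSettings {
    static Configuration withTalkDefaults() {
      Configuration conf = HBaseConfiguration.create();
      conf.setInt("hbase.hstore.compaction.max", 10);                    // files per compaction
      conf.setLong("hbase.hstore.compaction.max.size", Long.MAX_VALUE); // skip larger files
      conf.setLong("hbase.hregion.majorcompaction", 24L * 3600 * 1000); // 1 day, in ms
      conf.setInt("hbase.hstore.compaction.kv.max", 10);                 // compaction scanner batch
      conf.setInt("hbase.hstore.blockingStoreFiles", 7);                 // block writes above this
      conf.setInt("hbase.hstore.blockingWaitTime", 90 * 1000);           // max block time, in ms
      return conf;
    }
  }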
40
Write Fragmentation
Yo, where’s the data at?
41
Writes: Flushes and Compactions
[Chart: store file sizes (MB) over time; legend: Existing Row Mutations vs. Unique Row Inserts]
We are looking at two specific rows: one is never changed, the other frequently.
42
Writes: Flushes and Compactions
[Chart: the first flush file holds cells of both rows]
43
Writes: Flushes and Compactions
[Chart: further flushes; new mutations of the existing row land in each new file]
44
Writes: Flushes and Compactions
[Chart: 1st compaction (major, auto-promoted) rewrites the cells of both rows into one file]
45
Writes: Flushes and Compactions
[Chart: flushing resumes; the existing row’s mutations spread into new files again]
46
Writes: Flushes and Compactions
[Chart: more flushes; the existing row’s cells fragment further]
47
Skip forward again...
48
Writes: Flushes and Compactions
[Chart: after many cycles, the existing row’s cells are scattered across many store files, while the unique row’s cells sit in a single file]
49
Source: http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
50
Compaction Summary
• Compaction tuning is important • Do not be too aggressive or write amplification is noticeable under load
• Use timestamps/time ranges in Get/Scan to limit the files read
Ratio  Effect
1.0    Dampened; causes more store files, needs to be combined with effective Bloom filter usage (non-random keys)
1.2    Default value, a moderate setting
1.4    More aggressive; keeps the number of files low, causes more auto-promoted major compactions to occur
51
Interlude II: Bloom Filters
Call me maybe, baby?
52
Background on Bloom Filters
53
Background on Bloom Filters
• Bit array of m bits, and k hash functions • HBase uses hash folding
• Returns “No” or “Maybe” only • Error rate is tunable, usually about 1% • At a 1% error rate with optimal k, about 9.6 bits per key are needed
[Diagram: example Bloom filter with m=18, k=3]
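The 9.6 bits-per-key figure follows from the standard Bloom filter sizing formulas, m/n = -ln(p) / (ln 2)^2 and k = (m/n) * ln 2; a quick check:

  public class BloomSizing {
    public static void main(String[] args) {
      double p = 0.01; // target false-positive ("maybe") rate
      double bitsPerKey = -Math.log(p) / (Math.log(2) * Math.log(2)); // m/n
      double k = bitsPerKey * Math.log(2); // optimal number of hash functions
      System.out.printf("bits/key = %.1f, k = %.1f (use %d)%n",
          bitsPerKey, k, Math.round(k));
      // prints: bits/key = 9.6, k = 6.6 (use 7)
    }
  }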
54
Seeking with Bloom Filters
55
Read Time Series Entry
• Event record is written once and never deleted or updated
• Keeps the entire record in a specific location in the storage files
• Use a time range to indicate what is needed • {Get|Scan}.setTimeRange() • Helps the system skip unnecessary (older) files (see the sketch below)
• Bloom Filter helps for given row key(s) and column qualifiers
• Can skip files not containing requested details
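A minimal sketch of the time-range hint mentioned above; the method names and row key are illustrative:

  import java.io.IOException;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Scan;

  public class TimeRangeReads {
    // Limiting a read to [t1, t2) lets HBase prune store files whose
    // timestamp range lies entirely outside the window.
    static Get eventGet(byte[] rowKey, long t1, long t2) throws IOException {
      Get get = new Get(rowKey);
      get.setTimeRange(t1, t2);
      return get;
    }

    static Scan eventScan(long t1, long t2) throws IOException {
      Scan scan = new Scan();
      scan.setTimeRange(t1, t2);
      return scan;
    }
  }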
56
Writes: Flushes and Compactions
[Chart: the same store file layout; a single 64 KB block read suffices]
Bloom filter and/or time range eliminates all other store files.
57
Read Updateable Entity
• Data is updated regularly, aging out at intervals • Reading an entity needs to read all details to reconstitute the current state
• Deletes mask out attributes • Updates override (or complement) attributes
• Bloom filters will have a hard time saying “no” since most files might contain entity attributes
• A time filter on scans or gets also has few options to skip files, since older attributes might still be important
58
Writes: Flushes and Compactions
[Chart: the Bloom filter answers “yes” for all but two store files; 7+ block loads (64 KB each) are needed]
59
Bloom Filter Options
There are three choices: • NONE Duh! Use this when the Bloom filter is not useful based on the use-case (default setting)
• ROW Indexes only the row key; needs one entry per row key in the Bloom filter
• ROWCOL Indexes row and column key; requires an entry in the filter for every column cell (KeyValue)
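Bloom filters are chosen per column family; a sketch using the current client API (which postdates the talk), with the family name "m" as an illustrative assumption:

  import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
  import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
  import org.apache.hadoop.hbase.regionserver.BloomType;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BloomChoice {
    // Pick ROW for row-level lookups; ROWCOL only pays off when single
    // columns are requested, at the cost of a much larger filter.
    static ColumnFamilyDescriptor messageFamily() {
      return ColumnFamilyDescriptorBuilder
          .newBuilder(Bytes.toBytes("m"))
          .setBloomFilterType(BloomType.ROW) // or BloomType.ROWCOL / BloomType.NONE
          .build();
    }
  }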
60
How to decide?
61
Bloom Filter Summary
• They help a lot, but not always • Highly depends on write patterns
• Keep an eye on size, since they are cached • HFile v2 helps here as it only loads root index info
“Bloom filters can get as large as 100 MB per HFile, which adds up to 2 GB when aggregated over 20 regions. Block indexes can grow as large as 6 GB in aggregate size over the same set of regions.”
Source: http://hbase.apache.org/book/hfilev2.html
62
Interlude III: Write-ahead Log
The lonesome writer tale.
63
Write-ahead Log - Data Flow
64
Write-ahead Log - Overview
• One file per Region Server • All regions have a reference to this file
• Actually a wrapper around the physical file • The file is in the end a Hadoop SequenceFile
• Stored in HDFS so it can be recovered after a server failure
• There is a synchronization barrier that impacts all parallel writers, aka clients
• Overall performance is BAD, maybe 10 MB/s
65
Write-ahead Log - Workarounds
• Enable log compression: hbase.regionserver.wal.enablecompression
• Disable the WAL for secondary records (see the sketch below) • Restore indexes or derived records from the main one • But be careful with the coprocessor hook, as it cannot access the currently replaying region
• Work on upstream JIRAs • Multiple logs per server • Fix the single-writer issue in HDFS
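A sketch of the "disable the WAL for secondary records" workaround via per-mutation durability; the index table layout here is hypothetical:

  import org.apache.hadoop.hbase.client.Durability;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SecondaryIndexPut {
    // Skipping the WAL avoids the shared synchronization barrier, but the
    // row is lost on server failure and must be rebuilt from the primary record.
    static Put indexPut(byte[] indexRowKey, byte[] primaryRowKey) {
      Put put = new Put(indexRowKey);
      put.setDurability(Durability.SKIP_WAL); // no WAL entry for this mutation
      put.addColumn(Bytes.toBytes("i"), Bytes.toBytes("ref"), primaryRowKey);
      return put;
    }
  }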
66
Back to the main theme...
Yes, message stores.
67
Schema
• Every row is an inbox • Indexes as CFs or separate tables
• Random updates and inserts cause storage file churn • Facebook used more than 4 or 5 schema iterations
• Not really representative: pure blob storage • Evolved over time to be more HBase-like
• Another customer iterated over various schemas at about the same time
• Difficult to keep indexes up to date
68
Facebook Messages
An interesting use-case…
69
Facebook Messages - Statistics
Source: HBaseCon 2012 - Anshuman Singh
70
71
72
Schema 1
73
Notes on Facebook Schema 1
This is basically the same as the NameNode, i.e. the application only writes edits and those are merged with a snapshot of the data. The application does not use HBase as an operational store; all data is cached in memory. It occasionally writes large chunks, and reads only a few times to merge or recover.
74
Notes on Facebook Schema 1
Three column families: • Snapshot, Actions, Keywords
Settings changes: • DFS Block Size: 256 MB
• Since large KVs are written • Efficiency of the HFile block index is a concern
• Compaction ratio: 1.4 • Be more aggressive to clean up files
• Split Size: 2 TB • Manage splitting manually
• Major Compactions: 3 days
75
Schema 2
76
Notes on Facebook Schema 2
• Eight column families • Snapshots per thread (user to user)
Settings changes: • Block Cache Size: 55%
• Cache more data on the HBase side • Blocking Store Files: 25
• Allow more files to be around • Compaction Min Size: 4 MB
• Reduce the number of unconditionally selected files • Major Compactions: 14 days
77
Schema 3
78
Notes on Facebook Schema 3
• Eleven column families • Twenty regions per server • One hundred servers per cluster
Settings changes: • Block Cache Size: 60%
• Cache more data on the HBase side
• Region Slop: 5% (from 20%) • Keep strict boundaries on regions per server
79
80
Note the imbalance! Recall that flushes are interconnected and cause compaction storms.
81
FB Messages Summary
• Triggered many changes in HBase: • Change compaction selection algorithm • Upper bounds on file sizes • Pools for small and large compactions • Online schema changes • Finer-grained metrics • Lazy seeking in files • Point-seek optimizations • …
82
FB Messages Summary
• Went from “Snapshot” to a more proper schema • Needed to wait for the schema to settle • Could sustain warped load for a while • Eventually uses HBase more as a KV store
• Tweaked settings depending on schema • Tuned compactions from aggressive to relaxed • Changed block sizes to fit KV sizes
• Strict limit on I/O • 100 servers • 20 regions per server • 50 million users per cluster
83
Use-Case II: Time Series Database
84
Events make big data big
• The majority of use-cases deal with event-based data • Especially at the HDFS and MapReduce level
• Machine Scale vs. Human Scale • An event has attributes
• Type • Identifier • Actor • Other attributes
85
Events contd.
• Accessing event data • Give me everything about event e_id1 • Give me everything in [t1,t2] • Give me everything for event type e_t1 in [t1,t2] • Give me everything for actor a1 in [t1,t2] • Give me everything for event type e_t1 by actor a1 in [t1,t2]
• Aggregate based on some parameters (like above) and report
• Find events that match some other given criteria
86
HBase and Time Series
• Access patterns suited for HBase • Random access to event data or aggregate data • Serving… not real-time computing (that’s Impala)
• Schema design is the tricky thing • OpenTSDB does this well (but limited) • Key principle:
• Collocate data you want to read together • Spread out as much as possible at write time • These two conflict in a lot of cases, so you decide on the trade-off (see the sketch below)
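One way to see the conflict is an OpenTSDB-flavored row key sketch: putting the metric first collocates one series for fast range scans, while a leading salt byte (an illustrative assumption, not the talk's design) spreads writes across regions:

  import org.apache.hadoop.hbase.util.Bytes;

  public class TsRowKey {
    static final int BUCKETS = 8; // hypothetical salt bucket count

    // Layout: [salt][metricId][baseTimestamp]. The metric id keeps a series'
    // points adjacent for range scans; the salt spreads concurrent writes
    // over BUCKETS regions, at the cost of BUCKETS parallel scans on read.
    static byte[] rowKey(int metricId, long epochSeconds) {
      long baseTs = epochSeconds - (epochSeconds % 3600); // hour-aligned base
      byte salt = (byte) (metricId % BUCKETS);
      return Bytes.add(new byte[] { salt },
          Bytes.toBytes(metricId), Bytes.toBytes(baseTs));
    }
  }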
87
Time Series design patterns
• Ingest • Flume or direct writing via app
• HDFS • Batch queries in Hive • Faster queries in Impala • Not for real-time user serving
• HBase • Serve individual events (OpenTSDB) • Serve pre-computed aggregates (OpenTSDB, FB Insights)
• Solr • To make individual events searchable
88
Time Series design patterns
• Land data in HDFS and HBase • Aggregate in HDFS and write to HBase
• HBase can do some aggregates too (counters; see the sketch below)
• Keep serve-able data in HBase, then discard (TTL) • Keep all data in HDFS for future use
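A sketch of the counter-based aggregates mentioned above; the table and row layout are hypothetical:

  import java.io.IOException;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class EventCounters {
    // One atomic, server-side increment per event keeps a per-hour total
    // current without a client-side read-modify-write cycle.
    static void countEvent(Connection conn, String eventType, long hourBucket)
        throws IOException {
      try (Table table = conn.getTable(TableName.valueOf("event_counts"))) {
        byte[] row = Bytes.toBytes(eventType + ":" + hourBucket);
        table.incrementColumnValue(row, Bytes.toBytes("c"), Bytes.toBytes("n"), 1L);
      }
    }
  }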
89
The story with only HBase
• Landing destination • Aggregates via counters • Serving end users • Event -> Flume/App -> HBase
• Raw entry in HBase for exact value • Multiple counter increments for aggregates
• OSS implementation: OpenTSDB
90
Overall Summary
91
Applications in HBase
Requires working with schema peculiarities and implementation idiosyncrasies. It is important to compute the write rate and un-optimize the schema to fit the given hardware. If hardware is no issue, then the optimum is achievable. Trifecta of good performance: compactions, Bloom filters, and key design (but also look out for memstore and block cache settings).
92
Questions?