HBase Applications - Atlanta HUG - May 2014
DESCRIPTION
HBase is good at various workloads, ranging from sequential range scans to purely random access. These access patterns can be translated into application types, usually falling into two major groups: entities and events. This presentation discusses the underlying implications and how to approach those use-cases. Examples taken from Facebook show how this has been tackled in real life.
TRANSCRIPT
1
HBase Applications - Selected Use-Cases around a Common Theme. Atlanta HUG, May 2014. Lars George, Cloudera EMEA Chief Architect
2
About Me
• EMEA Chief Architect @ Cloudera • Consulting on Hadoop projects (everywhere)
• Apache Committer • HBase and Whirr
• O’Reilly Author • HBase – The Definitive Guide
• Now in Japanese!
• Contact • [email protected] • @larsgeorge
The Japanese edition is out now!
3
The Content...
• HBase - Strengths and weaknesses • Common use-cases and patterns • Focus on specific types of applications • Summary
4
HBase Strengths and Weaknesses
5
IOPS vs Throughput Mythbusters
It is all physics in the end: you cannot solve an I/O problem without reducing I/O in general. Parallelize access and read/write sequentially.
6
HBase: Strengths & Weaknesses
Strengths: • Random access to small(ish) key-value pairs • Rows and columns stored sorted lexicographically • Adds table and region concepts to group related KVs • Stores and reads data sequentially • Parallelizes across all clients
• Non-blocking I/O throughout
7
HBase: Strengths & Weaknesses
Weaknesses: • Not optimized (yet) for 100% of the possible throughput of the underlying storage layer
• And HDFS is not fully optimized either
• Single writer issue with WALs • Single server hot-spotting with non-distributed keys
8
Patterns
• There are recurring patterns in many common use-cases, similar to programming patterns.
• We need to extract these common patterns and make them repeatable.
• Similar to the “Gang of Four” (Gamma, Helm, Johnson, Vlissides), or the “Three Amigos” (Booch, Jacobson, Rumbaugh)
9
Common Patterns
10
HBase Dilemma
Although HBase can host many applications, they may require completely opposite features
[Diagram: Events vs. Entities, exemplified by Time Series vs. Message Store]
11
This talk (at this event)
• Message Store • Information exchange between entities • Sending/receiving information is an event
• Time-Series • Sequence of data points measured at successive points in time, spaced at uniform intervals
• Measuring a data point is an event
12
Using HBase Strengths
13
HBase “Indexes” (cont.)
• Use primary keys, aka the row keys, as sorted index • One sort direction only • Use “secondary index” to get reverse sorting (see the sketch below)
• Lookup table or same table
• Use secondary keys, aka the column qualifiers, as sorted index within main record
• Use prefixes within a column family or separate column families
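A quick illustration of the one-direction sort: a common workaround is to invert the timestamp inside the row key so the newest entries sort first. This is a minimal sketch, not taken from the talk; the inbox layout and userId field are hypothetical.

  import org.apache.hadoop.hbase.util.Bytes;

  public class InvertedKeySketch {
    // HBase sorts row keys in ascending order only. Storing an inverted
    // timestamp in the key makes newer entries sort first, so a plain scan
    // from the user prefix returns the newest messages without a reverse scan.
    static byte[] inboxRowKey(String userId, long epochMillis) {
      long inverted = Long.MAX_VALUE - epochMillis; // newer => smaller => sorts first
      return Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(inverted));
    }
  }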
14
Common Use-Cases
15
Use-Case I: Messages
16
HBase Message Store
Use-Case: • Store incoming messages in HBase, such as email, SMS, MMS, IM
• Constant updates of existing entities • e.g. email read, flagged, starred, moved, deleted
• Reading of top-N entries, sorted by time • Newest 20 messages, last 20 conversations
• Examples: • Facebook Messages
17
Problem Description
• Records are of varying size • Large ones hinder smaller ones
• Massive index issue • User can sort and filter by everything • At the same time, reading top-N should be fast • But what to do for automated accounts? 80/20 rule? • Only doable with heuristics
• Only create minimal indexes • Create additional ones when the user asks for them
• Cross-mailbox issues with Conversations • Similar to the timeline in Facebook
• Overall requirements for I/O
18
Interlude I: Compaction Details
Write Amplification in HBase
19
Compactions in HBase
• Must happen to keep data in check • Combine small flush files into larger ones • Remove old data (during major compactions)
• Two types: Minor and Major Compactions • Minor are triggered with API mutation calls • Major are time scheduled (or auto-promoted) • Both can be triggered manually if needed (see the sketch below)
• Add extra background I/O that grows over time • Write amplification!
• Have to be tuned for write-heavy systems
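As a side note, a sketch of triggering compactions manually through the client Admin API; the Connection-based API shown here is newer than parts of this talk, and the table name is illustrative:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;

  public class ManualCompaction {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Admin admin = conn.getAdmin()) {
        TableName table = TableName.valueOf("messages"); // illustrative table name
        admin.compact(table);      // queue a minor compaction for all its regions
        admin.majorCompact(table); // queue a major compaction (rewrites all files)
      }
    }
  }

The HBase shell offers the same through its compact and major_compact commands.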
20
Writes: Flushes and Compactions
[Chart: store file sizes (MB, 0-1000) over time, older to newer; a memstore flush creates HF1]
hbase.hregion.memstore.flush.size = 128MB
21
Writes: Flushes and Compactions
[Chart: a second flush adds HF2 alongside HF1]
22
Writes: Flushes and Compactions
[Chart: a third flush adds HF3 alongside HF2 and HF1]
hbase.hstore.compaction.min = 3 (named hbase.hstore.compactionThreshold in 0.90)
hbase.hstore.compaction.max = 10
23
Writes: Flushes and Compactions
[Chart: 1st compaction (major, auto-promoted) merges the three flush files into CF1]
24
Writes: Flushes and Compactions
[Chart: a new flush adds HF4 next to the compacted file CF1]
25
Writes: Flushes and Compactions
[Chart: another flush adds HF5 alongside HF4 and CF1]
26
Writes: Flushes and Compactions
[Chart: a further flush adds HF6 alongside HF5, HF4, and CF1]
27
Writes: Flushes and Compactions
[Chart: HF4-HF6 and CF1; compaction file selection begins]
hbase.hstore.compaction.ratio = 1.2
hbase.hstore.compaction.min.size = flush size
28
Writes: Flushes and Compactions
[Chart: HF4-HF6 are selected; CF1 falls outside the 120% ratio and is excluded]
hbase.hstore.compaction.ratio = 1.2 (i.e. 120%)
29
Writes: Flushes and Compactions
[Chart: 2nd compaction (major, auto-promoted) merges the selected files into CF2]
30
Writes: Flushes and Compactions
[Chart: a new flush adds HF7 alongside CF2]
31
Writes: Flushes and Compactions
[Chart: HF8 is flushed alongside HF7 and CF2]
32
Writes: Flushes and Compactions
[Chart: HF9 is flushed alongside HF8, HF7, and CF2]
33
Writes: Flushes and Compactions
[Chart: HF10 is flushed alongside HF9, HF8, HF7, and CF2]
34
Writes: Flushes and Compactions
[Chart: selection eliminates files from older to newer until the remaining set is within the ratio]
hbase.hstore.compaction.ratio = 1.2 (i.e. 120%)
35
Writes: Flushes and Compactions
[Chart: 3rd compaction merges the selected files into CF3 alongside CF2]
36
Fast Forward...
37
Writes: Flushes and Compactions
[Chart: after many more flush/compaction cycles, a mix of small and large store files has accumulated]
38
Additional Notes #1
There are a few more settings for compactions: • hbase.hstore.compaction.max = 10 Limits the maximum number of files per compaction
• hbase.hstore.compaction.max.size = Long.MAX_VALUE Excludes files larger than this setting (0.92+)
• hbase.hregion.majorcompaction = 1d Schedules major compactions
39
Additional Notes #2
• hbase.hstore.compaction.kv.max = 10 Limits internal scanner caching while reading the files to be compacted
• hbase.hstore.blockingStoreFiles = 7 Enforces an upper limit on store files so compactions can catch up; blocks user operations!
• hbase.hstore.blockingWaitTime = 90s Upper limit on how long user operations are blocked
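A sketch of how the knobs from the two notes above map onto configuration properties. These are server-side settings that belong in the region server's hbase-site.xml; the Java form below only illustrates the names, types, and default values quoted in the slides:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class CompactionSettings {
    static Configuration withTalkDefaults() {
      Configuration conf = HBaseConfiguration.create();
      conf.setInt("hbase.hstore.compaction.max", 10);                    // files per compaction
      conf.setLong("hbase.hstore.compaction.max.size", Long.MAX_VALUE); // skip larger files
      conf.setLong("hbase.hregion.majorcompaction", 24L * 3600 * 1000); // 1 day, in ms
      conf.setInt("hbase.hstore.compaction.kv.max", 10);                 // compaction scanner batch
      conf.setInt("hbase.hstore.blockingStoreFiles", 7);                 // block writes above this
      conf.setInt("hbase.hstore.blockingWaitTime", 90 * 1000);           // max block time, in ms
      return conf;
    }
  }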
40
Write Fragmentation
Yo, where’s the data at?
41
Writes: Flushes and Compactions
[Chart: store file sizes (MB) over time; legend: Existing Row Mutations vs. Unique Row Inserts]
We are looking at two specific rows: one is never changed, the other frequently.
42
Writes: Flushes and Compactions
[Chart: the first flush file holds cells of both rows]
43
Writes: Flushes and Compactions
[Chart: further flushes; new mutations of the existing row land in each new file]
44
Writes: Flushes and Compactions
[Chart: 1st compaction (major, auto-promoted) rewrites the cells of both rows into one file]
45
Writes: Flushes and Compactions
[Chart: flushing resumes; the existing row’s mutations spread into new files again]
46
Writes: Flushes and Compactions
[Chart: more flushes; the existing row’s cells fragment further]
47
Skip forward again...
48
Writes: Flushes and Compactions
[Chart: after many cycles, the existing row’s cells are scattered across many store files, while the unique row’s cells sit in a single file]
49
Source: http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
50
Compaction Summary
• Compaction tuning is important • Do not be too aggressive or write amplification is noticeable under load
• Use timestamps/time ranges in Get/Scan to limit the files read
Ratio  Effect
1.0    Dampened; causes more store files, needs to be combined with effective Bloom filter usage (non-random keys)
1.2    Default value, a moderate setting
1.4    More aggressive; keeps the number of files low, causes more auto-promoted major compactions to occur
51
Interlude II: Bloom Filters
Call me maybe, baby?
52
Background on Bloom Filters
53
Background on Bloom Filters
• Bit array of m bits, and k hash functions • HBase uses hash folding
• Returns “No” or “Maybe” only • Error rate is tunable, usually about 1% • At a 1% error rate with optimal k, about 9.6 bits per key are needed
[Diagram: example Bloom filter with m=18, k=3]
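The 9.6 bits-per-key figure follows from the standard Bloom filter sizing formulas, m/n = -ln(p) / (ln 2)^2 and k = (m/n) * ln 2; a quick check:

  public class BloomSizing {
    public static void main(String[] args) {
      double p = 0.01; // target false-positive ("maybe") rate
      double bitsPerKey = -Math.log(p) / (Math.log(2) * Math.log(2)); // m/n
      double k = bitsPerKey * Math.log(2); // optimal number of hash functions
      System.out.printf("bits/key = %.1f, k = %.1f (use %d)%n",
          bitsPerKey, k, Math.round(k));
      // prints: bits/key = 9.6, k = 6.6 (use 7)
    }
  }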
54
Seeking with Bloom Filters
55
Read Time Series Entry
• Event record is written once and never deleted or updated
• Keeps the entire record in a specific location in the storage files
• Use a time range to indicate what is needed • {Get|Scan}.setTimeRange() • Helps the system skip unnecessary (older) files (see the sketch below)
• Bloom Filter helps for given row key(s) and column qualifiers
• Can skip files not containing requested details
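A minimal sketch of the time-range hint mentioned above; the method names and row key are illustrative:

  import java.io.IOException;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Scan;

  public class TimeRangeReads {
    // Limiting a read to [t1, t2) lets HBase prune store files whose
    // timestamp range lies entirely outside the window.
    static Get eventGet(byte[] rowKey, long t1, long t2) throws IOException {
      Get get = new Get(rowKey);
      get.setTimeRange(t1, t2);
      return get;
    }

    static Scan eventScan(long t1, long t2) throws IOException {
      Scan scan = new Scan();
      scan.setTimeRange(t1, t2);
      return scan;
    }
  }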
56
Writes: Flushes and Compactions
[Chart: the same store file layout; a single 64 KB block read suffices]
Bloom filter and/or time range eliminates all other store files.
57
Read Updateable Entity
• Data is updated regularly, aging out at intervals • Reading an entity needs to read all details to reconstitute the current state
• Deletes mask out attributes • Updates override (or complement) attributes
• Bloom filters will have a hard time saying “no” since most files might contain entity attributes
• A time filter on scans or gets also has few options to skip files, since older attributes might still be important
58
Writes: Flushes and Compactions
[Chart: the Bloom filter answers “yes” for all but two store files; 7+ block loads (64 KB each) are needed]
59
Bloom Filter Options
There are three choices: • NONE Duh! Use this when the Bloom filter is not useful based on the use-case (default setting)
• ROW Indexes only the row key; needs one entry per row key in the Bloom filter
• ROWCOL Indexes row and column key; requires an entry in the filter for every column cell (KeyValue)
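Bloom filters are chosen per column family; a sketch using the current client API (which postdates the talk), with the family name "m" as an illustrative assumption:

  import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
  import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
  import org.apache.hadoop.hbase.regionserver.BloomType;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BloomChoice {
    // Pick ROW for row-level lookups; ROWCOL only pays off when single
    // columns are requested, at the cost of a much larger filter.
    static ColumnFamilyDescriptor messageFamily() {
      return ColumnFamilyDescriptorBuilder
          .newBuilder(Bytes.toBytes("m"))
          .setBloomFilterType(BloomType.ROW) // or BloomType.ROWCOL / BloomType.NONE
          .build();
    }
  }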
60
How to decide?
61
Bloom Filter Summary
• They help a lot, but not always • Highly depends on write patterns
• Keep an eye on size, since they are cached • HFile v2 helps here as it only loads root index info
“Bloom filters can get as large as 100 MB per HFile, which adds up to 2 GB when aggregated over 20 regions. Block indexes can grow as large as 6 GB in aggregate size over the same set of regions.”
Source: http://hbase.apache.org/book/hfilev2.html
62
Interlude III: Write-ahead Log
The lonesome writer tale.
63
Write-ahead Log - Data Flow
64
Write-ahead Log - Overview
• One file per Region Server • All regions have a reference to this file
• Actually a wrapper around the physical file • The file is in the end a Hadoop SequenceFile
• Stored in HDFS so it can be recovered after a server failure
• There is a synchronization barrier that impacts all parallel writers, aka clients
• Overall performance is BAD, maybe 10 MB/s
65
Write-ahead Log - Workarounds
• Enable log compression: hbase.regionserver.wal.enablecompression
• Disable the WAL for secondary records (see the sketch below) • Restore indexes or derived records from the main one • But be careful with the coprocessor hook, as it cannot access the currently replaying region
• Work on upstream JIRAs • Multiple logs per server • Fix the single-writer issue in HDFS
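A sketch of the "disable the WAL for secondary records" workaround via per-mutation durability; the index table layout here is hypothetical:

  import org.apache.hadoop.hbase.client.Durability;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SecondaryIndexPut {
    // Skipping the WAL avoids the shared synchronization barrier, but the
    // row is lost on server failure and must be rebuilt from the primary record.
    static Put indexPut(byte[] indexRowKey, byte[] primaryRowKey) {
      Put put = new Put(indexRowKey);
      put.setDurability(Durability.SKIP_WAL); // no WAL entry for this mutation
      put.addColumn(Bytes.toBytes("i"), Bytes.toBytes("ref"), primaryRowKey);
      return put;
    }
  }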
66
Back to the main theme...
Yes, message stores.
67
Schema
• Every row is an inbox • Indexes as CFs or separate tables
• Random updates and inserts cause storage file churn • Facebook used more than 4 or 5 schema iterations
• Not really representative: pure blob storage • Evolved over time to be more HBase-like
• Another customer iterated over various schemas at about the same time
• Difficult to keep indexes up to date
68
Facebook Messages
An interesting use-case…
69
Facebook Messages - Statistics
Source: HBaseCon 2012 - Anshuman Singh
70
71
72
Schema 1
73
Notes on Facebook Schema 1
This is basically the same as the NameNode, i.e. the application only writes edits and those are merged with a snapshot of the data. The application does not use HBase as an operational store; all data is cached in memory. It occasionally writes large chunks, and reads only a few times to merge or recover.
74
Notes on Facebook Schema 1
Three column families: • Snapshot, Actions, Keywords
Settings changes: • DFS Block Size: 256 MB
• Since large KVs are written • Efficiency of the HFile block index is a concern
• Compaction ratio: 1.4 • Be more aggressive to clean up files
• Split Size: 2 TB • Manage splitting manually
• Major Compactions: 3 days
75
Schema 2
76
Notes on Facebook Schema 2
• Eight column families • Snapshots per thread (user to user)
Settings changes: • Block Cache Size: 55%
• Cache more data on the HBase side • Blocking Store Files: 25
• Allow more files to be around • Compaction Min Size: 4 MB
• Reduce the number of unconditionally selected files • Major Compactions: 14 days
77
Schema 3
78
Notes on Facebook Schema 3
• Eleven column families • Twenty regions per server • One hundred servers per cluster
Settings changes: • Block Cache Size: 60%
• Cache more data on the HBase side
• Region Slop: 5% (from 20%) • Keep strict boundaries on regions per server
79
80
Note the imbalance! Recall that flushes are interconnected and cause compaction storms.
81
FB Messages Summary
• Triggered many changes in HBase: • Change compaction selection algorithm • Upper bounds on file sizes • Pools for small and large compactions • Online schema changes • Finer-grained metrics • Lazy seeking in files • Point-seek optimizations • …
82
FB Messages Summary
• Went from “Snapshot” to a more proper schema • Needed to wait for the schema to settle • Could sustain warped load for a while • Eventually uses HBase more as a KV store
• Tweaked settings depending on schema • Tuned compactions from aggressive to relaxed • Changed block sizes to fit KV sizes
• Strict limit on I/O • 100 servers • 20 regions per server • 50 million users per cluster
83
Use-Case II: Time Series Database
84
Events make big data big
• The majority of use-cases deal with event-based data • Especially at the HDFS and MapReduce level
• Machine Scale vs. Human Scale • An event has attributes
• Type • Identifier • Actor • Other attributes
85
Events contd.
• Accessing event data • Give me everything about event e_id1 • Give me everything in [t1,t2] • Give me everything for event type e_t1 in [t1,t2] • Give me everything for actor a1 in [t1,t2] • Give me everything for event type e_t1 by actor a1 in [t1,t2]
• Aggregate based on some parameters (like above) and report
• Find events that match some other given criteria
86
HBase and Time Series
• Access patterns suited for HBase • Random access to event data or aggregate data • Serving… not real-time computing (that’s Impala)
• Schema design is the tricky thing • OpenTSDB does this well (but limited) • Key principle:
• Collocate data you want to read together • Spread out as much as possible at write time • These two conflict in a lot of cases, so you decide on the trade-off (see the sketch below)
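One way to see the conflict is an OpenTSDB-flavored row key sketch: putting the metric first collocates one series for fast range scans, while a leading salt byte (an illustrative assumption, not the talk's design) spreads writes across regions:

  import org.apache.hadoop.hbase.util.Bytes;

  public class TsRowKey {
    static final int BUCKETS = 8; // hypothetical salt bucket count

    // Layout: [salt][metricId][baseTimestamp]. The metric id keeps a series'
    // points adjacent for range scans; the salt spreads concurrent writes
    // over BUCKETS regions, at the cost of BUCKETS parallel scans on read.
    static byte[] rowKey(int metricId, long epochSeconds) {
      long baseTs = epochSeconds - (epochSeconds % 3600); // hour-aligned base
      byte salt = (byte) (metricId % BUCKETS);
      return Bytes.add(new byte[] { salt },
          Bytes.toBytes(metricId), Bytes.toBytes(baseTs));
    }
  }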
87
Time Series design patterns
• Ingest • Flume or direct writing via app
• HDFS • Batch queries in Hive • Faster queries in Impala • Not for real-time user serving
• HBase • Serve individual events (OpenTSDB) • Serve pre-computed aggregates (OpenTSDB, FB Insights)
• Solr • To make individual events searchable
88
Time Series design patterns
• Land data in HDFS and HBase • Aggregate in HDFS and write to HBase
• HBase can do some aggregates too (counters; see the sketch below)
• Keep serve-able data in HBase, then discard (TTL) • Keep all data in HDFS for future use
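A sketch of the counter-based aggregates mentioned above; the table and row layout are hypothetical:

  import java.io.IOException;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class EventCounters {
    // One atomic, server-side increment per event keeps a per-hour total
    // current without a client-side read-modify-write cycle.
    static void countEvent(Connection conn, String eventType, long hourBucket)
        throws IOException {
      try (Table table = conn.getTable(TableName.valueOf("event_counts"))) {
        byte[] row = Bytes.toBytes(eventType + ":" + hourBucket);
        table.incrementColumnValue(row, Bytes.toBytes("c"), Bytes.toBytes("n"), 1L);
      }
    }
  }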
89
The story with only HBase
• Landing destination • Aggregates via counters • Serving end users • Event -> Flume/App -> HBase
• Raw entry in HBase for exact value • Multiple counter increments for aggregates
• OSS implementation: OpenTSDB
90
Overall Summary
91
Applications in HBase
Requires working with schema peculiarities and implementation idiosyncrasies. It is important to compute the write rate and un-optimize the schema to fit the given hardware. If hardware is no issue, then the optimum is achievable. Trifecta of good performance: compactions, Bloom filters, and key design (but also look out for memstore and block cache settings).
92
Questions?