building apps with hbase - data days texas march 2013

106
1 Headline Goes Here Speaker Name or Subhead Goes Here Building applica2ons with HBase Amandeep Khurana | Solu7ons Architect Data Days Texas, Aus2n, March 2013 Monday, April 1, 13

Upload: amansk

Post on 08-May-2015

1.423 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Building apps with HBase - Data Days Texas March 2013

1

Headline  Goes  HereSpeaker  Name  or  Subhead  Goes  Here

Building  applica2ons  with  HBaseAmandeep  Khurana  |  Solu7ons  ArchitectData  Days  Texas,  Aus2n,  March  2013

Monday, April 1, 13

Page 2: Building apps with HBase - Data Days Texas March 2013

About  me

• Solu7ons  Architect,  Cloudera  Inc• Amazon  Web  Services• Interested  in  large  scale  distributed  systems• Co-­‐author,  HBase  In  Ac7on• TwiGer:  amansk

2

Monday, April 1, 13

Page 3: Building apps with HBase - Data Days Texas March 2013

I’m  a  wanna-­‐be  (author)

www.hbaseinac7on.com

3

M A N N I N G

Nick DimidukAmandeep Khurana

Monday, April 1, 13

Page 4: Building apps with HBase - Data Days Texas March 2013

About  the  workshop

• Install  HBase• Basic  interac7ons  with  the  shell• Use  of  HBase  Java  API• Data  model• HBase  and  Hadoop  integra7on

4

Monday, April 1, 13

Page 5: Building apps with HBase - Data Days Texas March 2013

Get  HBase  and  tutorial  instruc7ons

• Get  it  here  -­‐>  hGp://hbase.apache.org/• Downloads  -­‐>  Choose  a  mirror  -­‐>  0.94.1• Download  the  file  hbase-­‐0.94.1.tar.gz

• Get  the  tutorial  handout  -­‐>  hGp://people.apache.org/~ak/ddtx13

5

Monday, April 1, 13

Page 6: Building apps with HBase - Data Days Texas March 2013

6

...  except  another  marke7ng  term

What’s  Big  Data?

Monday, April 1, 13

Page 7: Building apps with HBase - Data Days Texas March 2013

Data  management

7

Monday, April 1, 13

Page 8: Building apps with HBase - Data Days Texas March 2013

Data  has  changed

8

DAT

A G

RO

WTH

END-USER APPLICATIONS THE INTERNET

MOBILE DEVICES SOPHISTICATED

MACHINES

STRUCTURED DATA – 10%

1980 2012

UNSTRUCTURED DATA – 90%

Monday, April 1, 13

Page 9: Building apps with HBase - Data Days Texas March 2013

“Big  Data”

9

Monday, April 1, 13

Page 10: Building apps with HBase - Data Days Texas March 2013

Hadoop  -­‐  Big  Data  poster  child

10

HDFS, MAPREDUCE

Monday, April 1, 13

Page 11: Building apps with HBase - Data Days Texas March 2013

Hadoop  -­‐  Big  Data  poster  child

10

Coordination

Data Integration Fast Read/Write Access

Languages / Compilers

Workflow Scheduling Metadata

APACHE ZOOKEEPER

APACHE FLUME, APACHE SQOOP APACHE HBASE

APACHE PIG, APACHE HIVE, APACHE MAHOUT

APACHE OOZIE APACHE OOZIE APACHE HIVE

File System Mount UI Framework Machine Learning FUSE-DFS HUE APACHE MAHOUT

HDFS, MAPREDUCE

Monday, April 1, 13

Page 12: Building apps with HBase - Data Days Texas March 2013

What  is  HBase?

• Scalable,  distributed  data  store• Sorted  map  of  maps• Open  source  avatar  of  Google’s  Bigtable• Sparse• Mul7  dimensional• Tightly  integrated  with  Hadoop• Not  a  RDBMS

11

Monday, April 1, 13

Page 13: Building apps with HBase - Data Days Texas March 2013

So,  what’s  the  big  deal?

• Give  up  system  complexity  and  sophis7cated  features  for  ease  of  scale  and  manageability

• Simple  system  !=  easy  to  build  applica7ons  on  top• Goal:  predictable  read  and  write  performance  at  scale

• Effec7ve  API  usage• Op7mal  schema  design• Understand  internals  and  leverage  at  scale

12

Monday, April 1, 13

Page 14: Building apps with HBase - Data Days Texas March 2013

Use  cases

• Canonical  use  case:  storing  crawl  data  and  indices  for  search

13

1

1 Crawlers constantly scour the Internet for new pages.Those pages are stored as individual records in Bigtable. 3

2 A MapReduce job runs over the entire table, generatingsearch indexes for the Web Search application.

4

2

5

Web Searchpowered by Bigtable

Indexing the Internet

3 The user initiates a Web Search request.

4 The Web Search application queries the Search Indexesand retries matching documents directly from Bigtable.

5 Search results are presented to the user.

Searching the Internet

BigtableInternetsCrawlers

CrawlersCrawlers

Crawlers

MapReduce

You

Search IndexSearch

IndexSearch Index

Web Search

Monday, April 1, 13

Page 15: Building apps with HBase - Data Days Texas March 2013

Use  cases  (con7nued)

• Capturing  incremental  data• Content  serving• Informa7on  exchange

14

Monday, April 1, 13

Page 16: Building apps with HBase - Data Days Texas March 2013

Ok,  enough  of  high  level  stuff

• Big  data  is  a  buzzword  but  there  is  more  to  it• New  products,  applica7ons,  types  of  data

• Hadoop  ecosystem  is  the  poster  child  of  big  data  architectures• Consists  of  several  components

• HBase  is  a  part  of  the  ecosystem• Scalable  database• Lots  of  produc7on  deployments

15

Monday, April 1, 13

Page 17: Building apps with HBase - Data Days Texas March 2013

16

Installa7on  and  basic  interac7on

HBase  -­‐  GeEng  Started

Monday, April 1, 13

Page 18: Building apps with HBase - Data Days Texas March 2013

Installing  HBase

• Get  it  here  -­‐>  hGp://hbase.apache.org/• Downloads  -­‐>  Choose  a  mirror  -­‐>  0.94.1• Download  the  file  hbase-­‐0.94.1.tar.gz• Untar  it

• mkdir  ~/ddtx• cp  hbase-­‐0.94.1.tar.gz  ~/ddtx• cd  ~/ddtx• tar  xvf  hbase-­‐0.94.1.tar.gz

17

Monday, April 1, 13

Page 19: Building apps with HBase - Data Days Texas March 2013

Basics

• Modes  of  opera2on• Standalone• Pseudo-­‐distributed• Fully  distributed

• Start  up  in  standalone  mode• cd  hbase-­‐0.94.1• bin/start-­‐hbase.sh• Logs  are  in  ./logs/

• Check  out  hSp://localhost:60010  in  your  browser

18

Monday, April 1, 13

Page 20: Building apps with HBase - Data Days Texas March 2013

Shell

• Basic  interac7on  tool• WriGen  in  JRuby• Uses  the  Java  client  library• Good  place  to  start  poking  around

19

Monday, April 1, 13

Page 21: Building apps with HBase - Data Days Texas March 2013

Shell  (con7nued)

• bin/hbase  shell• Create  table

• create  ‘users’  ,  ‘info’• List  tables

• list• Describe  table

• describe  ‘users’

20

Monday, April 1, 13

Page 22: Building apps with HBase - Data Days Texas March 2013

Shell  (con7nued)

• Put  a  row• put  ‘users’  ,  ‘u1’,  ‘info:name’  ,  ‘Mr.  Rogers’

• Get  a  row• get  ‘users’  ,  ‘u1’

• Put  more• put  ‘users’  ,  ‘u1’  ,  ‘info:email’  ,  ‘[email protected]’• put  ‘users’  ,  ‘u1’  ,  ‘info:password’  ,  ‘foobar’

• Get  a  row• get  ‘users’  ,  ‘u1’

• Scan  table• scan  ‘users’

21

Monday, April 1, 13

Page 23: Building apps with HBase - Data Days Texas March 2013

22

...  can’t  really  build  an  applica7on  just  with  the  shell

Java  API

Monday, April 1, 13

Page 24: Building apps with HBase - Data Days Texas March 2013

Important  terms• Table

• Consists  of  rows  and  columns• Row

• Has  a  bunch  of  columns.• Iden2fied  by  a  rowkey  (primary  key)

• Column  Qualifier• Dynamic  column  name

• Column  Family• Column  groups  -­‐  logical  and  physical  (Similar  access  paSern)

• Cell• The  actual  element  that  contains  the  data  for  a  row-­‐column  intersec2on

• Version• Every  cell  has  mul2ple  versions.

23

Monday, April 1, 13

Page 25: Building apps with HBase - Data Days Texas March 2013

Introducing  TwitBase

• TwiSer  clone  built  on  HBase• Designed  to  scale• It’s  an  example,  not  a  produc2on  app• Start  with  users  and  twits  for  now• Users

• Have  profiles• Post  twits

• Twits• Have  text  posted  by  users

• Get  sample  code  at:  hSps://github.com/hbaseinac2on/twitbase

24

Monday, April 1, 13

Page 26: Building apps with HBase - Data Days Texas March 2013

Connec7ng  to  HBase

• Using default confs:HTableInterface usersTable = new HTable("users");

• Passing custom conf:Configuration myConf = HBaseConfiguration.create();myConf.set("parameter_name", "parameter_value");HTableInterface usersTable = new HTable(myConf, "users");

• Connection poolingHTablePool pool = new HTablePool();HTableInterface usersTable = pool.getTable("users"); // work with the tableusersTable.close();

25

Monday, April 1, 13

Page 27: Building apps with HBase - Data Days Texas March 2013

Inser7ng  data

• Put  call  on  HTablePut p = new Put(Bytes.toBytes("TheRealMT")); p.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Mark Twain")); p.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("[email protected]")); p.add(Bytes.toBytes("info"), Bytes.toBytes("password"), Bytes.toBytes("Langhorne"));

usersTable.put(p);

26

Monday, April 1, 13

Page 28: Building apps with HBase - Data Days Texas March 2013

Data  coordinates

• Row  is  addressed  using  rowkey• Cell  is  addressed  using              [rowkey  +  family  +  qualifier]

27

Monday, April 1, 13

Page 29: Building apps with HBase - Data Days Texas March 2013

Upda7ng  data

•Update  ==  WritePut p = new Put(Bytes.toBytes("TheRealMT"));

p.add(Bytes.toBytes("info"),

Bytes.toBytes("password"),

Bytes.toBytes("abc123"));

usersTable.put(p);

28

Monday, April 1, 13

Page 30: Building apps with HBase - Data Days Texas March 2013

Write  path

29

Memstore

HFile

HFile

WAL

Write goes to the write-ahead log (WAL)as well as the in memory write buffer called

the Memstore.Client's don't interact directly with the

underlying HFiles during writes.Client

Memstore

New HFile

HFile

Memstore gets flushed to disk toform a new HFile

when filled up with writes.

HFile

Monday, April 1, 13

Page 31: Building apps with HBase - Data Days Texas March 2013

Reading  data

• Get  the  en2re  rowGet g = new Get(Bytes.toBytes("TheRealMT"));

Result r = usersTable.get(g);

• Get  selected  columnsGet g = new Get(Bytes.toBytes("TheRealMT"));

g.addColumn(Bytes.toBytes("info"),

Bytes.toBytes("password"));

Result r = usersTable.get(g);

• Extract  the  value  out  of  the  Result  objectbyte[] b = r.getValue(Bytes.toBytes("info"),

Bytes.toBytes("email"));

String email = Bytes.toString(b);

30

Monday, April 1, 13

Page 32: Building apps with HBase - Data Days Texas March 2013

Reading  data  (con7nued)

• ScanScan s = new Scan(startRow, stopRow);

ResultsScanner rs = twits.getScanner(s);

for(Result r : rs) {

// iterate over scanner results

}

• Single threaded iteration over defined set of rows. Could be the entire table too.

31

Monday, April 1, 13

Page 33: Building apps with HBase - Data Days Texas March 2013

What  happens  when  you  read  data?

32

Memstore

HFile

HFile

Client

BlockCache

Monday, April 1, 13

Page 34: Building apps with HBase - Data Days Texas March 2013

Dele7ng  data

• Another  simple  API  call• Delete  en7re  rowDelete d = new Delete(Bytes.toBytes("TheRealMT"));

usersTable.delete(d);

• Delete  selected  columnsDelete d = new Delete(Bytes.toBytes("TheRealMT"));

d.deleteColumns(Bytes.toBytes("info"),

Bytes.toBytes("email"));

usersTable.delete(d);

33

Monday, April 1, 13

Page 35: Building apps with HBase - Data Days Texas March 2013

Delete  internals

• Tombstone  markers• Dele7on  only  happens  during  major  compac7ons• Compac7ons

• Minor• Major

34

Monday, April 1, 13

Page 36: Building apps with HBase - Data Days Texas March 2013

Compac7ons

35

Memstore

HFile

HFile

Memstore

CompactedHFile

Two or more HFiles get combinedto form a single HFile during

a compaction.

HFile HFile

Monday, April 1, 13

Page 37: Building apps with HBase - Data Days Texas March 2013

Hands-­‐on:  Java  client  to  HBase

• Create  a  table  from  the  HBase  shell• create  ‘users’  ,  ‘info’

• Write  a  program  to  do  the  following• Add  10  users  to  the  ‘users’  table.

• userid  as  rowkey• user’s  name  in  the  column  ‘info:name’• userid  could  just  be  user  +  serial  number.  eg:  user1,  user2,  user3

• Get  a  user  from  the  ‘users’  table  and  display  on  screen• Scan  the  en2re  ‘users’  table  and  display  on  screen

36

Monday, April 1, 13

Page 38: Building apps with HBase - Data Days Texas March 2013

DAO

• Like  any  other  applica7on• Makes  data  access  management  easy

• Cleaner  code• Table  info  at  a  single  place

• Op7miza7ons  like  connec7on  pooling  etc• Easier  management  of  database  configs

37

Monday, April 1, 13

Page 39: Building apps with HBase - Data Days Texas March 2013

AdminHBaseAdmin

• Administra7ve  API  to  the  cluster.

• Provides  an  interface  to  manage  HBase  database  tables  metadata  and  general  administra7ve  func7ons.

Table  Opera7ons

• createTable()

• deleteTable()

• addColumn()

• modifyColumn()

• deleteColumn()

• listTables()

Monday, April 1, 13

Page 40: Building apps with HBase - Data Days Texas March 2013

39

It’s  not  a  rela7onal  database  system

Data  Model

Monday, April 1, 13

Page 41: Building apps with HBase - Data Days Texas March 2013

HBase  is  ...

• Column  family  oriented  database• Column  family  oriented• Tables  consis2ng  of  rows  and  columns

• Persisted  Map• Sparse• Mul2  dimensional• Sorted• Indexed  by  rowkey,  column  and  2mestamp

• Key  Value  store• [rowkey,  col  family,  col  qualifier,  2mestamp]  -­‐>  cell  value

40

Monday, April 1, 13

Page 42: Building apps with HBase - Data Days Texas March 2013

HBase  is  not  ...

• A  rela7onal  database• No  SQL  query  language• No  joins• No  secondary  indexing• No  transac7ons

41

Monday, April 1, 13

Page 43: Building apps with HBase - Data Days Texas March 2013

Tabular  representa7on

42

Column Family - Info

password

abc123

abc123

abc123

[email protected]

TheRealMT

HMS_Surprise

[email protected]

Sir Arthur Conan Doyle [email protected]

[email protected]

SirDoyle

GrandpaD

Fyodor Dostoyevsky

Patrick O'Brien

Mark Twain

Rowkey name email

Cells

Each cell has multiple versions,

typically represented by the timestamp

of when they were inserted into the table

(ts2>ts1)

ts1=1329088321289 ts2=1329088818321

The table is lexicographicallysorted on the rowkeys

Langhorneabc123

12

3

4

The coordinates used to identify data in an HBase table are:(1) rowkey, (2) column family, (3) column qualifier, (4) version

Monday, April 1, 13

Page 44: Building apps with HBase - Data Days Texas March 2013

Key-­‐Value  store

43

[TheRealMT, info, password, 1329088818321]

[TheRealMT, info, password, 1329088321289]

abc123

Langhorne

Keys Values

A single KeyValue instance

Monday, April 1, 13

Page 45: Building apps with HBase - Data Days Texas March 2013

Key-­‐Value  store

44

[TheRealMT, info, password, 1329088818321] abc123

[TheRealMT, info, password] 1329088818321 : "abc123",1329088321289 : "Langhorne"

}

{

[TheRealMT, info] },

"name" : { 1329088321289 : "Mark Twain"

"email" : { 1329088321289 : "[email protected]" },

"password" : { 1329088818321 : "abc123", 1329088321289 : "Langhorne" } }

{

},

"info" : {

"name" : { 1329088321289 : "Mark Twain"

"email" : { 1329088321289 : "[email protected]" },

"password" : { 1329088818321 : "abc123", 1329088321289 : "Langhorne" } } }

{

[TheRealMT]

Keys

Values

Start with coordinates of full precision1

Drop version and you're left with a map of version to values2

Omit qualifier and you have a map of qualifiers to the previous maps3

Finally, drop the column family and you have a row, a map of maps4

Monday, April 1, 13

Page 46: Building apps with HBase - Data Days Texas March 2013

Sorted  map  of  maps

45

Rowkey

Column family

Column qualifiers

Versions

Values

{

},

"TheRealMT" : { "info" : {

"name" : { 1329088321289 : "Mark Twain"

"email" : { 1329088321289 : "[email protected]" },

"password" : { 1329088818321 : "abc123", 1329088321289 : "Langhorne" } } }}

Monday, April 1, 13

Page 47: Building apps with HBase - Data Days Texas March 2013

HFiles  and  physical  data  model

• HFiles  are• Immutable• Sorted  on  rowkey  +  qualifier  +  7mestamp• In  the  context  of  a  column  family  per  region

46

, , , "TheRealMT" "info" "password" , 1329088321289 "Langhorne", , , "TheRealMT" "info" "password" , 1329088818321 "abc123",

, , , , "TheRealMT" "info" "email" 1329088321289 "[email protected]", , , , "TheRealMT" "info" "name" 1329088321289 "Mark Twain"

HFile for the info column family in the users table

Monday, April 1, 13

Page 48: Building apps with HBase - Data Days Texas March 2013

47

...  what  really  goes  on  under  the  hood

HBase  in  distributed  mode

Monday, April 1, 13

Page 49: Building apps with HBase - Data Days Texas March 2013

Distributed  mode

• Table  is  split  into  Regions

48

00001 John 415-111-123400002 Paul 408-432-992200003 Ron 415-993-212400004 Bob 818-243-998800005 Carly 206-221-912300006 Scott 818-231-256600007 Simon 425-112-987700008 Lucas 415-992-443200009 Steve 530-288-983200010 Kelly 916-992-123400011 Betty 650-241-119200012 Anne 206-294-1298

00001 John 415-111-123400002 Paul 408-432-992200003 Ron 415-993-212400004 Bob 818-243-9988

00005 Carly 206-221-912300006 Scott 818-231-256600007 Simon 425-112-987700008 Lucas 415-992-4432

00009 Steve 530-288-983200010 Kelly 916-992-123400011 Betty 650-241-119200012 Anne 206-294-1298

00001 John 415-111-123400002 Paul 408-432-992200003 Ron 415-993-212400004 Bob 818-243-9988

00005 Carly 206-221-912300006 Scott 818-231-256600007 Simon 425-112-987700008 Lucas 415-992-4432

00009 Steve 530-288-983200010 Kelly 916-992-123400011 Betty 650-241-119200012 Anne 206-294-1298

Full table T1 Table T1 split into 3 regions - R1, R2 and R3

T1R1

T1R2

T1R3

Monday, April 1, 13

Page 50: Building apps with HBase - Data Days Texas March 2013

Distributed  mode  (con7nued)

• Regions  are  served  by  RegionServers

• RegionServers  are  Java  processes,  typically  collocated  with  the  DataNodes

49

DataNode RegionServer

DataNode RegionServer

DataNode RegionServer

00001 John 415-111-123400002 Paul 408-432-992200003 Ron 415-993-212400004 Bob 818-243-9988

00005 Carly 206-221-912300006 Scott 818-231-256600007 Simon 425-112-987700008 Lucas 415-992-4432

00009 Steve 530-288-983200010 Kelly 916-992-123400011 Betty 650-241-119200012 Anne 206-294-1298

Host1

Host2

Host3

T1R1

T1R2

T1R3

Monday, April 1, 13

Page 51: Building apps with HBase - Data Days Texas March 2013

Finding  your  region

• Two  special  tables:  -­‐ROOT-­‐  and  .META.

50

-ROOT-

M1 M2 M3

-ROOT- table, consisting of only 1 regionhosted on region server RS1

.META. table, consisting of 3 regions,hosted on RS1, RS2 and RS3

User tables T1 and T2, consisting of4 and 3 regions respectively,

distributed amongst RS1, RS2 and RS3

RS1

RS2

T2R1

RS2

T2R2

RS2

T2R3

RS1

RS1RS3

T1R1

RS3

T1R3

RS3

T1R2

RS1

T2R4

RS1

Monday, April 1, 13

Page 52: Building apps with HBase - Data Days Texas March 2013

Regions  distributed  across  the  cluster

51

DataNode RegionServer

00001 John 415-111-123400002 Paul 408-432-992200003 Ron 415-993-212400004 Bob 818-243-9988

00005 Carly 206-221-912300006 Scott 818-231-256600007 Simon 425-112-987700008 Lucas 415-992-4432

00009 Steve 530-288-983200010 Kelly 916-992-123400011 Betty 650-241-119200012 Anne 206-294-1298

Host1

Host2

M:T100001-M:T100009 M1 RS2M:T100010-M:T100012 M2 RS1

T1:00001-T1:00004 R1 RS1T1:00005-T1:00008 R2 RS2

T1:00009-T1:00012 R3 RS2

T1-R1

T1-R2

T1-R3

.META. - M1

-ROOT-

.META. - M2

RegionServer 1 (RS1), hosting regions R1 of the user table T1 and M2 of .META.

RegionServer 2 (RS2), hosting regions R2, R3 of the user table and M1 of .META.

RegionServer 3 (RS3), hosting just -ROOT-

Host3

DataNode RegionServer

DataNode RegionServer

Monday, April 1, 13

Page 53: Building apps with HBase - Data Days Texas March 2013

What  the  client  does

52

ZooKeeper

TheRealMT

-ROOT-

.META.

MyTable

Monday, April 1, 13

Page 54: Building apps with HBase - Data Days Texas March 2013

What  the  client  does

52

ZooKeeper

TheRealMT

-ROOT-

.META.

MyTable

Monday, April 1, 13

Page 55: Building apps with HBase - Data Days Texas March 2013

What  the  client  does

52

ZooKeeper

TheRealMT

-ROOT-

.META.

MyTable

Row per META region

Monday, April 1, 13

Page 56: Building apps with HBase - Data Days Texas March 2013

What  the  client  does

52

ZooKeeper

TheRealMT

-ROOT-

.META.

MyTable

Row per META region

Row per table region

Monday, April 1, 13

Page 57: Building apps with HBase - Data Days Texas March 2013

What  the  client  does

52

ZooKeeper

TheRealMT

-ROOT-

.META.

MyTable

Row per META region

Row per table region

Monday, April 1, 13

Page 58: Building apps with HBase - Data Days Texas March 2013

53

...  they  play  well  together

HBase  and  Hadoop

Monday, April 1, 13

Page 59: Building apps with HBase - Data Days Texas March 2013

HBase  and  Hadoop

• HBase  has  7ght  integra7on  with  Hadoop• Two  separate  concerns

• Access  via  MapReduce• Storage  on  HDFS

54

Monday, April 1, 13

Page 60: Building apps with HBase - Data Days Texas March 2013

Parallel  processing

55

Monday, April 1, 13

Page 61: Building apps with HBase - Data Days Texas March 2013

MapReduce

56

Monday, April 1, 13

Page 62: Building apps with HBase - Data Days Texas March 2013

HBase  and  MapReduce

• HBase  is  7ghtly  integrated  with  the  Hadoop  stack• MapReduce• HDFS

• Source  for  MapReduce  jobs• Read  from  an  HBase  Table  with  a  Map-­‐Reduce  Job  (Export,  Row  Count,  ...)

• Sink  for  MapReduce  jobs• Wri7ng  to  an  HBase  Table  from  a  Map-­‐Reduce  Job

• Lookups  during  MapReduce  jobs• Join  between  hbase  and/or  another  source

57

Monday, April 1, 13

Page 63: Building apps with HBase - Data Days Texas March 2013

HBase  as  MR  source

• One  mapper  per  region

58

RegionServer

RegionServer

RegionServer

….

Map Tasks - one per region.The tasks take the key range

of the region as their input splitand scan over it

Regions beingserved by theRegion Server

Monday, April 1, 13

Page 64: Building apps with HBase - Data Days Texas March 2013

HBase  as  MR  source  (con7nued)

• Mapperprotected void map(ImmutableBytesWritable  rowkey,  Result  result,

Context context) { // mapper logic goes here }

• Job  setup Scan scan = new Scan(); scan.addColumn(Bytes.toBytes("twits"), Bytes.toBytes("twit")); TableMapReduceUtil.initTableMapperJob("twits", scan, Map.class, ImmutableBytesWritable.class, Result.class, job);

59

Monday, April 1, 13

Page 65: Building apps with HBase - Data Days Texas March 2013

HBase  as  MR  sink

• Write  to  regions  via  MR  job

60

RegionServer

RegionServer

RegionServer

….

Reduce tasks writing to HBaseregions. Reduce tasks don't necessarily

write to a region on the same physical host. They will write to whichever region contains

the key range that they are writing into.This could potentially mean that all reduce

tasks talk to all regions on the cluster

Regions beingserved by theRegion Server

Monday, April 1, 13

Page 66: Building apps with HBase - Data Days Texas March 2013

HBase  as  MR  sink  (con7nued)

• Reducerprotected void reduce(ImmutableBytesWritable  rowkey,

Iterable<Put> values, Context context) { // reduce logic goes here }

• Job  setupTableMapReduceUtil.initTableReducerJob("users",

Reducer.class, job);

61

Monday, April 1, 13

Page 67: Building apps with HBase - Data Days Texas March 2013

HBase  as  a  shared  resource

• Lookups  from  MR  jobs• Map-­‐side  joins• Simple  Get/Scan  calls

62

Datanode

RegionServer

Datanode

RegionServer

Datanode

RegionServer

….

Map Tasks readingfiles from HDFS as

their sourceand doing randomgets or short scans

from HBase

Regions beingserved by theRegion Server

HBase

Monday, April 1, 13

Page 68: Building apps with HBase - Data Days Texas March 2013

Integra7on  with  HDFS

• HDFS  is  a  filesystem• Underlying  storage  fabric  for  HBase• HBase  7ghtly  integrated  with  HDFS  and  leverages  the  seman7cs

• LSM  trees  fit  well  into  a  write  once  filesystem• Availability  (fault  tolerance)  and  reliability  (durability)  at  scale

63

Monday, April 1, 13

Page 69: Building apps with HBase - Data Days Texas March 2013

Availability  and  fault  tolerance

64

HDFS

Physical host

Region Server

Datanode

Region Server Region Server Region Server

HDFS

Datanode Datanode Datanode

HFile1 HFile1 HFile1

Physical host Physical host Physical host

Region Server

Datanode

Region Server Region Server Region Server

Datanode Datanode Datanode

HFile1 HFile1 HFile1

Physical host Physical host Physical host

Serving region R1

Serving region R1

Physical host

Disaster strikes onthe host which was serving

region R1

Another server picks up theregion and starts serving it

from the persisted HFilefrom HDFS. The lost copy ofthe HFile will get replicated

to another Datanode

Monday, April 1, 13

Page 70: Building apps with HBase - Data Days Texas March 2013

Durability

• WAL  keeps  writes  intact  even  if  RS  fails• HDFS  is  reliable  and  doesn’t  lose  data  if  individual  nodes  fail

65

Monday, April 1, 13

Page 71: Building apps with HBase - Data Days Texas March 2013

Summary

• Big  data  =  new  products,  applica7ons• Hadoop  ecosystem  -­‐  poster  child• HBase  serves  many  use  cases• Allows  to  solve  aspects  of  big  data  problems• Simple  Java  API• Tight  Hadoop  integra7on• Low  latency  access  to  large  amounts  of  data• Large  scale  computa7on  using  MR

66

Monday, April 1, 13

Page 72: Building apps with HBase - Data Days Texas March 2013

67

Monday, April 1, 13

Page 73: Building apps with HBase - Data Days Texas March 2013

68

...  it’s  a  database  aper-­‐all

Schema  design

Monday, April 1, 13

Page 74: Building apps with HBase - Data Days Texas March 2013

But  isn’t  HBase  schema-­‐less?

• Number  of  tables• Rowkey  design  • Number  of  column  families  per  table.  What  goes  into  what  column  family

• Column  qualifier  names• What  goes  into  the  cells• Number  of  versions

69

Monday, April 1, 13

Page 75: Building apps with HBase - Data Days Texas March 2013

Rowkeys

• Rowkey  design  is  the  single  most  important  aspect  of  HBase  table  designs

• The  only  way  to  address  rows  in  HBase

70

Monday, April 1, 13

Page 76: Building apps with HBase - Data Days Texas March 2013

TwitBase  rela7onships

• Users  follow  users• Rela2onships  need  to  be  persisted  for  usage  later  on• Model  tables  for  the  expected  access  paSerns• Read  paSern

• Who  does  A  follow?• Who  follows  A?• Does  A  follow  B?

• Write  paSern• A  follows  B• A  unfollows  B

71

Monday, April 1, 13

Page 77: Building apps with HBase - Data Days Texas March 2013

Start  simple

• Adjacency  list

72

Column Family : followsrow key:userid

cell value: followed userid

column qualifier: followed user number

4:HRogers1:HRogers

3:Olivia1:TheRealMTTheFakeMT2:Olivia

2:MTFanBoyTheRealMT

followsCol Qualifier

Cell value

Monday, April 1, 13

Page 78: Building apps with HBase - Data Days Texas March 2013

Op7mizing  the  adjacency  list

• We  need  a  count• Where  does  a  new  followed  user  go?

73

2:Olivia count:2count:4TheFakeMT 4:HRogers3:Olivia2:MTFanBoy1:TheRealMT

TheRealMT 1:HRogers

follows

Monday, April 1, 13

Page 79: Building apps with HBase - Data Days Texas March 2013

Adding  a  new  user

74

2:Olivia count:2count:4TheFakeMT 4:HRogers3:Olivia2:MTFanBoy1:TheRealMT

TheRealMT 1:HRogers

follows

Row that needs to be updated

Client code:Step 1: Get current countStep 2: Update countStep 3: Add new entryStep 4: Write the new data to HBase

1

2

TheFakeMT : follows: {count -> 4}

increment count

TheFakeMT : follows: {count -> 5}

3 add new entry

TheFakeMT : follows: {5 -> MTFanBoy2, count -> 5}

count:52:Olivia count:2

5:MTFanBoy2TheFakeMT 4:HRogers3:Olivia2:MTFanBoy1:TheRealMTTheRealMT 1:HRogers

follows

4

Monday, April 1, 13

Page 80: Building apps with HBase - Data Days Texas March 2013

Transac7ons  ==  not  good

• HBase  doesn’t  have  na7ve  support  (think  scale)• Don’t  want  to  complicate  client  side  logic• Only  solu7on  -­‐>  simplify  schema

75

Olivia:1MTFanBoy:1TheFakeMTOlivia:1HRogers:1

HRogers:1TheRealMT:1TheRealMT

follows

Monday, April 1, 13

Page 81: Building apps with HBase - Data Days Texas March 2013

Revisit  the  ques7ons

• Read  paGern• Who  all  does  A  follow?• Who  all  follows  A?• Does  A  follow  B?

• Write  paGern• A  follows  B• A  unfollows  B

76

Monday, April 1, 13

Page 82: Building apps with HBase - Data Days Texas March 2013

Revisit  the  ques7ons

76

Monday, April 1, 13

Page 83: Building apps with HBase - Data Days Texas March 2013

Denormaliza7on

• Second  table  for  reverse  rela7onship• Otherwise  scan  across  en7re  table  and  affect  read  performance

77

DenormalizationPoor design

DreamlandNormalization

Read performance

Writ

e pe

rform

ance

Monday, April 1, 13

Page 84: Building apps with HBase - Data Days Texas March 2013

More  op7miza7ons

• Convert  into  tall-­‐narrow  table• Leverage  rowkey  indexing  beGer• Gets  -­‐>  short  Scans

78

CF : f

row key:follower + followed

cell value: 1

CQ: followed user's nameThe + in the row key refers to concatenating

the two values. You could delimitateusing any character you like.

eg: A-B or A,B

Keeping the column family and column qualifiernames short reduces the data transferred over thenetwork back to the client. The KeyValue objects

become smaller.

Monday, April 1, 13

Page 85: Building apps with HBase - Data Days Texas March 2013

Tall-­‐narrow  table  example

• Denormaliza7on  is  the  way  to  go

79

TheRealMT+HRogers Henry Rogers:1Olivia Clemens:1TheRealMT+Olivia

Amandeep Khurana:1TheFakeMT+MTFanBoyOlivia Clemens:1

Mark Twain:1TheFakeMT+TheRealMT

TheFakeMT+OliviaHenry Rogers:1TheFakeMT+HRogers

f Putting the user name in the columnqualifier saves you from looking upthe users table for the name of theuser given an id. You can simply

list out names or ids while lookingat relationships just from this table.

The downside of this is that you needto update the name in all the cellsif the user updates their name in

their profile.This is classic Denormalization.

Monday, April 1, 13

Page 86: Building apps with HBase - Data Days Texas March 2013

Uniform  rowkey  length

• MD5  the  userids  -­‐>  16  bytes  +  16  bytes  rowkeys• BeGer  distribu7on  of  load

80

CF : f

row key:md5(follower)md5(followed)

cell value: followed users name

CQ: followed useridUsing MD5 of the user ids gives you

fixed lengths instead of variablelength user ids. You don't needconcatenation logic anymore.

Monday, April 1, 13

Page 87: Building apps with HBase - Data Days Texas March 2013

Uniform  rowkey  length  (con7nued)

81

MD5(TheRealMT) MD5(HRogers) HRogers:Henry RogersOlivia:Olivia ClemensMD5(TheRealMT) MD5(Olivia)

MTFanBoy:Amandeep KhuranaMD5(TheFakeMT) MD5(MTFanBoy)Olivia:Olivia Clemens

TheRealMT:Mark TwainMD5(TheFakeMT) MD5(TheRealMT)

MD5(TheFakeMT) MD5(Olivia)HRogers:Henry RogersMD5(TheFakeMT) MD5(HRogers)

f

Monday, April 1, 13

Page 88: Building apps with HBase - Data Days Texas March 2013

Tall  v/s  Wide  tables  storage  footprint

82

r5 c3:v5 c7:v8c1:v1

r4 c2:v4

c2:v3 c5:v6r3

c1:v2 c3:v6r2

c6:v2c1:v9c1:v1r1

CF1 CF2

r1:CF1:c1:t1:v1r2:CF1:c1:t2:v2r2:CF1:c3:t3:v6r3:CF1:c2:t1:v3r4:CF1:c2:t1:v4r5:CF1:c1:t2:v1r5:CF1:c3:t3:v5

r1:CF2:c1:t1:v9r1:CF2:c6:t4:v2r3:CF2:c5:t4:v6r5:CF2:c7:t3:v8

HFile for CF1 HFile for CF2

r5:cf2:c7:t3:v8r5:CF1:c3:t3:v5r5:CF1:c1:t2:v1

Result object returned for a Get() on row r5

KeyValue objects

Cell Value

TimeStamp

Col Qual

Col Fam

Row Key

Key Value

Logical representation of an HBase table.We'll look at what it means to Get() row r5 from this table. Actual physical storage of the table

Structure of a KeyValue object

Monday, April 1, 13

Page 89: Building apps with HBase - Data Days Texas March 2013

Rowkey  design

• Single  most  important  aspect  of  designing  tables• Depends  on  expected  access  paGerns• HFiles  are  sorted  on  Key  part  of  KeyValue  objects

83

, , , "TheRealMT" "info" "password" , 1329088321289 "Langhorne", , , "TheRealMT" "info" "password" , 1329088818321 "abc123",

, , , , "TheRealMT" "info" "email" 1329088321289 "[email protected]", , , , "TheRealMT" "info" "name" 1329088321289 "Mark Twain"

HFile for the info column family in the users table

Monday, April 1, 13

Page 90: Building apps with HBase - Data Days Texas March 2013

Write  op7mized

• Distribute  writes  across  the  cluster• Issue  most  pronounced  with  7me  series  data

• Hashinghash("TheRealMT") -> random byte[]

• Sal7ngint salt = new Integer(new Long(timestamp).hashCode()).shortValue() % <number of region servers>;byte[] rowkey = Bytes.add(Bytes.toBytes(salt) + Bytes.toBytes("|") + Bytes.toBytes(timestamp));

84

Monday, April 1, 13

Page 91: Building apps with HBase - Data Days Texas March 2013

Read  op7mized

• Data  to  be  accessed  together  should  be  stored  together• eg:  twit  streams  -­‐  last  10  twits  by  the  users  I  follow

85

Olivia1Olivia2Olivia5Olivia7Olivia9TheFakeMT2TheFakeMT3TheFakeMT4TheFakeMT5TheFakeMT6TheRealMT1TheRealMT2TheRealMT5TheRealMT8

1Olivia1TheRealMT2Olivia2TheFakeMT2TheRealMT3TheFakeMT4TheFakeMT5Olivia5TheFakeMT5TheRealMT6TheFakeMT7Olivia8TheRealMT9Olivia

Monday, April 1, 13

Page 92: Building apps with HBase - Data Days Texas March 2013

Rela7onal  to  Non  rela7onal

• Rela7onal  concepts• En77es• AGributes• Rela7onships

• En77es• Table  is  a  table.  Not  much  going  on  there• Users  table  contains...  users.  Those  are  en77es

• Good  place  to  start.  Denormaliza7on  bends  that  rule  a  bit

86

Monday, April 1, 13

Page 93: Building apps with HBase - Data Days Texas March 2013

Rela7onal  to  Non  rela7onal  (con7nued)

• AGributes• Iden7fying

• Primary  keys.  Compound  keys• Maps  to  rowkeys

• Non-­‐iden7fying• Other  columns• Maps  to  column  qualifiers  and  cells

• Rela7onships

87

Monday, April 1, 13

Page 94: Building apps with HBase - Data Days Texas March 2013

Nested  En77es

• Column  Qualifiers  can  contain  data  instead  of  just  a  column  name

88

hbase tablerow key

column family

repeating entity

fixed qualifier → timestamp → value

variable qualifier → timestamp → value

Nested entities

Monday, April 1, 13

Page 95: Building apps with HBase - Data Days Texas March 2013

Schema  design  summary

• Schema  can  make  or  break  the  performance  you  get• Rowkey  is  the  single  most  important  thing

• Use  tricks  like  hashing  and  sal2ng• Denormalize  to  your  advantage

• There  are  no  joins• Isolate  access  paSerns

• Separate  CFs  or  even  separate  tables• Shorter  names  -­‐>  lower  storage  footprint• Column  qualifiers  can  be  used  to  store  data  and  not  just  column  names

• Big  difference  as  compared  to  RDBMS

89

Monday, April 1, 13

Page 96: Building apps with HBase - Data Days Texas March 2013

90

...  for  some  real  stuff,  beyond  your  laptop

Deployments  and  Opera2ons

Monday, April 1, 13

Page 97: Building apps with HBase - Data Days Texas March 2013

Components

• HDFS  (DataNodes)• Storage

• ZooKeeper• Membership  management

• RegionServers• Serve  the  regions

• HBase  Masters• Janitorial  work

91

Monday, April 1, 13

Page 98: Building apps with HBase - Data Days Texas March 2013

Hardware

• DataNodes  and  RegionServers  collocated• Plenty  of  RAM  and  Disk• 8-­‐12  cores• 48  GB  RAM• 12  x  1  TB  disks

92

Monday, April 1, 13

Page 99: Building apps with HBase - Data Days Texas March 2013

Hardware  (con7nued)

• HBase  Master  and  ZooKeeper  collocated• Don’t  necessarily  need  to  be  though• Keep  odd  number  of  ZKs

• ~3  is  good  enough  for  upto  50-­‐70  node  cluster  typically

• Give  ZK  independent  spindle  for  log  wri7ng• I/O  latency  sensi7ve

• 4-­‐6  cores• 24  GB  RAM

93

Monday, April 1, 13

Page 100: Building apps with HBase - Data Days Texas March 2013

Types  of  deployments• Standalone  for  dev  purpose• Prototype

• <  10  nodes• No  real  SLAs• 1  ZK  +  Master,  rest  are  RS

• Small  produc2on  cluster• 10-­‐20  nodes• 1  ZK  +  Master,  maybe  NNHA,  rest  are  RS

• Medium  produc2on  cluster• 20-­‐50  nodes• 3  ZK  +  Master,  NNHA,  rest  are  RS

• Large  produc2on  cluster• 50+  nodes• 3  or  5  ZK  +  Master,  NNHA,  rest  are  RS

• Hardware  choices  remain  the  same  mostly

94

Monday, April 1, 13

Page 101: Building apps with HBase - Data Days Texas March 2013

Deploying  sopware  and  configura7on

• Automated  sopware  bits  deployment  highly  recommended• Chef• Puppet• Cloudera  Manager

• Centralized  configura7on  management  highly  recommended• Consistency• Ease  of  management

95

Monday, April 1, 13

Page 102: Building apps with HBase - Data Days Texas March 2013

Configura7ons

• $HBASE_HOME/conf• hbase-­‐env.sh

• Environment  configura2ons

• hbase-­‐site.xml• HBase  system  configura2ons

• Linux  configura2ons• Number  of  open  files• swappiness

• HDFS  configura2ons• $HADOOP_HOME/hdfs-­‐site.xml

96

Monday, April 1, 13

Page 103: Building apps with HBase - Data Days Texas March 2013

Monitoring

• Metrics  context  from  Hadoop• Ganglia• File  context

• JMX• Cac7

• Several  metrics  available• Write  path  specific• Read  path  specific

97

Monday, April 1, 13

Page 104: Building apps with HBase - Data Days Texas March 2013

Monitoring  (con7nued)

98

Monday, April 1, 13

Page 105: Building apps with HBase - Data Days Texas March 2013

Performance  tuning

• Performance  tes2ng• Use  real  (or  synthe2c)  workloads• YCSB

• Tuning• Write  op2mized

• Memstore

• Random-­‐read  op2mized• Block  cache• Smaller  block  size

• Sequen2al-­‐read  op2mized• Block  cache• Higher  block  size

• GC  tuning

99

Monday, April 1, 13

Page 106: Building apps with HBase - Data Days Texas March 2013

100

Monday, April 1, 13