HBase at LINE

中村俊介, Shunsuke Nakamura (LINE, twitter, facebook: sunsuk7tp)

NHN Japan Corp.

HBase at LINE: How to grow our storage together with the service

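Self-introduction: Shunsuke Nakamura (中村俊介)

• 2011.10: Joined the former […] Japan as a new graduate ([…] Japan since 2012.1)

• LINE server engineer, storage team

• Master of Science @ Tokyo Institute of Technology, Shudo Lab

• Distributed Processing, Cloud Storage, and NoSQL

• MyCassandra [CASSANDRA-2995]: A modular NoSQL with Pluggable Storage Engine based on Cassandra

• Hatena intern / part-time infrastructure job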

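• NHN = Next Human Network

• NAVER Korea: search portal site

• The Korean headquarters holds a 70% search share

• Originally an in-house venture of Samsung

• NAVER = Navigate + er

• NAVER Japan

• Japan is in its third year this year

• Since the management integration, NAVER is a service name, a group, a religion

• LINE, Matome, image search, NDrive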

NHN/NAVER

• NAVER

• Hangame

• livedoor

• DataHotel

• NHN ST

• JLISTING

• Mediator

• Shinku (深紅)

Korean headquarters: Green Factory

LINE is NHN Japan STAND-ALONE

8.17: 55 million users (25 million in Japan). AppStore Ranking - Top 1

Japan, Taiwan, Thailand, HongKong, Saudi, Malaysia, Bahrain, Jordan, Qatar, Singapore, Indonesia, Kazakhstan, Kuwait, Israel, Macau, Ukraine, UAE, Switzerland, Australia, Turkey, Vietnam, Germany, Russia

LINE Roadmap

2011.6 iPhone first release

Android first release

LINE Card/Camera/Brush

2011.8

2011.10: I joined the LINE team.

PC (Win/Mac), Multi device Sticker Shop

LINE platform

Bots (News, Auto-translation, Public account, Gurin)

Sticker

WP first release

BB first release

Timeline

2012.6

2012.8

Target of LINE Storage

1. Performing well (put < 1ms, get < 5ms)

2. A highly scalable, available, and eventually consistent storage system built around NoSQL

3. Geographical distribution

43.2% 56.8% (Japan, Global)

future

LINE Storage and NoSQL

1. Performing well

2. A highly scalable, available, and eventually consistent storage system

3. Geographical distribution

At first, LINE launched with Redis.

Initial LINE Storage

• Target: 1M DL within 2011

• Client-side sharding with a few Redis nodes

• Redis

• Performance: O(1) lookup

• Availability: Async snapshot + slave nodes (+ backup MySQL)

• Capacity: memory + Virtual Memory

[Diagram: app server, queue, dispatcher, storage (master, ...), backup]

August 28~29 2011 Kuwait Saudi Arabia Qatar Bahrain…

Over 1 million

• Sharded Redis

• Shard coordination: ZKs + Manager Daemons(shard allocation using consistent hashing, health check&failover)

x 3 or 5
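The slide only names consistent hashing as the shard allocation method. As a rough illustration, a minimal hash ring could look like the sketch below. The class name `HashRing`, the virtual-node count, the MD5 hash, and the shard names are all assumptions made for this example, not details of LINE's manager daemons.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # Any stable hash works; MD5 is used here only for determinism.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add(self, node):
        # Each node owns several points on the ring to smooth the distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        # Failover: drop a dead shard; only its keys move to other shards.
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[i % len(self._ring)][1]
```

The property that matters for failover is that removing one shard leaves the placement of every key on the surviving shards unchanged.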

October 13 2011 Hong Kong

However, in fact...

100M DL within 2012

Billions of Messages/Day...

We ran into so many problems every day in 2011.10...

Redis is NOT easily scalable, NOT persistent, and easily dies

2. A highly scalable, available, and eventually consistent storage system built around

Take a storage system designed for 1M users within 2011 and make it handle 1B users within 2012, without stopping the service

Zuckerberg’s Law of Sharing (July 7, 2011)

Y = C * 2^X (Y: sharing data, X: time, C: constant)

Sharing activity will double each year.

How many messages per month does LINE handle?

1 billion x 30 = 30 billion messages/month
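The arithmetic on this slide can be written out as a short sketch. The 30-day month comes from the slide; reading the first factor as messages per day, and the function and constant names, are assumptions made for this example.

```python
MESSAGES_PER_DAY = 1_000_000_000  # 1 billion, assumed to be a per-day figure
DAYS_PER_MONTH = 30               # multiplier used on the slide


def messages_per_month(per_day=MESSAGES_PER_DAY, days=DAYS_PER_MONTH):
    """Monthly message volume at a constant daily rate."""
    return per_day * days


def projected_sharing(c, years):
    """Zuckerberg's law as stated on the slide: Y = C * 2^X."""
    return c * 2 ** years
```

For example, `messages_per_month()` gives 30 billion, and `projected_sharing(messages_per_month(), 3)` gives the volume after three doublings.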

Data and Scalability

• constant

• DATA: async operation

• SCALE: thousands ~ millions per queue

• linear

• DATA: users’ contact and group

• SCALE: millions ~ billions

• exponential

• DATA: message and message inbox

• SCALE: tens of billions ~

[Chart: constant, linear, and exponential growth curves; axis ticks from 0 to 10,000,000,000]

Data and Workload

• constant

• FIFO

• read & write fast

• linear

• Zipfian

• read fast [w3~5 : r95]

• exponential

• latest

• write fast [w50 : r50]

[Figures: Queue, Zipfian curve, Message timeline]

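Choosing Storage

• constant: Redis

• linear, exponential: several options

• HBase

• ◯ Fits the workload; NoSQL on a DFS is easy to operate (plus, we have DFS specialists)

• × SPOF; 99th-percentile random-read performance is somewhat low

• Cassandra

• ◯ Fits the workload; no SPOF (no coordinator, rack/DC-aware replication)

• × Operational cost that comes with weak consistency; complex implementation (especially CAS operations)

• MongoDB

• ◯ Convenient features (auto-sharding/failover, various queries) → aimed at analytics, unnecessary for us

• × Does not fit the workload; poor use of bandwidth and disk

• MySQL Cluster

• ◯ Familiar (we operate up to nearly several thousand servers per service)

• × We should use something designed from the start to be distributed and write-scalable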

HBase

• Can store hundreds of TB

• Write-scalable, with efficient random access over large volumes of data

• Semi-structured model (< MongoDB, Redis)

• No advanced RDBMS features (transactions, joins)

• Strong consistency per row and column family

• NoSQL built on a DFS

• No replica management needed / regions are easy to move

• Multi-partition allocation per RS

• Ad hoc load balancing

LINE Storage (2012.3)

[Architecture diagram: clients (iPhone, Android, WAP; x 25 million) connect to app. servers (nginx; Thrift API / Authentication / Renderer), backed by Sharded Redis clusters (message, contact, group), Redis Queues with dispatchers for async, failed, and backup operations, Message HBase, Contact HBase, and Backup MySQL. Node-count labels: x 100, x 400, x 2, x 100.]

LINE Storage (2012.7)

[Architecture diagram: clients (phone: iPhone/Android/WP/BlackBerry/WAP; PC: Win/Mac; x 50 million) connect to app. servers (nginx; Thrift API / Authentication / Renderer), backed by Sharded Redis clusters (message, contact, group), Redis Queues with dispatchers for async, failed, and backup operations, Primary HBase, Msg HBase01 and Msg HBase02 (HDFS01, HDFS02), and Backup MySQL. Node-count labels: x 200, x 600, x 2, x 200.]

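2012.3 → 2012.7

• Users doubled, infrastructure doubled

• Still "casual data" by HBase standards

• Message HBase is a dual cluster

• Clusters are switched according to the message TTL (TTL: 2 weeks → 3 weeks)

• HDFS DataNodes are also used to run M/R for HBase

• Sharded Redis is still basically the primary (400 → 600)

• Messages are also read (get) from HBase

• For everything else, only the model is backed up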

LINE Data on HBase

• LINE data

• MODEL: <key> → <model>

• INDEX: <key> ↔ <property in model>

• User: <userId> → <User obj>, <userId> ↔ <phone>

• Each model is represented as one row

• HBase consistency: strong consistency is guaranteed per row and column family

• Data holding multiple models, such as contacts, uses qualifiers (columns)

• Data that needs range queries is gathered into a single row (e.g. message inbox)

• Con: latency grows linearly with the number of columns → delete, search filter with timestamp
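The MODEL/INDEX layout above can be pictured with a small in-memory sketch. Plain dictionaries stand in for the HBase tables, and the class and method names (`UserTable`, `put`, `get_by_phone`) are hypothetical. Unlike this sketch, the deck describes index building on HBase as eventual, so the two tables are not updated together there.

```python
class UserTable:
    """In-memory stand-in for a model table plus one index table."""

    def __init__(self):
        self.model = {}        # MODEL: <userId> -> <User obj>
        self.phone_index = {}  # INDEX: <phone> -> <userId>

    def put(self, user_id, user):
        # Drop the stale index entry if the indexed property changed.
        old = self.model.get(user_id)
        if old is not None and old["phone"] != user["phone"]:
            self.phone_index.pop(old["phone"], None)
        self.model[user_id] = user
        self.phone_index[user["phone"]] = user_id

    def get_by_phone(self, phone):
        # Index lookup first, then a single-row read of the model.
        user_id = self.phone_index.get(phone)
        return None if user_id is None else self.model[user_id]
```

A lookup by phone number therefore costs two single-key reads and no scan.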

[Figure: User Model with userId, email, phone]

timestamp, version

• Column level timestamp

• Indexes are built using the model's timestamp

• The API execution timestamp is used for async and failure handling

• Also used as a search filter (con: TTL cannot be used)

• Multiple versioning

• Binding multiple emails (e.g. Google account password history)

• Data trace for customer support

Primary key for LINE

• Generated from a long-type key, e.g. userId, messageId

• Simple and random, for single lookups

• No need to consider locality for range queries

• prefix(key) + key

• prefix(key): ALPHABETS[key%26] + key%1000

• Implement o.a.h.hbase.util.RegionSplitter.SplitAlgorithm

• Region splitting by prefix
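The prefix formula above can be sketched as follows. The zero-padding to three digits, the plain string concatenation of prefix and key, and the 250-wide split step are assumptions made for this example. The actual implementation is a Java `RegionSplitter.SplitAlgorithm`, which is not reproduced here.

```python
import bisect
import string

ALPHABETS = string.ascii_lowercase  # 26 letters, indexed by key % 26


def prefix(key):
    """prefix(key) = ALPHABETS[key % 26] + key % 1000 (zero-padded: assumed)."""
    return f"{ALPHABETS[key % 26]}{key % 1000:03d}"


def row_key(key):
    """Row key = prefix(key) + key (concatenation format is assumed)."""
    return prefix(key) + str(key)


def split_points(step=250):
    """Region start keys such as a000, a250, a500, a750, b000, ..."""
    return [f"{c}{n:03d}" for c in ALPHABETS for n in range(0, 1000, step)]


def region_for(rowkey, points):
    """Start key of the region whose range contains rowkey."""
    return points[bisect.bisect_right(points, rowkey) - 1]
```

Consecutive ids such as 2600, 2601, and 2602 get the prefixes a600, b601, and c602, so sequential writes spread across regions instead of hitting one.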

[Figure: HRegions split at prefix boundaries (a000, a250, a500, a750, b000, c, d); example keys 2600, 2626, 2652, 2756, 2601, 2602, 2782, 2808]

Data stored in HBase

• Message, Inbox

• exponential scale

• immutable

• User, Contact, Group

• linear scale

• mutable

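Message, Inbox: performance and availability first

• Hybrid configuration with Sharded Redis

• It is enough if reads and writes succeed on either side (< quorum)

• Failed queries are queued in the JVM heap or Redis and retried

• Immutable & idempotent queries: no consistency or duplication problems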
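A rough sketch of the hybrid write path for messages: write to both stores, accept the write if either side succeeds, and queue the failed side for a later retry. `FlakyStore` and `HybridWriter` are hypothetical in-memory stand-ins, not the real Redis or HBase clients.

```python
from collections import deque


class FlakyStore:
    """In-memory stand-in for a Redis shard or an HBase table."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.available = True

    def put(self, key, value):
        if not self.available:
            raise ConnectionError(f"{self.name} unavailable")
        # Messages are immutable, so re-putting the same value is harmless.
        self.data[key] = value


class HybridWriter:
    def __init__(self, stores):
        self.stores = stores
        self.retry_queue = deque()  # pending (store, key, value) writes

    def put(self, key, value):
        succeeded = 0
        for store in self.stores:
            try:
                store.put(key, value)
                succeeded += 1
            except ConnectionError:
                self.retry_queue.append((store, key, value))
        if succeeded == 0:
            raise ConnectionError("write failed on every store")
        return succeeded  # one success is enough (weaker than a quorum)

    def retry(self):
        # One pass over the queue; writes that fail again stay queued.
        for _ in range(len(self.retry_queue)):
            store, key, value = self.retry_queue.popleft()
            try:
                store.put(key, value)
            except ConnectionError:
                self.retry_queue.append((store, key, value))
```

Because the puts are idempotent, replaying a queued write that has already landed is safe, which is what makes this simple retry loop sufficient.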

User, Contact: consistency over performance

• Sharded Redis is still the primary

• No scalability problems

• Mutable, so consistency matters

• Migrating from Redis to HBase (in progress)

• Only the Model Object is backed up

Migration from Redis to HBase

1. Back up the model

• Sync write to Redis, async write to HBase (Java Future, Redis queuing)

2. Full migration from Sharded Redis using M/R

3. Build indexes / inverted indexes from the models (eventual) ← we are here

• Batch Operation: w/ M/R, model table full-scan using TableMapper

• Incremental Operation: Diff logging and sequential indexing or Percolator, HBase Coprocessor

4. Switch the access path; turn Redis into a cache
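Step 1 (synchronous write to Redis, asynchronous write to HBase) can be sketched with Python futures in place of the Java `Future` the slide mentions. The dictionaries, `backup_to_hbase`, and `save_model` are illustrative stand-ins, not the actual client code.

```python
from concurrent.futures import ThreadPoolExecutor

redis_store = {}  # stand-in for the primary Sharded Redis
hbase_store = {}  # stand-in for the HBase backup table
executor = ThreadPoolExecutor(max_workers=4)


def backup_to_hbase(key, model):
    """Background backup write (hypothetical helper)."""
    hbase_store[key] = model


def save_model(key, model):
    # Synchronous write: the caller sees the primary updated on return.
    redis_store[key] = model
    # Asynchronous write: the returned future completes once the backup lands.
    return executor.submit(backup_to_hbase, key, model)
```

The caller can ignore the returned future on the hot path and inspect it later, for example to re-queue a failed backup write.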

Did moving to HBase make us happy?

In a sense, YES

• Scalability issues are solved

• At least through the end of this year

• Wide-area distribution → 3rd issue (to be continued...)

Failure Decreased?

ABSOLUTELY NOT!

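Impressions after operating HBase for 8 months

• HBase is a volcano

• Small eruptions every day

• They build up into an occasional big eruption

• Living safely at the foot of a volcano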

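Eruptions

• RegionServers (RS) dropping out due to intermittent network failures

• DataNode (DN) performance degradation due to H/W failures, and delayed detection

• Degraded get performance (get, increment, checkAndPut, checkAndDelete), which drags down overall performance

• Performance degradation due to (major) compaction

• Data inconsistency

• No SPOF-related problems have occurred yet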

HBase Availability

• There are several SPOFs, or components whose death causes downtime

1. HDFS NameNode

2. HBase RegionServer, DataNode

1. HDFS NameNode (NN)

• HA Framework for HDFS NN (HDFS-1623)

• Backup NN (0.21)

• Avatar NN (Facebook)

• HA NN using Linux HA

• Active/passive configuration deploying two NN (cloudera)

HA NN using Linux-HA

• Redundancy with DRBD + heartbeat

• DRBD: disk mirroring (RAID1-like)

• heartbeat: network monitoring

• pacemaker: resource management (registering the failover logic)

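2. RegionServer, DataNode

• HBase itself holds no replicas

• Downtime lasts until failover completes

• Because HBase consists of multiple components, going from failure detection to cluster-wide agreement means waiting for a timeout on each communication path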

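Downtime countermeasures

• HBase itself holds no replicas, so downtime always occurs when an RS dies

• distributed HLog splitting (>=cdh3u3)

• Timeout & retry

• Most of the downtime is the timeout between HClient ↔ RS

• Tune timeouts (per retry, per operation)

• If the RS ↔ ZK timeout is too short, RSs are easily evicted when the network is unstable

• Place regions holding the same key on the same RS → limits the scope of a failure

• LINE's HBase access is basically async

• Cluster replication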

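HBase cluster replication

• Cluster replication: master-push model

• (A binary-logging mechanism like MySQL's); see section 8.8 of the "horse book" (the O'Reilly HBase book)

• Asynchronously pushes the WAL (HLog) to the slave cluster

• Synchronous call per RS

• ZK manages the list of WALs not yet synced

• While evaluating it, we are also considering our own implementation or other approaches

• Not intended for replication across multiple DCs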

HDFS tuning for HBase

• Shortcut local client reads to the DataNode's files directly (> 0.23.1, 0.22.1, 1.0.0) (HDFS-2246)

• Rack-aware Replica Placement (HADOOP-692)

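Deletion problems

• Deletes are somewhat slow

• Because deletes are logical, they are not as slow as gets, but they take twice as long as puts

• e.g. account-deletion processing for a user with 10,000 contacts

• Too many columns cause client-side timeouts → queuing + iterative delete

• e.g. deleting messages whose TTL has passed

• Random I/O against cold data occurs and affects the service

• → dual cluster, full truncate, or use TTL

• e.g. dealing with spammers

• Get performance suffers until compaction runs (a large amount of skip processing)

• → delete per row rather than per column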
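The "queuing + iterative delete" remedy for the 10,000-contact case can be sketched as below. The batch size and the `delete_batch` callback are assumptions made for this example; in practice the callback would issue one bounded delete request against the store.

```python
from collections import deque


def iterative_delete(row, columns, delete_batch, batch_size=500):
    """Delete many columns of one row in small batches.

    Each call to delete_batch(row, batch) stays small enough to finish
    within the client timeout. Returns the number of batches issued.
    """
    queue = deque(
        columns[i:i + batch_size] for i in range(0, len(columns), batch_size)
    )
    batches = 0
    while queue:
        delete_batch(row, queue.popleft())
        batches += 1
    return batches
```

With 10,000 columns and a batch size of 500, this issues 20 small deletes instead of one request that times out.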

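Compaction countermeasures

• Bigtable: periodic compaction is required for I/O optimization and deletion

• Compactions are queued per RS and run one HRegion at a time

• CPU utilization rises while compaction runs, so mind the timing

• Triggers: periodic, StoreFile count, user-initiated

• Run compaction and region splitting off-peak so they do not occur back-to-back at peak time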

Balancing, Splitting, and Compaction

• Region balancing

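• The automatic balancer (request-count-based balancing) is OFF

• Balance during the service's off-peak hours

• Assign the same key of different tables to the same server → limits the scope of a failure

• A server dedicated to problematic regions: the prison RS

• Scheduling of region splitting and compaction

• Avoid automatic splits where possible (automatic splits are triggered by hbase.hregion.max.filesize)

• Avoid back-to-back major compactions

• Periodic compaction is OFF for immutable storage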

HBase Tools

• Client:

• HBaseTemplate: HBase wrapper like Spring’s RedisTemplate

• MirroringHTable: support for multiple HBase clusters

• Operations and monitoring:

• auto splitting: region splits during off-peak hours

• auto HRegion balancer: off-peak balancing based on metrics

• Region snapshot & restore: daily dump of the META table, restore when an RS dies

• Data Migration:

• Migrator with M/R (Redis → HBase)

• H2H copy tool with M/R: table copy (HBase → HBase)

• metrics collecting via JMX

• Index Builder and Inconsistent Fixer with M/R, incremental implementation (coprocessor)

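Future work

• Build indexes and a Redis cache around the <key, model> data on HBase

• Power outage and earthquake countermeasures (rack/DC awareness)

• HBase cluster replication

• Use Cassandra as geographically distributed storage for the HLog

• Scalability beyond today's (several hundred million to several billion users)

• HBase is network-bound; the limit is a little under several hundred nodes per cluster

• Either get by with multiple clusters or use Cassandra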
