It’s all about SCALE!!
How were web services scaled up in the past?
Source: http://www.slideshare.net/mallipeddi/scalable-lamp-development-for-growing-web-apps
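Tools used by large websites
Perlbal - http://www.danga.com/perlbal/
Load balancing across multiple web servers
MogileFS - http://www.danga.com/mogilefs/
Distributed file system. Some companies consider MogileFS better suited than Hadoop for handling small files.
memcached - http://memcached.org/
Shared memory cache: stores database data or other frequently read content in an in-memory cache.
Moxi - http://code.google.com/p/moxi/
A proxy for memcached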
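The memcached idea (keep frequently read database data in memory) is usually applied as a "cache-aside" read path. The following is only a minimal sketch of that pattern: `TTLCache` is an in-process stand-in for a real memcached client, and `slow_db_lookup` is a hypothetical database query, neither of which comes from the slides.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a memcached client (get/set with expiry)."""

    def __init__(self):
        self._store = {}  # key -> (value, expiry time or None)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            del self._store[key]  # expired entries behave like a miss
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = time.time() + ttl if ttl else None
        self._store[key] = (value, expires_at)


db_calls = []  # records every time the "database" is actually queried


def slow_db_lookup(user_id):
    """Hypothetical database query; it just fabricates a row."""
    db_calls.append(user_id)
    return {"id": user_id, "name": "user-%d" % user_id}


def get_user(cache, user_id):
    """Cache-aside read: try the cache first, fall back to the database."""
    key = "user:%d" % user_id
    row = cache.get(key)
    if row is None:
        row = slow_db_lookup(user_id)
        cache.set(key, row, ttl=60)  # keep the row in memory for 60 seconds
    return row


cache = TTLCache()
first = get_user(cache, 42)   # miss: goes to the database
second = get_user(cache, 42)  # hit: served from memory
```

Only the first call reaches the database; the second is answered from the cache until the entry expires.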
More resources:
http://code.google.com/p/memcached/wiki/HowToLearnMoreScalability
http://www.slideshare.net/techdude/scalable-web-architectures-common-patterns-and-approaches
Source: http://mashraqi.com/2008/07/memcached-for-mysql-advanced-use-cases_09.html
Source: http://www.slideshare.net/northscale/moxi-memcached-proxy
HBase is a distributed column-oriented database built on top of HDFS.
HBase is ..
A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage.
Designed to operate on top of the Hadoop distributed file system (HDFS) or Kosmos File System (KFS, aka Cloudstore) for scalability, fault tolerance, and high availability.
Integrated into the Hadoop map-reduce platform and paradigm.
Benefits
Distributed storage
Table-like data structure (a multi-dimensional map)
High scalability
High availability
High performance
Who uses HBase
Adobe – internal use (structured data)
Kalooga – image search engine http://www.kalooga.com/
Meetup – community meetup site http://www.meetup.com/
Streamy – successfully migrated from MySQL to HBase http://www.streamy.com/
Trend Micro – cloud-based antivirus scanning architecture http://trendmicro.com/
Yahoo! – stores document fingerprints to avoid duplicates http://www.yahoo.com/
More - http://wiki.apache.org/hadoop/Hbase/PoweredBy
Backdrop
Started by Chad Walters and Jim
2006.11 Google releases paper on BigTable
2007.2 Initial HBase prototype created as Hadoop contrib
2007.10 First usable HBase
2008.1 Hadoop becomes an Apache top-level project and HBase becomes a subproject
2008.10~ HBase 0.18, 0.19 released
HBase Is Not …
Tables have one primary index, the row key.
No join operators.
Scans and queries can select a subset of available columns, perhaps by using a wildcard.
There are three types of lookups:
Fast lookup using row key and optional timestamp.
Full table scan.
Range scan from region start to end.
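The three lookup types can be illustrated with a small in-memory model. This is a sketch only, not the HBase client API: `SortedTable` is a hypothetical class that keeps row keys sorted, which is what makes the range scan cheap.

```python
import bisect


class SortedTable:
    """Toy table: rows sorted by key, each row holding timestamped versions."""

    def __init__(self):
        self._keys = []  # row keys, always kept in sorted order
        self._rows = {}  # row key -> {timestamp: value}

    def put(self, row_key, value, timestamp):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][timestamp] = value

    def get(self, row_key, timestamp=None):
        """Fast lookup by row key; newest version unless a timestamp is given."""
        versions = self._rows.get(row_key)
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]
        return versions.get(timestamp)

    def scan(self, start=None, stop=None):
        """Full scan when no bounds are given, else range scan over [start, stop)."""
        lo = 0 if start is None else bisect.bisect_left(self._keys, start)
        hi = len(self._keys) if stop is None else bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self.get(key)


table = SortedTable()
table.put(b"row1", b"v1", 1)
table.put(b"row3", b"v3", 1)
table.put(b"row2", b"v2-old", 1)
table.put(b"row2", b"v2-new", 2)

latest = table.get(b"row2")                  # lookup by row key
older = table.get(b"row2", timestamp=1)      # lookup by row key and timestamp
everything = list(table.scan())              # full table scan
part = list(table.scan(b"row2", b"row3"))    # range scan
```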
HBase Is Not …(2)
Limited atomicity and transaction support.
HBase supports multiple batched mutations of single rows only.
Data is unstructured and untyped.
Not accessed or manipulated via SQL.
Programmatic access via Java, REST, or Thrift APIs.
Scripting via JRuby.
Why Bigtable?
RDBMS performance is good for transaction processing, but for very large scale analytic processing the available solutions are commercial, expensive, and specialized.
Very large scale analytic processing:
Big queries – typically range or table scans.
Big databases (100s of TB).
Why Bigtable? (2)
MapReduce on Bigtable, optionally with Cascading on top to support some relational algebra, may be a cost-effective solution.
Sharding is not a solution for scaling open source RDBMS platforms:
Application specific.
Labor-intensive (re)partitioning.
Why HBase?
HBase is a Bigtable clone.
It is open source.
It has a good community and promise for the future.
It is developed on top of the Hadoop platform and integrates well with it, if you are using Hadoop already.
It has a Cascading connector.
HBase benefits over RDBMS
No real indexes
Automatic partitioning
Scale linearly and automatically with new nodes
Commodity hardware
Fault tolerance
Batch processing
Data Model
Tables are sorted by row.
The table schema only defines its column families.
Each family consists of any number of columns.
Each column consists of any number of versions.
Columns only exist when inserted; NULLs are free.
Columns within a family are sorted and stored together.
Everything except table names is byte[].
(Row, Family:Column, Timestamp) -> Value
Figure labels: Row key, Column Family, Value, Timestamp
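The (Row, Family:Column, Timestamp) -> Value mapping can be sketched as a nested dictionary. The `Table` class below is a hypothetical in-memory model, not HBase code; it only shows that families are fixed in the schema, columns appear when first written, and each cell keeps several timestamped versions.

```python
class Table:
    """Toy model of (row, family:column, timestamp) -> value."""

    def __init__(self, name, families):
        self.name = name
        self.families = frozenset(families)  # the only thing the schema fixes
        self._rows = {}  # row -> {(family, qualifier): {timestamp: value}}

    def put(self, row, column, timestamp, value):
        family, _, qualifier = column.partition(b":")
        if family not in self.families:
            raise KeyError("unknown column family: %r" % family)
        cells = self._rows.setdefault(row, {})
        cells.setdefault((family, qualifier), {})[timestamp] = value

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        family, _, qualifier = column.partition(b":")
        history = self._rows.get(row, {}).get((family, qualifier), {})
        return sorted(history.items(), reverse=True)[:versions]

    def rows(self):
        """Row keys in sorted order, as a scan would return them."""
        return sorted(self._rows)


blog = Table("blog", [b"info"])
blog.put(b"r1", b"info:title", 1, b"Hello")
blog.put(b"r1", b"info:title", 2, b"Hello v2")
blog.put(b"r0", b"info:author", 1, b"amy")

newest = blog.get(b"r1", b"info:title")
history = blog.get(b"r1", b"info:title", versions=2)
absent = blog.get(b"r0", b"info:title")  # never inserted, so nothing is stored
```

A column that was never written takes no space at all, which is what "NULLs are free" refers to.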
Members
Master
Responsible for monitoring region servers
Load balancing for regions
Redirects clients to the correct region servers
The current SPOF
Region servers (slaves)
Serve client requests (write/read/scan)
Send heartbeats to the Master
Throughput and the number of regions scale with the number of region servers
Regions
A table consists of one or more regions.
A region is specified by its startKey and endKey.
Each region may reside on several different nodes and is made up of several HDFS files and blocks; these are replicated by Hadoop.
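Because each region is defined by its startKey and endKey, finding the region for a row is a search over sorted start keys. A minimal sketch, assuming three made-up regions and hypothetical server names:

```python
import bisect

# Each region covers [start_key, end_key); b"" stands for "unbounded".
# The split points and server names are invented for illustration.
REGIONS = [
    (b"", b"g", "regionserver-1"),
    (b"g", b"p", "regionserver-2"),
    (b"p", b"", "regionserver-3"),
]
START_KEYS = [start for start, _end, _server in REGIONS]


def locate_region(row_key):
    """Return the region whose start key is the largest one <= row_key."""
    index = bisect.bisect_right(START_KEYS, row_key) - 1
    return REGIONS[index]


apple = locate_region(b"apple")
boundary = locate_region(b"g")   # a start key belongs to the region it opens
zebra = locate_region(b"zebra")
```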
Case study – Blog: logical data model
A Blog entry consists of the fields title, date, author, type, and text.
A User consists of fields such as username and password.
Each Blog entry can have many Comments.
Each comment consists of title, author, and text.
ERD
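Blog – HBase Table Schema
The row key is composed of the type (a 2-character abbreviation) and the timestamp.
Rows are therefore sorted first by type and then by timestamp, which makes it convenient to read the table data with scan().
The one-to-many relationship between BLOGENTRY and COMMENT is represented by a dynamic number of columns in the comment_title, comment_author, and comment_text column families.
Each column name is the timestamp of the comment, so the columns in each column family are automatically sorted by time.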
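The row key and column naming scheme of the blog schema can be sketched as follows. The type abbreviations "TX" and "PH" and the zero-padding of timestamps are assumptions made for this example (padding keeps string order equal to numeric order); the slides do not specify either.

```python
def row_key(entry_type, timestamp):
    """Two-character type abbreviation followed by a zero-padded timestamp."""
    if len(entry_type) != 2:
        raise ValueError("type abbreviation must be exactly 2 characters")
    return "%s%013d" % (entry_type, timestamp)


def add_comment(row, timestamp, title, author, text):
    """Store one comment as three columns named after the comment's timestamp."""
    qualifier = "%013d" % timestamp
    row["comment_title:" + qualifier] = title
    row["comment_author:" + qualifier] = author
    row["comment_text:" + qualifier] = text


# Row keys sort by type first, then by time within a type.
keys = sorted([
    row_key("PH", 1240148047497),
    row_key("TX", 1240148026198),
    row_key("PH", 1240148026198),
])

# Comments are inserted out of order but read back in time order.
entry = {}
add_comment(entry, 1240148040035, "Second", "bob", "Nice post")
add_comment(entry, 1240148026198, "First", "amy", "Hello")
titles = [entry[c] for c in sorted(entry) if c.startswith("comment_title:")]
```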
Architecture
ZooKeeper
HBase depends on ZooKeeper (Chapter 13) and by default it manages a ZooKeeper instance as the authority on cluster state
Operation
The -ROOT- table holds the list of .META. table regions.
The .META. table holds the list of all user-space regions.
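The -ROOT- and .META. tables form a lookup chain: -ROOT- points to a .META. region, and that .META. region points to the user region. The sketch below models this with plain lists; the (table, start row) tuples, region names, and server names are simplifications invented for illustration and do not match the real catalog row format.

```python
import bisect


def find_entry(entries, key):
    """Return the last (start_key, target) entry whose start_key <= key."""
    starts = [start for start, _target in entries]
    index = bisect.bisect_right(starts, key) - 1
    if index < 0:
        raise KeyError(key)
    return entries[index]


# -ROOT-: where each .META. region starts (hypothetical contents).
ROOT = [((b"", b""), "meta-1"), ((b"m", b""), "meta-2")]

# .META.: where each user-space region starts and which server holds it.
META = {
    "meta-1": [((b"blog", b""), "rs-a"), ((b"blog", b"m"), "rs-b")],
    "meta-2": [((b"users", b""), "rs-c")],
}


def locate(table, row):
    """Two-step lookup: -ROOT- -> .META. region -> server of the user region."""
    _start, meta_region = find_entry(ROOT, (table, row))
    (found_table, _start_row), server = find_entry(META[meta_region], (table, row))
    if found_table != table:
        raise KeyError(table)
    return meta_region, server


blog_low = locate(b"blog", b"apple")
blog_high = locate(b"blog", b"zebra")
users = locate(b"users", b"bob")
```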
Installation (1)
$ wget http://ftp.twaren.net/Unix/Web/apache/hadoop/hbase/hbase-0.20.3/hbase-0.20.3.tar.gz
$ sudo tar -zxvf hbase-*.tar.gz -C /opt/
$ sudo ln -sf /opt/hbase-0.20.3 /opt/hbase
$ sudo chown -R $USER:$USER /opt/hbase
$ sudo mkdir /var/hadoop/
$ sudo chmod 777 /var/hadoop
Start Hadoop…
Setup (1)
$ vim /opt/hbase/conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_CONF_DIR=/opt/hadoop/conf
export HBASE_HOME=/opt/hbase
export HBASE_LOG_DIR=/var/hadoop/hbase-logs
export HBASE_PID_DIR=/var/hadoop/hbase-pids
export HBASE_MANAGES_ZK=true
export HBASE_CLASSPATH=$HBASE_CLASSPATH:/opt/hadoop/conf
$ cd /opt/hbase/conf
$ cp /opt/hadoop/conf/core-site.xml ./
$ cp /opt/hadoop/conf/hdfs-site.xml ./
$ cp /opt/hadoop/conf/mapred-site.xml ./
Setup (2)
<configuration>
<property> <name> name </name> <value> value </value> </property>
</configuration>
Name                                   Value
hbase.rootdir                          hdfs://secuse.nchc.org.tw:9000/hbase
hbase.tmp.dir                          /var/hadoop/hbase-${user.name}
hbase.cluster.distributed              true
hbase.zookeeper.property.clientPort    2222
hbase.zookeeper.quorum                 Host1, Host2
hbase.zookeeper.property.dataDir       /var/hadoop/hbase-data
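Each name/value pair in the table becomes one <property> element inside <configuration>, in the file conf/hbase-site.xml. As a sketch, the file content could be generated like this; `build_site_xml` is a hypothetical helper, and Host1 and Host2 are the placeholder host names from the slide, written here as a comma-separated list without the space.

```python
import xml.etree.ElementTree as ET

# Values copied from the table; Host1/Host2 are placeholders, not real hosts.
SETTINGS = {
    "hbase.rootdir": "hdfs://secuse.nchc.org.tw:9000/hbase",
    "hbase.tmp.dir": "/var/hadoop/hbase-${user.name}",
    "hbase.cluster.distributed": "true",
    "hbase.zookeeper.property.clientPort": "2222",
    "hbase.zookeeper.quorum": "Host1,Host2",
    "hbase.zookeeper.property.dataDir": "/var/hadoop/hbase-data",
}


def build_site_xml(settings):
    """Turn a name -> value mapping into <configuration> XML text."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = build_site_xml(SETTINGS)
```

The resulting string would be saved as /opt/hbase/conf/hbase-site.xml, given the install path used earlier.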
Startup & Stop
Start / stop everything
$ bin/start-hbase.sh
$ bin/stop-hbase.sh
Start / stop individual daemons
$ bin/hbase-daemon.sh start/stop zookeeper
$ bin/hbase-daemon.sh start/stop master
$ bin/hbase-daemon.sh start/stop regionserver
$ bin/hbase-daemon.sh start/stop thrift
$ bin/hbase-daemon.sh start/stop rest
Testing (4)
$ hbase shell
> create 'test', 'data'
0 row(s) in 4.3066 seconds
> list
test
1 row(s) in 0.1485 seconds
> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
> scan 'test'
ROW COLUMN+CELL
row1 column=data:1, timestamp=1240148026198, value=value1
row2 column=data:2, timestamp=1240148040035, value=value2
row3 column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in 6.0426 seconds
> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in 0.0210 seconds
> list
0 row(s) in 2.0645 seconds
Connecting to HBase
Java client
get(byte [] row, byte [] column, long timestamp, int versions);
Non-Java clients
Thrift server hosting HBase client instance
Sample Ruby, C++, and Java (via Thrift) clients
REST server hosts HBase client
TableInput/OutputFormat for MapReduce
HBase as MR source or sink
HBase Shell
JRuby IRB with "DSL" to add get, scan, and admin
./bin/hbase shell YOUR_SCRIPT
Thrift
A software framework for scalable cross-language services development.
Developed by Facebook; it works seamlessly between C++, Java, Python, PHP, and Ruby.
The start command below starts the server instance, by default on port 9090.
The other similar project is "rest".
$ hbase-daemon.sh start thrift
$ hbase-daemon.sh stop thrift
References
<Trend Micro> Introduction to HBase http://www.wretch.cc/blog/trendnop09/21192672
Hadoop: The Definitive Guide, by Tom White
HBase Architecture 101 http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html