HBase in Alibaba Search Business Division

Upload: victorunique

Post on 18-Jul-2015

Self-introduction
- Hi, I'm Victor Xu, also known by the nickname Yu Tian within Alibaba Group.
- I've been working at Alibaba Group since I graduated from Huazhong University of Science and Technology, Wuhan, China, in 2009.
- I mainly focus on web spiders and HBase storage in the Search Technology Business Division of Alibaba Group.

Contacts
- Email: [email protected]: victorunique
- Wechat: victorunique
- Weibo:

Agenda
- HBase in Ali-Search
- Improvements & Maintenance
- Extensional Projects
- Future
- Q & A

HBase in Ali-Search


Who is using HBase?

Upgrade History
- Total number of nodes: from dozens to several thousand
- Nodes in a single cluster: from dozens to nearly one thousand
- Storage and computing were mixed together.

Scenario 1: Taobao Search
[Diagram labels: HBase; HQueue; Search Engine; Taobao MySQL; Inc Syncer (iStream); Total Syncer (M/R job); Inc Dump (iStream); Total Dump (M/R job)]
- Taobao data: billions of items
- Tmall data: hundreds of millions of items

Scenario 2: PORA
[Diagram labels: iStream on YARN; Pora sync; HQueue; HBase; auction; user; Pora user; Pora auction; User Log; Taobao MySQL; Search Engine]
- PORA: Personal Offline Real-time Analyze
- User log data: tens of billions of records per day
- Parameter Server for Distributed Machine Learning

Scenario 3: Web Crawling
[Diagram labels: HBase; WebFetcher; ETL; Scheduler; Data Mining; Inc Dump; Total Dump; Search Engine]
- Web data: tens of billions of pages

Improvements & Maintenance

Lower Disk I/O
- Generate HFiles directly on the node that holds the HDFS replica of the target region, then bulk load them with high locality, saving the I/O of major compaction. (HBASE-12596)
[Diagram labels: HDFS; Region A, Region B, Region C; BLK A1, BLK A2, BLK B1, BLK B2, BLK C1, BLK C2; HFileOutputFormat]

Limit Bandwidth
[Diagram labels: Region Server; Region A, Region B, Region C; Traffic Controller; Remote Read/Write Requests; Scan, Get, Put, Delete]
- New RPC Throttling Feature: https://issues.apache.org/jira/browse/HBASE-11598

Offline Region Merge
- The online region merge mechanism is so slow that we need another way to merge thousands (maybe tens of thousands) of regions at a time.

Incremental Trigger

IC (based on Observer) Features
- Writes a message to a designated HQueue when a Put matches the IC configuration.
- The message can be a raw rowkey or a JSON-formatted string with data from the original Put operation.
- The configuration supports a variety of operators, for example: STRING_EQ, STRING_NE, INT_LT, INT_GE, FLOAT_EQ, FLOAT_LE, FLOAT_GT.
- IC can be installed on or uninstalled from HTables dynamically, without restarting region servers or disabling tables.
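The matching step described in the IC features can be illustrated with a small, self-contained sketch. This is not the IC implementation: the rule fields (`column`, `operator`, `operand`, `format`), the function name `build_message`, and the exact semantics of each operator are assumptions made for illustration. Only the operator names come from the slide.

```python
import json
from typing import Callable, Dict, Optional

# Operator names come from the slide; the comparison semantics are assumed.
OPERATORS: Dict[str, Callable[[str, str], bool]] = {
    "STRING_EQ": lambda actual, operand: actual == operand,
    "STRING_NE": lambda actual, operand: actual != operand,
    "INT_LT": lambda actual, operand: int(actual) < int(operand),
    "INT_GE": lambda actual, operand: int(actual) >= int(operand),
    "FLOAT_EQ": lambda actual, operand: float(actual) == float(operand),
    "FLOAT_LE": lambda actual, operand: float(actual) <= float(operand),
    "FLOAT_GT": lambda actual, operand: float(actual) > float(operand),
}


def build_message(rowkey: str,
                  put_columns: Dict[str, str],
                  rule: Dict[str, str]) -> Optional[str]:
    """Return the message to enqueue, or None if the Put does not match.

    `rule` is a hypothetical configuration entry, not the real IC format.
    """
    value = put_columns.get(rule["column"])
    if value is None:
        return None  # the Put does not touch the configured column
    if not OPERATORS[rule["operator"]](value, rule["operand"]):
        return None  # the configured condition does not hold
    if rule.get("format") == "json":
        # JSON message carrying data from the original Put
        return json.dumps({"rowkey": rowkey, "columns": put_columns},
                          sort_keys=True)
    return rowkey  # raw rowkey message


if __name__ == "__main__":
    rule = {"column": "info:price", "operator": "INT_GE",
            "operand": "100", "format": "json"}
    print(build_message("item-1", {"info:price": "150"}, rule))
```

In a real deployment the returned message would be written to an HQueue from inside the coprocessor; here it is simply returned so the matching logic can be read in isolation.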

Region Split Policy
- Set a constant size limit for each column family; if a family has none, the region max size limit is used instead.
- A region split is triggered if any family reaches its size limit.
- The split point is determined by the family that exceeds its size limit by the largest proportion.
- For example:
  SizeOf(F1) = 5M, SizeOf(F2) = 15M, SizeOf(F3) = 10M
  LimitOf(F1) = 10M, LimitOf(F2) = 14M, LimitOf(F3) = 8M
  F2 and F3 exceed their limits (15/14 ≈ 1.07 and 10/8 = 1.25), so a split is triggered and F3 determines the split point.
[Diagram: Original Region Split Policy vs. New Region Split Policy, families F1, F2, F3]

Cluster Availability
- Our strategy for detecting and handling sick Region Servers.

Other Optimizations
- Enhanced simple balance strategy
- Enhanced rolling upgrade
- Customized TableInputFormat
- More Ganglia metrics for client requests
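The per-family rule from the Region Split Policy slide can be written out as a short sketch. The function name `pick_split_family` and the `region_max_size` value of 20 are hypothetical; the sizes and limits are the ones from the slide's example. The real policy lives inside HBase as a Java split policy, so this only illustrates the selection logic.

```python
from typing import Dict, Optional


def pick_split_family(sizes: Dict[str, float],
                      limits: Dict[str, float],
                      region_max_size: float) -> Optional[str]:
    """Return the family that should determine the split point.

    Returns None when no family has reached its limit, i.e. no split.
    Families without an explicit limit fall back to region_max_size.
    """
    best_family: Optional[str] = None
    best_ratio = 0.0
    for family, size in sizes.items():
        limit = limits.get(family, region_max_size)
        ratio = size / limit
        # A family qualifies once it reaches its limit (ratio >= 1);
        # among qualifying families, the largest ratio wins.
        if ratio >= 1.0 and ratio > best_ratio:
            best_family, best_ratio = family, ratio
    return best_family


if __name__ == "__main__":
    sizes = {"F1": 5, "F2": 15, "F3": 10}    # sizes from the slide (M)
    limits = {"F1": 10, "F2": 14, "F3": 8}   # limits from the slide (M)
    # region_max_size=20 is an assumed value; it is unused here because
    # every family has its own limit.
    print(pick_split_family(sizes, limits, region_max_size=20))
```

Comparing ratios rather than absolute sizes is what lets a small family with a tight limit drive the split instead of the largest family.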

Extensional Projects

Overview
- OpenTSDB: an open-source, distributed time series database
- Phoenix: a SQL skin over HBase
- HQueue: a distributed and persistent message-oriented middleware
- HTunnel: a WAL tracker and deliverer for HBase
[Diagram labels: User Applications; OpenTSDB; Phoenix; HQueue; HTunnel; HBase; HDFS]

OpenTSDB

Usage:
- Read/write request trends
- Application report graphs

Phoenix

Usage:
- HistoryServer & HStats
- Real-time SQL requests

What's HQueue?
HQueue is a distributed and persistent message-oriented middleware based on HBase.

- As a lightweight wrapper around HTable, HQueue can be upgraded seamlessly along with the HBase cluster.
- High performance for both reads and writes.
- Compressed rowkey.
- Messages are classified by topic (an HBase column qualifier).
- Supports a TTL mechanism (HBase's KeyValue-level TTL).

HQueue in HBase

- HQueue is based on HTable, so it also supports auto-failover, multiple partitions (HTable regions), and load balancing.
- It's a persistent storage system (HBase HLog and HDFS append).
- Customized compaction.
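The HQueue properties listed on these slides (one partition per HTable region, topics stored as column qualifiers, TTL-based expiry) can be pictured with a toy in-memory model. Everything here is an assumption for illustration: the class `ToyQueuePartition`, the use of a sequence number in place of HQueue's compressed rowkey (whose layout the slides do not describe), and the `after_seq` checkpoint used for reading. It does not talk to HBase.

```python
import time
from typing import List, Optional, Tuple


class ToyQueuePartition:
    """In-memory stand-in for one queue partition (an HTable region).

    Each stored cell is (sequence number, topic, payload, write time).
    The topic plays the role of the HBase column qualifier.
    """

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._seq = 0
        self._cells: List[Tuple[int, str, bytes, float]] = []

    def put(self, topic: str, payload: bytes,
            now: Optional[float] = None) -> int:
        """Append a message and return its sequence number."""
        now = time.time() if now is None else now
        self._seq += 1
        self._cells.append((self._seq, topic, payload, now))
        return self._seq

    def scan(self, topic: str, after_seq: int = 0,
             now: Optional[float] = None) -> List[Tuple[int, bytes]]:
        """Return unexpired messages of one topic newer than after_seq."""
        now = time.time() if now is None else now
        return [
            (seq, payload)
            for seq, cell_topic, payload, written in self._cells
            if cell_topic == topic
            and seq > after_seq
            and now - written < self.ttl_seconds  # TTL check at read time
        ]


if __name__ == "__main__":
    partition = ToyQueuePartition(ttl_seconds=10)
    partition.put("auction", b"m1", now=0)
    partition.put("user", b"m2", now=1)
    partition.put("auction", b"m3", now=5)
    print(partition.scan("auction", now=6))
    print(partition.scan("auction", after_seq=1, now=6))
```

A reader that remembers the last sequence number it consumed can resume from that point, which is one simple way to think about the subscription feature mentioned on the next slide.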

HQueue Subscription

- Supports subscription.

Why HTunnel?

- It will replace the Increment Coprocessor soon.

HTunnel DAG

Future
- HBase-1.X (Multi-WAL, HBASE-5699)
- HBase-2.X (HBase Read HA, HBASE-10070)
- Tiered Storage Support in HDFS (HDFS-2832)
- Phoenix with Merged Index (PHOENIX-1801)
