Hive introduction


Upload: ablozhou

Post on 02-Dec-2014


DESCRIPTION

Hive introduction: HQL, UDF, UDAF

TRANSCRIPT

Page 1: Hive introduction

Hive Introduction

Zhou Haihan (周海汉), 2013.4.18

Page 2: Hive introduction

Contents
• Hive overview
• Hive features
• HiveQL
• UDF
• Tips
• Discussion

Page 3: Hive introduction

Hive overview
• Official site: http://hive.apache.org/
• Latest version (as of April 2013): 0.10
• Contributed to Apache by Facebook

Page 4: Hive introduction

Hive modes
• Metadb (metastore database): embedded Derby database, MySQL, or other
• Local mode: Derby, one user, one job
• Distributed mode: MySQL, multi-user

Page 5: Hive introduction

Supported Hadoop versions
• Hadoop 0.20~
• Hadoop 0.23~

Page 6: Hive introduction

Hive features

Page 7: Hive introduction

Hive features
• Data warehouse
• HiveQL
• HDFS & HBase
• ^A-delimited rows (Ctrl-A, \001, is Hive's default field delimiter for text tables)
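To make the "^A-delimited rows" point concrete, here is a minimal Python sketch. It assumes a text-format table whose fields are separated by Ctrl-A (\x01) and whose rows end with a newline. The helper name parse_row and the sample values are hypothetical and do not come from the slides.

```python
# Sketch: split one line of a Hive text-format file into its fields.
# Assumes the default field delimiter Ctrl-A (\x01); parse_row and the
# sample row below are illustrative only.
FIELD_DELIMITER = "\x01"


def parse_row(line):
    """Return the list of field strings in one ^A-delimited line."""
    return line.rstrip("\n").split(FIELD_DELIMITER)


sample = "2013-03-28\x011001\x0142\n"
fields = parse_row(sample)
print(fields)
```

All fields come back as strings; in Hive the table schema decides how each one is interpreted.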

Page 8: Hive introduction

Hive features

Page 9: Hive introduction

HiveQL

Page 10: Hive introduction

HiveQL: a partial subset of SQL
• No UPDATE or DELETE statements.
• Each query reads tables from only one database.
• IN/EXISTS and HAVING clauses are not supported.
• ...

Page 11: Hive introduction

HiveQL: beyond SQL
• Complex data types
– struct
– array
– map
– ...

Page 12: Hive introduction

HiveQL: some built-in functions
• Statistics:
– sum, count, avg, min, max
– Population standard deviation: stddev_pop
– Sample standard deviation: stddev_samp
– Median: percentile
– Histogram: histogram_numeric
• Conditionals:
– if
– case
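The difference between the two standard deviation functions is easy to miss, so here is a small Python sketch using the standard statistics module as a stand-in. This is not Hive code, and the column values are hypothetical. It only illustrates that the population form divides by n while the sample form divides by n - 1, and that the median is the 0.5 percentile.

```python
import statistics

# Hypothetical values of one numeric column.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population standard deviation: divides by n, like stddev_pop.
pop_sd = statistics.pstdev(values)

# Sample standard deviation: divides by n - 1, like stddev_samp.
samp_sd = statistics.stdev(values)

# Median, comparable to asking percentile for the 0.5 quantile.
median = statistics.median(values)

print(pop_sd, samp_sd, median)
```

The sample form is always the larger of the two for the same data, and the gap shrinks as the number of rows grows.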

Page 13: Hive introduction

HiveQL: some built-in functions
• Date/time: year, date, unix_timestamp, ...
• Logical: and, or, not
• Operators: + - * / % | & ^ ~
• Math: round, floor, ceil, rand, exp, log, log2, pow, sqrt, hex, sin, ...
• String processing: trim, substr, length, split, get_json_object, parse_url, regexp_replace, regexp_extract
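For readers who know Python better than HiveQL, the following sketch shows rough Python counterparts of three of the string functions listed above. It is an analogy rather than Hive code: the URL, the text, and the JSON document are hypothetical, and edge-case behaviour (for example, what is returned when nothing matches) may differ from Hive.

```python
import json
import re
from urllib.parse import urlparse

# Roughly what parse_url(url, 'HOST') extracts.
url = "http://example.com/path?k=v"  # hypothetical URL
host = urlparse(url).hostname

# Roughly what regexp_extract(text, pattern, 2) extracts: capture group 2.
text = "order 100-200"  # hypothetical input
match = re.search(r"(\d+)-(\d+)", text)
second_group = match.group(2)

# Roughly what get_json_object(doc, '$.user.id') extracts. Hive returns
# the value as a string, while json.loads keeps the JSON number type.
doc = json.loads('{"user": {"id": 7}}')  # hypothetical document
user_id = doc["user"]["id"]

print(host, second_group, user_id)
```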

Page 14: Hive introduction

HiveQL example: creating an external text table on HDFS

CREATE EXTERNAL TABLE login(
  ldate string,
  userid int,
  proid int,
  imei string,
  sysver string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 'hdfs://h46:9000/flume/loginlog';

Page 15: Hive introduction

HiveQL example: HBase external table

CREATE EXTERNAL TABLE lordstat_pid(
  key string COMMENT 'from deserializer',
  total int COMMENT 'from deserializer',
  win int COMMENT 'from deserializer',
  spring int COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1',
  'hbase.columns.mapping'=':key,i:t,i:win,i:spr')
TBLPROPERTIES (
  'hbase.table.name'='lordstat_pid');

Page 16: Hive introduction

HiveQL example: partitions

hive> create external table glog1(ldate string, ltime string, threadid string, userid int) partitioned by (pdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

hive> alter table login add partition(ym='201303',d='28') LOCATION 'hdfs://h46:9000/flume/loginlog/201303/28/';

Page 17: Hive introduction

Hive UDF

Page 18: Hive introduction

Hive UDF
• Streaming
• UDF
• UDAF
• UDTF

Page 19: Hive introduction

Streaming
• Splitting strings with Python

import sys

def calcwin():
    for line in sys.stdin:
        (ldate, userid, roundbet, fold, allin, chipwon) = line.strip().split()
        # win is never assigned on the slide; how it is derived from the input fields is not shown.
        print('\t'.join(["%s:%s" % (ldate, userid), win, fold, allin]))

Page 20: Hive introduction

Streaming
• Usage looks like this:

hive> from testpoker select transform(ldate,ltime,threadid,gameid,userid,pid,roundbet,fold,allin,cardtype,cards,chipwon) using 'calcpoker.py' as (ldate,gameid,userid,pid,win,fold,allin,cardtype,cards);
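With TRANSFORM, Hive writes each input row to the script's standard input as a tab-separated line and reads tab-separated lines back from its standard output. The sketch below simulates that contract locally with in-memory streams, which can be a convenient way to check a script's logic before running it in Hive. The function names, the six-column layout, and above all the rule used to compute win are assumptions made for illustration; the slides do not show how the real script derives win.

```python
import io


def transform_line(line):
    """Map one tab-separated input row to one tab-separated output row."""
    ldate, userid, roundbet, fold, allin, chipwon = line.rstrip("\n").split("\t")
    # ASSUMPTION: a positive chipwon counts as a win. The slides do not
    # show how the real script computes this value.
    win = "1" if int(chipwon) > 0 else "0"
    return "\t".join(["%s:%s" % (ldate, userid), win, fold, allin])


def run(stdin, stdout):
    """Process rows the way a streaming script does: stdin in, stdout out."""
    for line in stdin:
        stdout.write(transform_line(line) + "\n")


# Simulate the rows that would be piped in, using in-memory streams.
fake_input = io.StringIO(
    "2013-03-28\t1001\t50\t0\t1\t300\n"
    "2013-03-28\t1002\t50\t1\t0\t-50\n"
)
fake_output = io.StringIO()
run(fake_input, fake_output)
result = fake_output.getvalue()
print(result, end="")
```

In a real script, run(sys.stdin, sys.stdout) would replace the in-memory streams.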

Page 21: Hive introduction

UDF

package org.zhouhh;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFTest extends UDF {
    // Returns the length of the string, or null for null input.
    public Integer evaluate(String s) {
        if (s == null) { return null; }
        return s.length();
    }
}

Page 22: Hive introduction

UDF

add jar /path/testudf.jar;
CREATE TEMPORARY FUNCTION testlength AS 'org.zhouhh.UDFTest';
SELECT testlength(src.value) FROM src;

Page 23: Hive introduction

UDAF
• User-Defined Aggregation Function

package org.zhouhh;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class UDAFCount extends UDAF {
    public static class Evaluator implements UDAFEvaluator {
        private int mCount;
        // Reset the state before a new aggregation starts.
        public void init() { mCount = 0; }
        // Called once per input row; counts non-null values.
        public boolean iterate(Object o) {
            if (o != null) mCount++;
            return true;
        }
        // Partial result handed from one task to the next stage.
        public Integer terminatePartial() { return mCount; }
        // Combine a partial result from another task.
        public boolean merge(Integer o) {
            if (o != null) mCount += o;
            return true;
        }
        // Final result of the aggregation.
        public Integer terminate() { return mCount; }
    }
}
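The evaluator methods are called in a fixed order: init, then iterate for each row, then terminatePartial on each task, then merge and terminate where the partial results are combined. The Python sketch below mimics that lifecycle for the counting example. It is a stand-in for illustration, not Hive's Java API; the class name, the aggregate helper, and the sample splits are hypothetical.

```python
class CountEvaluator:
    """Python stand-in for the Evaluator class in the counting example."""

    def __init__(self):
        self.init()

    def init(self):
        # Reset the state before a new aggregation starts.
        self.count = 0

    def iterate(self, value):
        # Called once per input row; only non-null values are counted.
        if value is not None:
            self.count += 1
        return True

    def terminate_partial(self):
        # Partial result produced by one task.
        return self.count

    def merge(self, partial):
        # Fold in a partial result produced by another task.
        if partial is not None:
            self.count += partial
        return True

    def terminate(self):
        # Final result of the aggregation.
        return self.count


def aggregate(splits):
    """Run one evaluator per split, then merge the partial results."""
    partials = []
    for rows in splits:
        evaluator = CountEvaluator()
        for value in rows:
            evaluator.iterate(value)
        partials.append(evaluator.terminate_partial())
    final = CountEvaluator()
    for partial in partials:
        final.merge(partial)
    return final.terminate()


# Two hypothetical input splits, with None standing in for SQL NULL.
total = aggregate([[1, None, 3], [None, 5]])
print(total)
```

Keeping the partial result small (here a single integer) is what lets the partial aggregation reduce the data moved between stages.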

Page 24: Hive introduction

UDAF

add jar /path/testudaf.jar;
CREATE TEMPORARY FUNCTION testcount AS 'org.zhouhh.UDAFCount';
SELECT testcount(src.id) FROM src;

Page 25: Hive introduction

UDTF
• User-Defined Table-Generating Functions
• Address the need to turn one input row into multiple output rows (one-to-many mapping)

Page 26: Hive introduction

UDTF
• Extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF.
• Implement three methods: initialize, process, close.
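The division of labour between the three methods can be sketched in Python. This is a stand-in to show the call order and the one-to-many behaviour, not Hive's Java API: the class name, the forward callback passed to the constructor, and the output column names are assumptions made for illustration.

```python
class ExplodeMapSketch:
    """Python stand-in for a GenericUDTF subclass that explodes a map."""

    def __init__(self, forward):
        # Callback that receives each output row (hypothetical wiring).
        self.forward = forward

    def initialize(self):
        # Called once before any rows: declare the output columns.
        return ["col1", "col2"]

    def process(self, properties):
        # Called once per input row; may forward zero or more output rows.
        for key, value in properties.items():
            self.forward((key, value))

    def close(self):
        # Called after the last row; nothing to flush in this sketch.
        pass


rows = []
udtf = ExplodeMapSketch(rows.append)
columns = udtf.initialize()
udtf.process({"a": "1", "b": "2"})  # one input row, two output rows
udtf.process({})                    # one input row, no output rows
udtf.close()
print(columns, rows)
```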

Page 27: Hive introduction

UDTF
• Usage
• 1. Directly in SELECT: no other columns can be selected alongside it, and GROUP BY, SORT BY, etc. cannot be used.

select explode_map(properties) as (col1,col2) from src;

• 2. With LATERAL VIEW

select src.id, mytable.col1, mytable.col2 from src lateral view explode_map(properties) mytable as col1, col2;

Page 28: Hive introduction

Tips

Page 29: Hive introduction

Tips
• struct
• Sample data:

1015826235 [{"product_id":220003038067,"timestamps":"1340321132000"},{"product_id":300003861266,"timestamps":"1340271857000"}]

Page 30: Hive introduction

Tips

CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable
(
  USER_ID BIGINT,
  NEW_ITEM ARRAY<STRUCT<PRODUCT_ID:BIGINT,TIMESTAMPS:STRING>>)

Page 31: Hive introduction

Tips

SELECT
  user_id,
  prod_and_ts.product_id as product_id,
  prod_and_ts.timestamps as timestamps
FROM
  SampleTable
  LATERAL VIEW explode(new_item) exploded_table as prod_and_ts;

Page 32: Hive introduction

Tips

USER_ID    | PRODUCT_ID   | TIMESTAMPS
-----------+--------------+--------------
1015826235 | 220003038067 | 1340321132000
1015826235 | 300003861266 | 1340271857000
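The effect of LATERAL VIEW explode on an array of structs can be reproduced in a few lines of Python, using the sample row from the struct tip. This is a sketch of the semantics only, not of how Hive executes the query; the function name lateral_view_explode is hypothetical.

```python
import json

# The sample input row: a user_id plus an array of structs.
user_id = 1015826235
new_item = json.loads(
    '[{"product_id":220003038067,"timestamps":"1340321132000"},'
    '{"product_id":300003861266,"timestamps":"1340271857000"}]'
)


def lateral_view_explode(user_id, items):
    """Yield one output row per array element, repeating user_id."""
    for item in items:
        yield (user_id, item["product_id"], item["timestamps"])


rows = list(lateral_view_explode(user_id, new_item))
for row in rows:
    print(*row)
```

Each array element becomes its own output row, and the non-array column is repeated on every one of them.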

Page 33: Hive introduction

Discussion ... Thank you!

http://abloz.com
2013.4.18

@abloz