Hive Introduction
DESCRIPTION
hive introduction, hql, udf, udaf

TRANSCRIPT
Hive Introduction
周海汉 2013.4.18
Contents
• Hive Overview
• Hive Features
• HiveQL
• UDF
• Tips
• Discussion
Hive Overview
• Official site: http://hive.apache.org/
• Latest version: 0.10
• Contributed by Facebook to Apache
Hive Modes
• Metastore DB: embedded Derby database, MySQL, or others
• Local mode: Derby, one user, one job
• Distributed mode: MySQL, multiple users
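For distributed mode with a MySQL metastore, the connection is typically configured in hive-site.xml along these lines (the host name, database name, and credentials below are placeholders, not values from the deck):

```xml
<configuration>
  <!-- JDBC URL of the MySQL database that backs the metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metahost:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepass</value>
  </property>
</configuration>
```

With these properties in place, multiple Hive clients can share one metastore, which is what enables the multi-user distributed mode above.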
Supported Hadoop Versions
• hadoop 0.20~
• hadoop 0.23~
Hive Features
Hive Features
• Data warehouse
• HiveQL
• HDFS & HBase
• Rows delimited by ^A (Ctrl-A)
HiveQL
HiveQL - a Subset of SQL
• No UPDATE or DELETE statements
• Each query reads tables from only one database
• No support for IN/EXISTS subqueries or the HAVING clause
• ...
HiveQL - Beyond SQL
• Complex data types
• struct
• array
• map
• ...
HiveQL - Some Built-in Functions
• Statistics:
– sum, count, avg, min, max
– population standard deviation: stddev_pop
– sample standard deviation: stddev_samp
– percentile function: percentile
– histogram: histogram_numeric
• Conditionals:
– if
– case
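The difference between stddev_pop and stddev_samp is only the divisor (N versus N-1), which a small Python sketch makes concrete:

```python
import math

def stddev_pop(xs):
    # population standard deviation: divide the squared deviations by N
    m = sum(xs) / float(len(xs))
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def stddev_samp(xs):
    # sample standard deviation: divide by N - 1 (Bessel's correction)
    m = sum(xs) / float(len(xs))
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(stddev_pop(data))   # 2.0
print(stddev_samp(data))  # ~2.138
```

On a large table the two values are nearly identical; stddev_samp matters when aggregating over small groups.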
HiveQL - Some Built-in Functions
• Dates/times: year, date, unix_timestamp, ...
• Logical: and, or, not
• Operators: + - * / % | & ^ ~
• Math: round, floor, ceil, rand, exp, log, log2, pow, sqrt, hex, sin, ...
• String handling: trim, substr, length, split, get_json_object, parse_url, regexp_replace, regexp_extract
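The semantics of two of the string functions can be sketched in Python (this is an illustrative approximation, not Hive's implementation; the real get_json_object supports full JSONPath and returns strings, while this sketch handles only top-level `$.key` paths):

```python
import json
import re

def regexp_extract(s, pattern, idx):
    # return capture group idx of the first match, or '' when there is
    # no match, mirroring Hive's documented behavior
    m = re.search(pattern, s)
    return m.group(idx) if m else ''

def get_json_object(s, path):
    # tiny sketch: parse the JSON string and look up one top-level key
    obj = json.loads(s)
    key = path.lstrip('$.')
    return obj.get(key)

print(regexp_extract('foothebar', 'foo(.*?)(bar)', 2))  # bar
print(get_json_object('{"userid": 42}', '$.userid'))    # 42
```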
HiveQL Example - External Table over HDFS Text Files
CREATE EXTERNAL TABLE login(
  ldate string,
  userid int,
  proid int,
  imei string,
  sysver string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION
  'hdfs://h46:9000/flume/loginlog';
HiveQL Example - HBase External Table
CREATE EXTERNAL TABLE lordstat_pid(
  key string COMMENT 'from deserializer',
  total int COMMENT 'from deserializer',
  win int COMMENT 'from deserializer',
  spring int COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1',
  'hbase.columns.mapping'=':key,i:t,i:win,i:spr')
TBLPROPERTIES (
  'hbase.table.name'='lordstat_pid');
HiveQL Example - Partitions
hive> create external table glog1(ldate string, ltime string, threadid string, userid int) partitioned by (pdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
hive> alter table login add partition(ym='201303', d='28') LOCATION 'hdfs://h46:9000/flume/loginlog/201303/28/';
Hive UDF
• Streaming
• UDF
• UDAF
• UDTF
streaming
• Splitting a string in Python

import sys

def calcwin():
    for line in sys.stdin:
        (ldate, userid, roundbet, fold, allin, chipwon) = line.strip().split()
        # assumption: win is derived from chipwon (chipwon > 0 means a win)
        win = '1' if int(chipwon) > 0 else '0'
        print('\t'.join(["%s:%s" % (ldate, userid), win, fold, allin]))
streaming
• Usage is along these lines:
hive> from testpoker select transform(ldate, ltime, threadid, gameid, userid, pid, roundbet, fold, allin, cardtype, cards, chipwon) using 'calcpoker.py' as (ldate, gameid, userid, pid, win, fold, allin, cardtype, cards);
UDF

public class UDFTest extends UDF {
    public Integer evaluate(String s) {
        if (s == null) { return null; }
        return s.length();
    }
}
UDF
add jar /path/testudf.jar;
CREATE TEMPORARY FUNCTION testlength AS 'org.zhouhh.UDFTest';
SELECT testlength(src.value) FROM src;
UDAF
• User-Defined Aggregation Function

public class UDAFCount extends UDAF {
    public static class Evaluator implements UDAFEvaluator {
        private int mCount;

        public void init() { mCount = 0; }

        public boolean iterate(Object o) {
            if (o != null) mCount++;
            return true;
        }

        public Integer terminatePartial() { return mCount; }

        public boolean merge(Integer o) {
            mCount += o;
            return true;
        }

        public Integer terminate() { return mCount; }
    }
}
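The evaluator lifecycle (init, iterate, terminatePartial on each mapper; merge, terminate on the reducer) can be simulated in Python to see how partial counts combine; this is a sketch of the control flow, not Hive code:

```python
class CountEvaluator(object):
    # mirrors the UDAFCount evaluator above
    def init(self):
        self.count = 0

    def iterate(self, o):
        # count non-null values, like Hive's count()
        if o is not None:
            self.count += 1
        return True

    def terminate_partial(self):
        return self.count

    def merge(self, other):
        self.count += other
        return True

    def terminate(self):
        return self.count

def run(values):
    # two simulated "mappers" each aggregate half of the data
    partials = []
    for split in (values[:2], values[2:]):
        ev = CountEvaluator()
        ev.init()
        for v in split:
            ev.iterate(v)
        partials.append(ev.terminate_partial())
    # one simulated "reducer" merges the partial results
    final = CountEvaluator()
    final.init()
    for p in partials:
        final.merge(p)
    return final.terminate()

print(run([1, None, 3, 4]))  # 3
```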
UDAF
add jar /path/testudaf.jar;
CREATE TEMPORARY FUNCTION testcount AS 'org.zhouhh.UDAFCount';
SELECT testcount(src.id) FROM src;
UDTF
• User-Defined Table-Generating Functions
• Solve the one-to-many mapping problem: one input row produces multiple output rows
UDTF
• Extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF.
• Implement the three methods initialize, process, and close.
UDTF
• Usage
• 1. Standalone in the select list: no other columns may be selected, and no group by, sort by, etc.
select explode_map(properties) as (col1, col2) from src;
• 2. With lateral view:
select src.id, mytable.col1, mytable.col2 from src lateral view explode_map(properties) mytable as col1, col2;
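The one-to-many behavior of a UDTF under lateral view can be sketched in Python (explode_map here is the hypothetical UDTF from the example above, not a Hive built-in):

```python
def explode_map(properties):
    # one input map yields one output row per key/value pair
    for k in sorted(properties):
        yield (k, properties[k])

def lateral_view(src_rows, udtf):
    # lateral view semantics: each source row is joined with every
    # row the UDTF generates from that row's map column
    for row in src_rows:
        for generated in udtf(row['properties']):
            yield (row['id'],) + generated

src = [{'id': 1, 'properties': {'color': 'red', 'size': 'L'}}]
print(list(lateral_view(src, explode_map)))
# [(1, 'color', 'red'), (1, 'size', 'L')]
```

This is why form 2 can still select src.id: the lateral view repeats the source row's columns alongside each generated row.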
Tips
Tips
• struct
1015826235 [{"product_id":220003038067,"timestamps":"1340321132000"},{"product_id":300003861266,"timestamps":"1340271857000"}]
Tips
CREATE EXTERNAL TABLE IF NOT EXISTS SampleTable
(
  USER_ID BIGINT,
  NEW_ITEM ARRAY<STRUCT<PRODUCT_ID:BIGINT,TIMESTAMPS:STRING>>)
Tips
SELECT
  user_id,
  prod_and_ts.product_id as product_id,
  prod_and_ts.timestamps as timestamps
FROM
  SampleTable
  LATERAL VIEW explode(new_item) exploded_table as prod_and_ts;
Tips
USER_ID    | PRODUCT_ID   | TIMESTAMPS
-----------+--------------+--------------
1015826235 | 220003038067 | 1340321132000
1015826235 | 300003861266 | 1340271857000