
Post on 05-Sep-2019


Apache Kylin 2.0: Evolving from Traditional OLAP to a Real-Time Data Warehouse

Ma Hongbin (马洪宾) | [email protected]

PMC member of Apache Kylin
Technical Partner & Senior Architect, Kyligence Inc.

All rights reserved © Kyligence Inc. http://kyligence.io


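Apache Kylin History

- Sep 2013: Project started
- Oct 2014: Open-sourced
- Nov 2014: Joined the Apache Incubator
- Sep 2015: InfoWorld Bossie Award, Best Open Source Big Data Tool
- Nov 2015: Graduated to an Apache Top-Level Project
- Mar 2016: Kyligence founded
- Aug 2016: Commercial edition KAP released
- Sep 2016: InfoWorld Bossie Award, Best Open Source Big Data Tool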


About Kyligence

• Kyligence's vision is to unleash big data productivity for everyone's analytics needs.

• The company was founded by the team who created Apache Kylin™, a top open source OLAP engine built for interactive analytics at petabyte-scale data on Hadoop. Kyligence is the primary contributor to the open source Kylin project globally.

• Kyligence provides a leading intelligent data platform to simplify big data analytics from on-premises to cloud.


Apache Kylin Use Cases Worldwide


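What Is Apache Kylin

An OLAP engine on Hadoop

[Diagram: BI visualization (Interactive Reporting, Dashboard) on top of Apache Kylin, which runs on Hadoop (Hive, HBase, HDFS)]

- 3 trillion rows with query latency under 1 second at Toutiao, the No. 1 news app in China
- A cube with 60+ dimensions at China Pacific Insurance, one of China's three largest insurers
- JDBC / ODBC / REST API
- BI integration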
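As an illustration of the REST API entry point, the sketch below only builds (and does not send) a request for Kylin's query endpoint, POST /kylin/api/query. The host, port, project name, table name, and credentials are placeholders chosen for the example, and `build_query_request` is a hypothetical helper, not part of Kylin.

```python
import base64
import json
from urllib.request import Request


def build_query_request(base_url, project, sql, user, password, limit=100):
    """Build a POST request for Kylin's /kylin/api/query endpoint (not sent)."""
    body = json.dumps({
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }).encode("utf-8")
    # HTTP Basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        url=f"{base_url}/kylin/api/query",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder values: adjust to the actual deployment.
request = build_query_request(
    "http://localhost:7070", "learn_kylin",
    "select count(*) from kylin_sales", "ADMIN", "KYLIN",
)
```

Passing `request` to `urllib.request.urlopen` would submit the query to a running Kylin instance; BI tools would normally go through the JDBC or ODBC drivers instead.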


Apache Kylin in the Zoo

[Diagram: BI tools, web apps, and other clients query Kylin with ANSI SQL; Kylin builds its cubes offline (Offline Cubing)]


Why Kylin Is Fast

A sample query: report revenue by "returnflag" and "orderstatus"

select l_returnflag, o_orderstatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price …
from v_lineitem inner join v_orders on l_orderkey = o_orderkey
where l_shipdate <= '1998-09-16'
group by l_returnflag, o_orderstatus
order by l_returnflag, o_orderstatus;

Query plan, from top to bottom: Sort, Aggr., Filter, Join, Tables. Cost: O(N).


Why Kylin Is Fast

Pre-calculate the Kylin Cube

- Without pre-calculation: Sort, Aggr., Filter, Join, Tables. Cost: O(N).
- With the cube: Sort, Filter, Cuboid. Cost: O(flag × status × days) = O(1).
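The comparison above can be sketched with a toy example. This is not Kylin's implementation: the rows, the `build_cuboid` and `query` helpers, and the day-level granularity are assumptions made for illustration. The point is that the query loops over cuboid entries (at most flag × status × days of them), not over the source rows.

```python
from collections import defaultdict

# Hypothetical joined rows: (returnflag, orderstatus, shipdate, quantity).
ROWS = [
    ("A", "F", "1998-09-01", 10),
    ("A", "F", "1998-09-01", 5),
    ("N", "O", "1998-09-10", 7),
    ("N", "O", "1998-09-20", 3),  # excluded by the date filter below
    ("R", "F", "1998-08-15", 4),
]


def build_cuboid(rows):
    """Pre-aggregate once: sum(quantity) per (flag, status, day)."""
    cuboid = defaultdict(int)
    for flag, status, day, qty in rows:
        cuboid[(flag, status, day)] += qty
    return dict(cuboid)


def query(cuboid, max_day):
    """Answer the sample query from the cuboid: filter, aggregate, sort."""
    result = defaultdict(int)
    for (flag, status, day), qty in cuboid.items():
        if day <= max_day:  # ISO dates compare correctly as strings
            result[(flag, status)] += qty
    return sorted(result.items())


cuboid = build_cuboid(ROWS)
answer = query(cuboid, "1998-09-16")
```

Adding more source rows for the same flags, statuses, and days makes `build_cuboid` slower but leaves the size of `cuboid`, and so the cost of `query`, unchanged.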


Pre-calculation Is the Key to Kylin

- 0-D (apex) cuboid: all
- 1-D cuboids: time; item; location; supplier
- 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
- 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
- 4-D (base) cuboid: (time, item, location, supplier)

- Based on cube theory
- The Model and the Cube define the scope of pre-calculation
- The Build Engine executes the build jobs
- The Query Engine answers queries on top of the pre-calculated results
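The cuboid lattice for the four dimensions on this slide can be enumerated in a few lines. This is a minimal sketch; `cuboids_by_level` is a name invented for the example. It shows why the number of cuboids grows as 2^n with the number of dimensions.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboids_by_level(dimensions):
    """Return {k: [cuboids with k dimensions]} for k = 0..len(dimensions)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboids_by_level(DIMENSIONS)
total = sum(len(level) for level in lattice.values())  # 2 ** 4 cuboids
```

Level 0 holds the apex cuboid (the empty tuple) and level 4 holds the base cuboid with all four dimensions.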


O(1) Complexity

[Chart: response time versus data size. Online calculation grows as O(N); Apache Kylin stays at O(1).]


Near-Real-Time Streaming

Building cubes with minute-level latency


New in Kylin 1.6

[Diagram: the Kylin architecture (BI tools, web apps, and other clients; ANSI SQL; Kylin; Offline Cubing) with one component marked "New"]


Thanks. See you at our next meeting.


Demo of Twitter Analysis

http://hub.kyligence.io

Incremental build triggers every 2 minutes; each build finishes in 3 minutes.

- 8-node cluster on AWS, 3 Kafka brokers
- Twitter sample feed, 10+ K messages per second
- Cube has 9 dimensions and 3 measures
- 2 jobs running at the same time
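The demo figures fit together with some simple arithmetic. The sketch below is a back-of-the-envelope check, not a measurement: the message rate uses the lower bound of "10+ K", and the worst-case latency assumes a message arrives right after a trigger.

```python
import math

TRIGGER_INTERVAL_MIN = 2   # a new incremental build starts every 2 minutes
BUILD_DURATION_MIN = 3     # each build takes about 3 minutes
MESSAGES_PER_SEC = 10_000  # lower bound taken from "10+ K messages per second"

# Builds overlap whenever one takes longer than the trigger interval.
concurrent_jobs = math.ceil(BUILD_DURATION_MIN / TRIGGER_INTERVAL_MIN)

# Messages arriving during one trigger interval (the input of one build).
messages_per_build = MESSAGES_PER_SEC * TRIGGER_INTERVAL_MIN * 60

# Worst case: a message waits a full interval for the next trigger,
# then waits for that build to finish.
worst_case_latency_min = TRIGGER_INTERVAL_MIN + BUILD_DURATION_MIN
```

The result of two overlapping builds matches the "2 jobs running at the same time" bullet, and the worst-case delay stays within a few minutes.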


Spark Cubing: Cutting Build Time in Half


Layered Cubing

The standard build algorithm: Layered Cubing

- Launches multiple rounds of MR jobs
- Splits a large shuffle into multiple stages
- Stable, but not optimal in build time
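The layer-by-layer idea can be sketched in plain Python, with one loop iteration standing in for one round of jobs. This is a toy model under stated assumptions, not Kylin's code: dimensions are identified by index, the only measure is a SUM, and the rule in `children` (drop only dimensions with a higher index than any dropped so far) is one simple way to give every cuboid exactly one parent so that nothing is counted twice.

```python
from collections import defaultdict


def children(cuboid, n_dims):
    """Child cuboids of `cuboid`, each obtained by dropping one dimension.

    A dimension is dropped only if its index is higher than every index
    already dropped, so each cuboid is produced by exactly one parent.
    """
    dropped = [d for d in range(n_dims) if d not in cuboid]
    start = max(dropped) + 1 if dropped else 0
    return [tuple(x for x in cuboid if x != d) for d in cuboid if d >= start]


def layered_cubing(rows, n_dims):
    """Return a list of layers; each maps (cuboid, key) to a summed measure."""
    base = tuple(range(n_dims))
    first = defaultdict(int)
    for key, measure in rows:
        first[(base, key)] += measure
    layers = [dict(first)]
    for _ in range(n_dims):  # one round per remaining layer
        nxt = defaultdict(int)
        for (cuboid, key), value in layers[-1].items():
            for child in children(cuboid, n_dims):
                # Keep only the values of the dimensions the child retains.
                child_key = tuple(v for d, v in zip(cuboid, key) if d in child)
                nxt[(child, child_key)] += value
        layers.append(dict(nxt))
    return layers


ROWS = [(("a", "x"), 1), (("a", "y"), 2), (("b", "x"), 4)]
layers = layered_cubing(ROWS, n_dims=2)
```

With n dimensions the build takes n + 1 rounds, and each round shuffles only the output of the previous one, which is the stability-versus-speed trade-off the bullets describe.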


In-memory Cubing

In-memory Cubing is a strong complement to Layered Cubing

- Triggered only under certain conditions
- Not suitable for every scenario
- Once triggered, it usually performs better


Cubing with MR: Summary

Fairly stable

Layered Cubing performance leaves room for improvement in some scenarios

In-mem Cubing applies to a limited set of scenarios

The community is eager to try other technologies to speed up cubing


Introduction to Apache Spark

Apache Spark is an open-source cluster-computing framework, which provides programmers with an application programming interface centered on a data structure called RDD.

Spark was developed in response to limitations in the MapReduce cluster computing paradigm.

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.


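Spark Cubing: A Failed Attempt

Kylin 1.5 experimented with Spark Cubing, but it was never officially released

- It simply ported In-memory Cubing to Spark
- It computed the whole cube in a single round of RDD transformations
- No significant improvement was observed
- The computation on Spark was not noticeably different from MR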


Spark Cubing in 2.0

[Diagram: RDD-1, RDD-2, RDD-3, RDD-4, RDD-5]

Kylin 2.0 rebuilt Spark Cubing on top of the Layered Cubing algorithm

- The cuboids of each layer are treated as one RDD
- The parent RDD is cached in memory as much as possible
- Each RDD is exported to a sequence file
- Most of the code can be reused by replacing "map" with "flatMap" and "reduce" with "reduceByKey"
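The last bullet can be illustrated without a cluster. In the sketch below, `flat_map` and `reduce_by_key` are local stand-ins written for this example (they are not Spark), and the parent layer is a made-up (time, item) cuboid with a SUM measure. One parent record emits one record per child cuboid, which is why a flatMap-style step is needed, and records that share a key are then merged.

```python
from operator import add


def flat_map(func, records):
    """Local stand-in for a flatMap step: one input record, many outputs."""
    return [out for record in records for out in func(record)]


def reduce_by_key(func, pairs):
    """Local stand-in for a reduceByKey step: merge values sharing a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Parent layer: records of the (time, item) cuboid with a SUM measure.
parent_layer = [
    ((("time", "2017-01"), ("item", "pen")), 3),
    ((("time", "2017-01"), ("item", "ink")), 4),
    ((("time", "2017-02"), ("item", "pen")), 5),
]


def emit_children(record):
    """Emit one record per child cuboid by dropping one dimension."""
    key, value = record
    return [(key[:i] + key[i + 1:], value) for i in range(len(key))]


child_layer = reduce_by_key(add, flat_map(emit_children, parent_layer))
```

In PySpark the same two steps would be written as `rdd.flatMap(emit_children).reduceByKey(add)`. Here the parent is the base cuboid, so no child is produced twice; a full build also needs to assign each child cuboid to a single parent.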


DAG for Computing the Third-Layer Cuboids


Performance Test

• Environment
  - 4-node Hadoop cluster; each node has 28 GB RAM and 12 cores
  - YARN has 48 GB RAM and 30 cores in total
  - CDH 5.8, Apache Kylin 2.0 beta

• Spark
  - Spark 1.6.3 on YARN
  - 6 executors, each with 4 cores and 4 GB + 1 GB (overhead) memory

• Test Data
  - Airline data, 160 million rows in total
  - Cube: 10 dimensions, 5 measures (SUM)

• Test Scenarios
  - Build the cube at different source data scales: 3 million, 50 million, and 160 million source rows
  - Compare the build time of MapReduce (both by-layer and in-mem) with Spark. No compression
  - The time covers only the cube-building step, not the preparation and subsequent steps
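A quick check of how much of the YARN capacity the Spark executors occupy in this setup. The numbers come from the slide; the calculation ignores the Spark driver and any other YARN applications, which is an assumption of this sketch.

```python
EXECUTORS = 6
CORES_PER_EXECUTOR = 4
MEMORY_PER_EXECUTOR_GB = 4 + 1  # executor memory plus overhead

YARN_TOTAL_CORES = 30
YARN_TOTAL_MEMORY_GB = 48

spark_cores = EXECUTORS * CORES_PER_EXECUTOR          # cores used by executors
spark_memory_gb = EXECUTORS * MEMORY_PER_EXECUTOR_GB  # memory used by executors

core_share = spark_cores / YARN_TOTAL_CORES
memory_share = spark_memory_gb / YARN_TOTAL_MEMORY_GB
```

The executors take 24 of 30 cores and 30 of 48 GB, so the Spark job fits inside the YARN allocation with some headroom.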


Spark Cubing vs. MR Layered Cubing

Build Cube Time Comparison (build time in minutes)

Source data rows    MapReduce    Spark
3 million           6            3
50 million          19           8
160 million         29           17

70% to 130% improvement on Spark
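The improvement figure can be recomputed from the bar labels of the chart. The pairing of labels to bars is an assumption of this sketch (MapReduce: 6, 19, 29 minutes; Spark: 3, 8, 17 minutes), and "improvement" is taken here to mean the gain in build speed, MR time divided by Spark time minus one.

```python
# Build times in minutes, read off the bar labels (the pairing is assumed).
BUILD_MINUTES = {
    "3 million": {"mapreduce": 6, "spark": 3},
    "50 million": {"mapreduce": 19, "spark": 8},
    "160 million": {"mapreduce": 29, "spark": 17},
}


def improvement_pct(mr_minutes, spark_minutes):
    """Speed gain of Spark over MR, in percent."""
    return (mr_minutes / spark_minutes - 1) * 100


improvements = {
    scale: round(improvement_pct(t["mapreduce"], t["spark"]), 1)
    for scale, t in BUILD_MINUTES.items()
}
```

Under these assumptions the gains are 100.0%, 137.5%, and 70.6%, roughly in line with the range quoted on the slide given that the bar labels are rounded to whole minutes.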


Spark Cubing vs. MR In-mem Cubing


Conclusions

• The Layered Cubing algorithm is stable on both MR and Spark

• Compared with MR Layered Cubing, a 70% to 130% performance improvement is observed on Spark Cubing

• When the source data is sharded, Spark Cubing still performs close to MR in-mem cubing

• There’s still room for improvement


Snowflake Schema Support

Running the TPC-H benchmark


Kylin 1.0 Star Schema Limitation

Pre-calculate the join of 1 level of lookups

- Supports star schema only
- Does not allow columns with the same name from different tables
- Does not allow a table to join itself
- Difficult to support real-world business cases

[Diagram: fact table LINEORDER joined to lookup tables DATES, PART, CUSTOMER, and SUPPLIER]


Kylin 2.0 Snowflake Schema

Pre-calculate unlimited levels of lookups

- Snowflake schema support (KYLIN-1875)
- Allows a table to be joined multiple times
- Big metadata change at the Model level
- Many bug fixes regarding joins and sub-queries
- Supports complex models of any kind and flexible queries on those models

[Diagram: ORDERS, CUSTOMER, SUPPLIER, PART, LINEITEM, PARTSUPP, NATION, and REGION connected by multiple levels of joins]
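What "unlimited levels of lookups" means can be shown with a toy flattening step. The tables below are tiny made-up samples that reuse TPC-H column names, and `flatten` and `revenue_by_region` are hypothetical helpers; the sketch only illustrates following a chain of lookups (ORDERS to CUSTOMER to NATION to REGION) before aggregating, not how Kylin stores or builds its model.

```python
# Hypothetical miniature lookup tables, keyed by primary key.
REGION = {1: {"r_name": "AFRICA"}, 2: {"r_name": "AMERICA"}}
NATION = {
    10: {"n_name": "KENYA", "n_regionkey": 1},
    11: {"n_name": "PERU", "n_regionkey": 2},
}
CUSTOMER = {100: {"c_nationkey": 10}, 101: {"c_nationkey": 11}}

# Hypothetical fact rows.
ORDERS = [
    {"o_orderkey": 1, "o_custkey": 100, "o_totalprice": 50},
    {"o_orderkey": 2, "o_custkey": 101, "o_totalprice": 70},
    {"o_orderkey": 3, "o_custkey": 100, "o_totalprice": 30},
]


def flatten(orders):
    """Follow ORDERS -> CUSTOMER -> NATION -> REGION for every fact row."""
    for order in orders:
        customer = CUSTOMER[order["o_custkey"]]
        nation = NATION[customer["c_nationkey"]]
        region = REGION[nation["n_regionkey"]]
        yield {**order, "n_name": nation["n_name"], "r_name": region["r_name"]}


def revenue_by_region(flat_rows):
    """Aggregate the flattened rows by a dimension three joins away."""
    totals = {}
    for row in flat_rows:
        totals[row["r_name"]] = totals.get(row["r_name"], 0) + row["o_totalprice"]
    return totals


totals = revenue_by_region(flatten(ORDERS))
```

A star schema would stop at CUSTOMER, one level away from the fact table; the snowflake model lets a dimension such as the region name, several joins away, take part in the pre-calculation.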


TPC-H on Kylin 2.0

TPC-H is a benchmark for decision support systems.

- Popular among commercial RDBMS & DW solutions
- Queries and data have broad industry-wide relevance
- Examines large volumes of data
- Executes queries with a high degree of complexity
- Gives answers to critical business questions

Kylin 2.0 runs all 22 TPC-H queries. (KYLIN-2467)

- Pre-calculation can answer very complex queries
- The goal at this stage is functionality
- Try it: https://github.com/Kyligence/kylin-tpch

[Diagram: ORDERS, CUSTOMER, SUPPLIER, PART, LINEITEM, PARTSUPP, NATION, REGION]


Complex Query 1

TPC-H query 07
- 0.17 sec (Hive+Tez 35.23 sec)
- 2 sub-queries

select supp_nation, cust_nation, l_year, sum(volume) as revenue
from (
    select
        n1.n_name as supp_nation,
        n2.n_name as cust_nation,
        l_shipyear as l_year,
        l_saleprice as volume
    from v_lineitem
        inner join supplier on s_suppkey = l_suppkey
        inner join v_orders on l_orderkey = o_orderkey
        inner join customer on o_custkey = c_custkey
        inner join nation n1 on s_nationkey = n1.n_nationkey
        inner join nation n2 on c_nationkey = n2.n_nationkey
    where (
            (n1.n_name = 'KENYA' and n2.n_name = 'PERU')
            or (n1.n_name = 'PERU' and n2.n_name = 'KENYA')
        )
        and l_shipdate between '1995-01-01' and '1996-12-31'
) as shipping
group by supp_nation, cust_nation, l_year
order by supp_nation, cust_nation, l_year

[Query plan: Sort, Aggr., Filter, Proj., and five Joins over LINEITEM, SUPPLIER, ORDER, CUSTOMER, NATION, and NATION]


Complex Query 2

TPC-H query 11
- 3.42 sec (Hive+Tez 15.87 sec)
- 4 sub-queries, 1 online join

with q11_part_tmp_cached as (
    select ps_partkey, sum(ps_partvalue) as part_value
    from v_partsupp
        inner join supplier on ps_suppkey = s_suppkey
        inner join nation on s_nationkey = n_nationkey
    where n_name = 'GERMANY'
    group by ps_partkey
),
q11_sum_tmp_cached as (
    select sum(part_value) as total_value
    from q11_part_tmp_cached
)
select ps_partkey, part_value
from (
    select ps_partkey, part_value, total_value
    from q11_part_tmp_cached, q11_sum_tmp_cached
) a
where part_value > total_value * 0.0001
order by part_value desc;

[Query plan: Sort, Filter, and Join on top of two Aggr./Filter/Proj. branches, each joining PARTSUPP, SUPPLIER, and NATION]


Complex Query 3

TPC-H query 12
- 7.66 sec (Hive+Tez 12.64 sec)
- 5 sub-queries, 2 online joins

with in_scope_data as (
    select l_shipmode, o_orderpriority
    from v_lineitem
        inner join v_orders on l_orderkey = o_orderkey
    where l_shipmode in ('REG AIR', 'MAIL')
        and l_receiptdelayed = 1
        and l_shipdelayed = 0
        and l_receiptdate >= '1995-01-01'
        and l_receiptdate < '1996-01-01'
),
all_l_shipmode as (
    select distinct l_shipmode
    from in_scope_data
),
high_line as (
    select l_shipmode, count(*) as high_line_count
    from in_scope_data
    where o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH'
    group by l_shipmode
),
low_line as (
    select l_shipmode, count(*) as low_line_count
    from in_scope_data
    where o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH'
    group by l_shipmode
)
select al.l_shipmode, hl.high_line_count, ll.low_line_count
from all_l_shipmode al
    left join high_line hl on al.l_shipmode = hl.l_shipmode
    left join low_line ll on al.l_shipmode = ll.l_shipmode
order by al.l_shipmode

[Query plan: Sort, Filter, and two Joins on top of three Aggr./Filter/Proj. branches, each joining ORDERS and LINEITEM]


More than MOLAP

- Supports complex data models and sub-queries; runs TPC-H
- Percentile / window / time functions

[Chart: SQL maturity versus speed, positioning Kylin 1.0, Kylin 2.0, DW on Hadoop, MOLAP, and Analytics DW]


Summary

Apache Kylin 2.0

- Kylin 2.0 Beta is available for download
- Spark cubing
- Snowflake schema support
- A TPC-H benchmark you can try yourself
- Time / window / percentile functions
- Near-real-time streaming

What is next

- Hadoop 3.0 support (Erasure Coding)
- Improve Spark Cubing
- Connect to more data sources (JDBC, SparkSQL)
- Replace the storage layer (Kudu?)
- Support a truly real-time lambda architecture


Thank you for listening!

Apache Kylin

[email protected]

Twitter: @ApacheKylin
http://kylin.apache.org

Kyligence Inc.

[email protected]

Twitter: @Kyligence
http://kyligence.io
