刘诚忠: Running Cloudera Impala on PostgreSQL
Where this story comes from…
• Data gravity
• Why big data
• Why SQL on big data
Today's agenda
• Big data at Miaozhen Systems (秒针系统)
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Performance
• Conclusions
• Q&A
What happens at Miaozhen
• 3 billion ad impressions per day
• 20 TB of data scanned for report generation every morning
• 24-server cluster
• Besides this
– TV Monitor
– Mobile Monitor
– Site Monitor
– …
Before Hadoop
• Scrat
– PostgreSQL 9.1 cluster
– Wrote a simple proxy
– <2s for a 2 TB data scan
• Mobile Monitor
– Hadoop-like distributed computing system
– RabbitMQ + 3 computing servers
– Wrote a MapReduce implementation in C++
– Handles 30 million to 500 million ad impressions
Problems & opportunities
• Database cluster
• SQL on Hadoop
• Miscellaneous data
• Requirements
– Most data is relational
– SQL interface
SQL on Hadoop
• Google Dremel
• Apache Drill
• Cloudera Impala
• Facebook Presto
• EMC Greenplum/Pivotal
[Diagram: HDFS, MapReduce, Hive, Pig, and Impala/Drill/Pivotal/Presto]
Latency matters
What’s this
• A kind of MPP engine
• In-memory processing
• Small-to-big joins
– Broadcast join
• Small result size
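The broadcast join mentioned above can be sketched roughly as follows. This is only an illustration of the general technique, not Impala's code: the function name `broadcast_hash_join`, the sample tables, and the use of Python lists to stand in for cluster nodes are all assumptions made for the example.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, big_partitions, key):
    """Join a small table against a partitioned big table.

    The small table is hashed once and the same hash table is used by
    every partition, so no big-table row has to move between nodes.
    """
    # Build phase: hash the small table; this is what gets "broadcast".
    build = defaultdict(list)
    for row in small_rows:
        build[row[key]].append(row)

    results = []
    for partition in big_partitions:   # each partition stands in for one node
        for big_row in partition:      # probe phase, local to that node
            for small_row in build.get(big_row[key], []):
                results.append({**small_row, **big_row})
    return results


# Hypothetical sample data: a small dimension table and a big fact table
# that is already split across two "nodes".
campaigns = [{"campaign": 1, "name": "spring"}, {"campaign": 2, "name": "summer"}]
impressions = [
    [{"id": "a", "campaign": 1}, {"id": "b", "campaign": 2}],  # node 0
    [{"id": "c", "campaign": 1}, {"id": "d", "campaign": 3}],  # node 1
]
joined = broadcast_hash_join(campaigns, impressions, "campaign")
```

The approach only pays off when the build side is small enough to fit in every node's memory, which is why the slide pairs it with small-to-big joins.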
Why Cloudera Impala
• The team moves fast
– UDFs coming out
– Better join strategy on the way
• Good code base
– Modularized
– Easy to add subclasses
• Really fast
– LLVM code generation
• 80s/95s – UV test
– Distributed aggregation tree
– In-situ data processing (inside storage)
Typical architecture
[Diagram: a SQL interface and a meta store, plus three nodes that each contain a query planner, a coordinator, and an exec engine]
Our target
• An MPP database
– Built on PostgreSQL 9.1
– Scales well
– Speed
• A mixed-data-source MPP query engine
– Joins two tables from different sources
– In fact…
Hacking… where to start
• Add, not change
– Scan Node type
– DB Meta info
• Put changes in configuration
– Thrift Protocol update
• TDBHostInfo
• TDBScanNode
Front end
• Meta store update
– Link data to the table name
– Table location management
• Front end
– Compute table location
Back end
• Coordinator
– PG host
• New scan node type
– DB scan node
• PG scan node
• psql library, using a cursor
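Reading through a cursor lets a scan node pull rows in bounded batches instead of materializing the whole result. A minimal sketch of that pattern follows. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for PostgreSQL, and the helper name `scan_in_batches`, the table layout, and the batch size are hypothetical.

```python
import sqlite3


def scan_in_batches(conn, query, batch_size):
    """Yield lists of rows, each holding at most batch_size rows."""
    cur = conn.cursor()
    cur.execute(query)
    while True:
        batch = cur.fetchmany(batch_size)  # pull the next chunk from the cursor
        if not batch:
            break
        yield batch


# Stand-in database with ten made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, campaign INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(f"id{i}", i % 3) for i in range(10)]
)

batches = list(scan_in_batches(conn, "SELECT id, campaign FROM t ORDER BY id", 4))
```

Each yielded batch plays the role of one row batch handed to the execution engine, so memory use is bounded by the batch size rather than by the table size.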
SQL Plan
• select count(distinct id) from table
– MapReduce-like process
• Plan, from root to leaf
– Aggr.: sum(count(id))
– Exchange node
– Aggr.: count(id)
– Aggr.: group by id
– Exchange node
– Aggr.: group by id
– HDFS/PG scan
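The MapReduce-like evaluation of `count(distinct id)` can be sketched as below. The ordering of the stages (scan, local group by, exchange on id, second group by, count, exchange, sum) is a reading of the slide's plan diagram and should be treated as an assumption; the function name, the CRC32 partitioning, and the sample data are made up for the illustration.

```python
from collections import defaultdict
from zlib import crc32


def distributed_count_distinct(partitions, num_nodes):
    """Count distinct ids across partitions in MapReduce-like stages."""
    # Stage 1: each scan node does a local "group by id" (deduplicates).
    local_groups = [set(ids) for ids in partitions]

    # Stage 2: exchange, hash-partitioned on id, so that equal ids meet
    # on the same node; adding to a set is the second "group by id".
    shuffled = defaultdict(set)
    for group in local_groups:
        for id_ in group:
            shuffled[crc32(id_.encode()) % num_nodes].add(id_)

    # Stage 3: each node computes count(id) over its own distinct ids.
    partial_counts = [len(ids) for ids in shuffled.values()]

    # Stage 4: exchange to the coordinator, which computes sum(count(id)).
    return sum(partial_counts)


# Three hypothetical scan partitions with overlapping ids.
partitions = [["a", "b", "a"], ["b", "c"], ["c", "d", "a"]]
total = distributed_count_distinct(partitions, num_nodes=2)
```

Because every id lands on exactly one node after the exchange, the per-node counts never overlap and can simply be summed.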
Env.
• Ad impression logs
– 150 million lines, 100KB/line
• 3 servers
– 24 cores
– 32 GB memory
– 12 × 2 TB hard disks
– 100 Mbps LAN
• Queries
– select count(id) from t group by campaign
– select count(distinct id) from t group by campaign
– select * from t where id = 'xxxxxxxx'
Performance
[Chart: impala vs. hive vs. pg+impala]
• Group-by speed per core
• 20 M/s
With index
Codegen on/off
[Chart: en_codegen vs. dis_codegen]
• select count(distinct id) from t group by c
• select distinct id from t
• select id from t group by id having count(case when c = '1' then 1 else null end) > 0 and count(case when c = '2' then 1 else null end) > 0 limit 10;
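The last query returns ids that have at least one row with c = '1' and at least one row with c = '2'. A small way to try that logic out is sketched here, using Python's built-in `sqlite3` as a stand-in engine rather than Impala or PostgreSQL. The sample rows are invented, and an `ORDER BY` is added only to make the result order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("u1", "1"), ("u1", "2"),               # both values: qualifies
        ("u2", "1"),                            # only '1'
        ("u3", "2"),                            # only '2'
        ("u4", "1"), ("u4", "2"), ("u4", "3"),  # both values: qualifies
    ],
)

# count() ignores NULLs, so each count(case ...) counts only matching rows.
query = """
SELECT id FROM t
GROUP BY id
HAVING count(CASE WHEN c = '1' THEN 1 ELSE NULL END) > 0
   AND count(CASE WHEN c = '2' THEN 1 ELSE NULL END) > 0
ORDER BY id
LIMIT 10
"""
ids = [row[0] for row in conn.execute(query)]
```

Here `ids` holds the two ids that appear with both values, while `u2` and `u3` are filtered out by the `HAVING` clause.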
Multi-users
Conclusion
• Source quality
– Readable
– Google C++ style
– Robust
• MPP solution based on PG
– Proven performance
– Easy to scale
• Mixed engine usage
– HDFS and DB
What’s next
• YARN integration
• UDFs
• Joins with big tables
• BI roadmap
• Failover
Thanks! Q & A