Post on 14-Dec-2014
Running Cloudera Impala on PostgreSQL
By Chengzhong Liu (liuchengzhong@miaozhen.com)
2013.12
Story coming from…
• Data gravity
• Why big data
• Why SQL on big data
Today's agenda
• Big data in Miaozhen (秒针系统)
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Performance
• Conclusions
• Q&A
What happened in Miaozhen
• 3 billion ad impressions per day
• 20TB data scanned for report generation every morning
• 24-server cluster
• Besides this:
– TV Monitor
– Mobile Monitor
– Site Monitor
– …
Before Hadoop
• Scrat
– PostgreSQL 9.1 cluster
– Wrote a simple proxy
– <2s for 2TB data scan
• Mobile Monitor
– Hadoop-like distributed computing system
– RabbitMQ + 3 computing servers
– Wrote a Map-Reduce in C++
– Handles 30 million to 500 million ad impressions
Problem & Chance
• Database cluster
• SQL on Hadoop
• Miscellaneous data
• Requirements
– Most data is relational
– SQL interface
SQL on Hadoop
• Google Dremel
• Apache Drill
• Cloudera Impala
• Facebook Presto
• EMC Greenplum/Pivotal
[Stack diagram: Hive and Pig run on MapReduce over HDFS; Impala, Drill, Pivotal, and Presto query HDFS directly]
Latency matters
What's this
• A kind of MPP engine
• In-memory processing
• Small-to-big join
– Broadcast join
• Small result size
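The small-to-big (broadcast) join above works by replicating the small table to every node that holds a partition of the big table, so each node can join locally with no shuffle of the big side. A minimal sketch of the idea in Python (function and data names are illustrative, not Impala code):

```python
from collections import defaultdict

def broadcast_join(big_partitions, small_table, big_key, small_key):
    """Broadcast join: build a hash table from the small table once
    (this is what gets broadcast), then probe it on every partition."""
    build = defaultdict(list)
    for row in small_table:
        build[row[small_key]].append(row)
    results = []
    for partition in big_partitions:      # one partition per node
        for row in partition:             # probe the broadcast replica
            for match in build.get(row[big_key], []):
                results.append({**row, **match})
    return results

# Two "nodes", each holding a slice of the big fact table.
big = [[{"id": 1, "clicks": 3}], [{"id": 2, "clicks": 5}]]
small = [{"id": 1, "campaign": "A"}, {"id": 2, "campaign": "B"}]
joined = broadcast_join(big, small, "id", "id")
```

The build side is hashed once and probed everywhere, which is why this strategy only pays off when one side is small.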
Why Cloudera Impala
• The team moves fast
– UDFs coming out
– Better join strategies on the way
• Good code base
– Modularized
– Easy to add subclasses
• Really fast
– LLVM code generation
– 80s/95s – uv test
– Distributed aggregation tree
– In-situ data processing (inside storage)
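The distributed aggregation tree mentioned above can be illustrated with a two-level count: each node pre-aggregates its local rows, and a coordinator (or intermediate tree level) merges the partial results. A toy sketch, not Impala's actual API:

```python
def local_aggregate(rows):
    # Each node computes a partial count per group from its own data.
    partials = {}
    for key in rows:
        partials[key] = partials.get(key, 0) + 1
    return partials

def merge_partials(partial_list):
    # The next tree level merges partial counts by summing per key.
    merged = {}
    for partials in partial_list:
        for key, count in partials.items():
            merged[key] = merged.get(key, 0) + count
    return merged

node1 = local_aggregate(["A", "A", "B"])
node2 = local_aggregate(["B", "C"])
total = merge_partials([node1, node2])   # {'A': 2, 'B': 2, 'C': 1}
```

Pre-aggregating at the leaves shrinks the data before it crosses the network, which is the point of the tree shape.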
Typical Arch.
[Diagram: an SQL interface and shared meta store on top; each of the three nodes below runs its own Query Planner, Coordinator, and Exec Engine]
Our target
• An MPP database
– Built on PostgreSQL 9.1
– Scales well
– Speed
• A mixed-data-source MPP query engine
– Join two tables from different sources
– In fact…
Hacking… from where
• Add, not change
– Scan node type
– DB meta info
• Put changes in configuration
– Thrift protocol update
• TDBHostInfo
• TDBScanNode
Front end
• Meta store update
– Link data to the table name
– Table location management
• Front end
– Compute table location
Back end
• Coordinator
– PG host
• New scan node type
– DB scan node
• PG scan node
• psql library using a cursor
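A scan node that reads from PostgreSQL through a cursor streams rows in fixed-size batches instead of materializing the full result set, keeping memory bounded. A rough sketch of that batching loop over any DB-API-style cursor; the `FakeCursor` stand-in is only there to make the sketch self-contained (a real scan node would get a server-side cursor from libpq or psycopg2):

```python
def scan_in_batches(cursor, batch_size=1000):
    """Stream rows from a DB-API-style cursor in fixed-size batches,
    the way a scan node would feed rows to its parent operator."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield row

class FakeCursor:
    """In-memory stand-in for a database cursor (illustration only)."""
    def __init__(self, rows):
        self._rows, self._pos = rows, 0
    def fetchmany(self, n):
        batch = self._rows[self._pos:self._pos + n]
        self._pos += len(batch)
        return batch

rows = list(scan_in_batches(FakeCursor([(i,) for i in range(5)]), batch_size=2))
```

With a server-side cursor, PostgreSQL holds the result on its side and `fetchmany` pulls only one batch across the wire at a time.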
SQL Plan
• select count(distinct id) from table
– MR-like process
• Plan (bottom to top):
– HDFS/PG scan
– Aggr.: group by id
– Exchange node
– Aggr.: group by id
– Aggr.: count(id)
– Exchange node
– Aggr.: sum(count(id))
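The count(distinct) plan above can be simulated end to end: each scan node deduplicates its local ids, an exchange hash-partitions ids so every distinct id lands on exactly one merge node, and the top level counts per node and sums. A toy model of the data flow, not Impala code:

```python
def count_distinct(partitions, num_merge_nodes=2):
    # Phase 1: local "group by id" on each scan node drops local duplicates.
    local = [set(p) for p in partitions]
    # Exchange: hash-partition ids so each distinct id reaches one node.
    buckets = [set() for _ in range(num_merge_nodes)]
    for ids in local:
        for i in ids:
            buckets[hash(i) % num_merge_nodes].add(i)
    # Phase 2: each merge node counts its ids; the coordinator sums them.
    return sum(len(b) for b in buckets)

# 'u1' and 'u2' appear more than once, but each is counted exactly once.
n = count_distinct([["u1", "u2", "u1"], ["u2", "u3"]])   # 3 distinct ids
```

Because the exchange routes every copy of an id to the same bucket, no id is double-counted even when it appears in several partitions.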
Env.
• Ad impression logs
– 150 million rows, 100KB/line
• 3 servers
– 24 cores
– 32 GB mem
– 2TB × 12 HDD
– 100Mbps LAN
• Queries
– select count(id) from t group by campaign
– select count(distinct id) from t group by campaign
– select * from t where id = 'xxxxxxxx'
Performance
[Chart: runtimes for queries 1–3 on a 0–700 scale, comparing impala, hive, and pg+impala]
• Group-by speed per core: 20 M/s
With index
Codegen on/off
[Chart: runtimes for the uv_test, distinct, and duplicated queries on a 0–100 scale, with codegen enabled (en_codegen) vs. disabled (dis_codegen)]
• select count(distinct id) from t group by c
• select distinct id from t
• select id from t group by id having count(case when c = '1' then 1 else null end) > 0 and count(case when c = '2' then 1 else null end) > 0 limit 10;
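The last query finds ids that occur with both c = '1' and c = '2'. Its semantics can be checked with a few lines of Python (the table data here is made up for illustration):

```python
from collections import defaultdict

def ids_with_both(rows, limit=10):
    # group by id, collecting the set of c values seen for each id
    seen = defaultdict(set)
    for r in rows:
        seen[r["id"]].add(r["c"])
    # having count(case when c='1'...) > 0 and count(case when c='2'...) > 0
    hits = [i for i, cs in seen.items() if "1" in cs and "2" in cs]
    return hits[:limit]         # limit 10

t = [{"id": "a", "c": "1"}, {"id": "a", "c": "2"},
     {"id": "b", "c": "1"}, {"id": "c", "c": "2"}]
result = ids_with_both(t)       # only 'a' has both values
```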
Multi-user
Conclusion
• Source quality
– Readable
– Google C++ style
– Robust
• MPP solution based on PG
– Proven perf.
– Easy to scale
• Mixed-engine usage
– HDFS and DB
What’s next
• YARN integration
• UDFs
• Joins with big tables
• BI roadmap
• Failover
Ref.
• Cloudera Impala online doc. & src
• http://files.meetup.com/1727991/Impala%20and%20BigQuery.ppt
• http://www.cubrid.org/blog/dev-platform/meet-impala-open-source-real-time-sql-querying-on-hadoop/
• http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/slides/Impala%20tech%20talk.pdf
• @datascientist, @dongxicheng, @flyingsk, @zhh
Thanks!
Q & A