How to Build a Cost-Effective Big Data Environment with Amazon EMR and Athena
Post on 21-Jan-2018
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Dickson Yue, Solutions Architect
June 2nd, 2017
How to Build a Cost-Effective Big Data Environment
with Amazon EMR and Athena

Agenda
• Deconstructing today's big data environment
• Amazon Athena and Amazon EMR in practice
• Migrating components to Amazon Athena: tips and tricks
• Migrating components to Amazon EMR: tips and tricks
• Customer examples
Deconstructing the existing big data environment

Evolution of data analytics platforms
• 1985: data warehouse appliances
• 2006: Hadoop clusters
• 2009: decoupled EMR clusters
• 2012: cloud DWH, Redshift
• Today: clusterless, Athena and Glue
Reference architecture (diagram)
• COLLECT: mobile apps, web apps, devices; sensors & IoT platforms (AWS IoT); data centers (AWS Direct Connect); AWS Import/Export Snowball; logging (Amazon CloudWatch, AWS CloudTrail). Data arrives as records, documents, files, messages, and streams.
• STORE: Amazon SQS, Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB (and DynamoDB Streams), Amazon S3, Amazon ElastiCache, Amazon RDS. These cover file, message, stream, SQL, NoSQL, and cache storage, both fast and slow.
• PROCESS / ANALYZE: Amazon Redshift, Amazon Machine Learning, Presto on Amazon EMR, Amazon Elasticsearch Service, AWS Lambda, Amazon Kinesis Analytics, Amazon Athena, plus streaming and KCL apps on Amazon EC2. Workload styles span batch, interactive, message, stream, search, and ML (logging, IoT, applications, transport, messaging, ETL).
• CONSUME: Amazon QuickSight, apps & services, analysis & visualization tools, notebooks, IDEs, APIs.
Choosing a tool by use case
• Interactive (needs seconds; e.g. self-service dashboards): Amazon Redshift; Athena + S3; Presto or Spark + S3; Amazon Elasticsearch Service, RDS
• Batch (minutes to hours; e.g. daily/weekly/monthly reports): MapReduce, Hive, Pig, Spark; Glue
• Stream (milliseconds to seconds; e.g. fraud alerts, 1-minute metrics): Spark Streaming; Kinesis Analytics; KCL, Storm, Lambda
• Machine learning (milliseconds to minutes; e.g. fraud detection, forecast demand): Spark ML; Amazon Machine Learning; Deep Learning AMI
Migrating workloads to Amazon Athena

Query data directly from Amazon S3
• No data loading required
• Query data in its raw format; Athena supports many data formats
• Text, CSV, TSV, JSON, weblogs, AWS service logs
• Or convert to an optimized format such as ORC or Parquet for the best performance and lowest cost
• No ETL required
• Stream data directly into Amazon S3
• Take advantage of Amazon S3's durability and availability

Example: ad-hoc access to raw data using SQL

Example: ad-hoc access to data using Athena; Athena can query aggregated datasets as well
Tips and tricks

Pay per query: $5 per TB scanned
• You pay for the amount of data each query scans
• Ways to save cost: compress; convert to a columnar format; use partitioning
• Free: DDL queries, failed queries

Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
Logs stored as text files | 1 TB | 237 seconds | 1.15 TB | $5.75
Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
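The pricing arithmetic in this table can be checked with a few lines of Python. This is a sketch: the $5/TB rate and scan sizes come from the slide, while the helper name is mine.

```python
PRICE_PER_TB_SCANNED = 5.00  # Athena rate quoted on this slide: $5 per TB scanned

def athena_query_cost(scanned_tb: float) -> float:
    """Cost of a single Athena query, given TB of data scanned."""
    return round(scanned_tb * PRICE_PER_TB_SCANNED, 3)

# Logs as text files: 1.15 TB scanned
text_cost = athena_query_cost(1.15)            # $5.75
# Logs as Parquet: only 2.69 GB scanned
parquet_cost = athena_query_cost(2.69 / 1024)  # $0.013
# Roughly 99.7% cheaper, as the table claims
savings = 1 - parquet_cost / text_cost
```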
Converting to ORC and Parquet
• You can convert data with a Hive CTAS query:

CREATE TABLE new_key_value_store
STORED AS PARQUET
AS
SELECT col_1, col2, col3 FROM noncolumnartable
SORT BY new_key, key_value_pair;

• You can also use Spark to convert files to Parquet / ORC
• About 20 lines of PySpark code running on EMR converts 1 TB of text data into 130 GB of Parquet
• Total cost of the conversion: about $5
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
How to define your partitions

CREATE EXTERNAL TABLE Employee (
  Id INT,
  Name STRING,
  Address STRING
) PARTITIONED BY (year INT)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ','
LOCATION 's3://mybucket/athena/inputdata/';

The partition column is declared only in PARTITIONED BY. Declaring it in the column list as well is invalid, because the column name would be duplicated:

CREATE EXTERNAL TABLE Employee (
  Id INT,
  Name STRING,
  Address STRING,
  Year INT          -- invalid: duplicates the partition column
) PARTITIONED BY (year INT)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ','
LOCATION 's3://mybucket/athena/inputdata/';
How to define your partitions

s3://elasticmapreduce/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/

CREATE EXTERNAL TABLE impressions ( requestBeginTime string, ......) PARTITIONED BY (dt string) LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/';

When a new prefix such as PRE dt=2009-04-12-14-10/ appears, load it with:
MSCK REPAIR TABLE impressions
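MSCK REPAIR TABLE works here because the prefixes follow Hive's key=value naming convention. As a small sketch (the function name is mine), this is the kind of partition discovery it performs over the S3 listing:

```python
def discover_partitions(prefixes):
    """Extract Hive-style partition key/value pairs from S3 prefixes,
    mimicking what MSCK REPAIR TABLE discovers."""
    partitions = []
    for prefix in prefixes:
        name = prefix.rstrip("/")
        if "=" in name:
            key, value = name.split("=", 1)
            partitions.append((key, value))
    return partitions

listing = ["dt=2009-04-12-13-00/", "dt=2009-04-12-13-05/", "dt=2009-04-12-14-10/"]
print(discover_partitions(listing))
# [('dt', '2009-04-12-13-00'), ('dt', '2009-04-12-13-05'), ('dt', '2009-04-12-14-10')]
```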
How to define your partitions
s3://athena-examples/elb/plaintext/
elb/plaintext/2015/01/01/part-r-00000-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
elb/plaintext/2015/01/01/part-r-00001-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
elb/plaintext/2015/01/01_$folder$
elb/plaintext/2015/01/02/part-r-00006-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
elb/plaintext/2015/01/02/part-r-00007-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
elb/plaintext/2015/01/02/_$folder$
ALTER TABLE elb_logs_raw_native_part
ADD PARTITION (year='2015',month='01',day='01')
location 's3://athena-examples/elb/plaintext/2015/01/01/'
ALTER TABLE elb_logs_raw_native_part
ADD PARTITION (year='2015',month='01',day='02')
location 's3://athena-examples/elb/plaintext/2015/01/02/'
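Because these paths do not follow the key=value convention, each partition must be added explicitly as above. Generating the ALTER TABLE statements is easy to script; a sketch (the function name is mine):

```python
from datetime import date, timedelta

def partition_ddl(table, prefix, start, days):
    """Emit one ALTER TABLE ... ADD PARTITION statement per day."""
    statements = []
    for i in range(days):
        d = start + timedelta(days=i)
        y, m, day = f"{d.year:04d}", f"{d.month:02d}", f"{d.day:02d}"
        statements.append(
            f"ALTER TABLE {table} "
            f"ADD PARTITION (year='{y}', month='{m}', day='{day}') "
            f"location '{prefix}/{y}/{m}/{day}/'"
        )
    return statements

ddl = partition_ddl("elb_logs_raw_native_part",
                    "s3://athena-examples/elb/plaintext",
                    date(2015, 1, 1), 2)
```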
Migrating workloads to Amazon EMR

Challenges of the on-premises Hadoop cluster
• 1U server units
• Typically 12 cores, 32/64 GB RAM, and 6 to 8 TB of disk ($3-4K each)
• Different node roles
• HDFS on local disks, with capacity sized for 3x data replication
• Network switches and racks
• Open-source releases, or license terms tied to a commercial distribution
• Server racks 1 through N (20 nodes each), connected through core switches
Workload types running on the same cluster
• Large Scale ETL: Apache Spark, Apache Hive with Apache Tez or
Apache Hadoop MapReduce
• Interactive Queries: Apache Impala, Spark SQL, Presto, Apache
Phoenix
• Machine Learning and Data Science: Spark ML, Apache Mahout
• NoSQL: Apache HBase
• Stream Processing: Apache Kafka, Spark Streaming, Apache Flink,
Apache NiFi, Apache Storm
• Search: Elasticsearch, Apache Solr
• Job Submission: Client Edge Node, Apache Oozie
• Data warehouses like Pivotal Greenplum or Teradata
Production utilization (graph): the cluster is over-utilized at peak and under-utilized off-peak
Tips and tricks

Key migration and TCO considerations
• DO NOT LIFT AND SHIFT
• Separate storage and compute via S3
• Deconstruct workloads and map them to open-source tools
• Ephemeral clusters and auto scaling
• Choose instance types and EC2 Spot instances
Separate compute and storage: use S3 as your data layer
• S3 is designed for 11 9's of durability and is massively scalable
• Amazon EMR reads from and writes to Amazon S3 instead of HDFS
• Intermediates are stored on local disk or HDFS (EC2 instance memory and local disks)
Run HBase on S3 as a scalable NoSQL store

S3 tips: partitioning, compression, and file formats
• Avoid lexicographically sequential key names
• This improves throughput and S3 LIST performance
• Use hashed/random prefixes or reverse the date-time
• Compress datasets to reduce bandwidth from S3 to EC2
• Make sure the compression is splittable, or size each file optimally for parallelism on the cluster
• Columnar file formats such as Parquet improve read performance
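The key-naming advice can be sketched in Python. The helper names and bucket paths are mine, and this reflects the S3 key-naming guidance current at the time of this 2017 talk:

```python
import hashlib

def hashed_key(prefix, name):
    """Spread keys across the S3 index by prepending a short hash of the name."""
    h = hashlib.md5(name.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}/{h}/{name}"

def reversed_ts_key(prefix, epoch_seconds, name):
    """Reverse the timestamp digits so consecutive writes do not share a prefix."""
    return f"{prefix}/{str(epoch_seconds)[::-1]}/{name}"

print(hashed_key("s3://mybucket/logs", "2017-06-02-part-0001.gz"))
print(reversed_ts_key("s3://mybucket/logs", 1496361600, "part-0001.gz"))
```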
Multiple storage layers to choose from
Amazon DynamoDB
Amazon RDS Amazon Kinesis
Amazon Redshift
Amazon S3
Amazon EMR
TCO: ephemeral vs. long-running clusters
Options for submitting jobs
• Amazon EMR Step API: submit a Spark application to Amazon EMR
• AWS Data Pipeline, or Airflow, Luigi, and other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows
• AWS Lambda: use AWS Lambda to submit applications to the EMR Step API or directly to Spark on your cluster
• Apache Oozie: use Oozie on your cluster to build DAGs of jobs
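As a sketch of the Step API option, the snippet below builds the step definition that a boto3 add_job_flow_steps call expects. The cluster ID, S3 paths, and argument values are placeholders of mine, and the boto3 call itself is left commented out:

```python
def spark_step(name, app_s3_path, *app_args):
    """Build an EMR step that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     app_s3_path, *app_args],
        },
    }

step = spark_step("nightly-etl", "s3://mybucket/jobs/etl.py", "--date", "2017-06-02")

# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```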
Cluster interfaces for quickly tuning workloads
• Management applications
• SQL editor, workflow designer, metastore browser
• Notebooks for designing and executing queries and workloads

Performance and hardware
• Ephemeral or long-running
• Instance types
• Cluster size
• Application settings
• File formats and S3 tuning
Master Node
r3.2xlarge
Slave Group - Core
c4.2xlarge
Slave Group – Task
m4.2xlarge (EC2 Spot)
Notes
• Spot for task nodes: up to 80% off EC2 On-Demand pricing
• On-Demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
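A hypothetical cost comparison under these rules. The per-node rate below is a made-up placeholder, not a real EC2 price; only the "up to 80% off" figure comes from the slide:

```python
ON_DEMAND_RATE = 0.40  # hypothetical $/hour per node (placeholder, not a real EC2 rate)
SPOT_DISCOUNT = 0.80   # "up to 80% off" On-Demand, per the slide

def hourly_cost(core_nodes, task_nodes):
    """Core nodes on On-Demand, task nodes on Spot."""
    core = core_nodes * ON_DEMAND_RATE
    task = task_nodes * ON_DEMAND_RATE * (1 - SPOT_DISCOUNT)
    return core + task

all_on_demand = 10 * ON_DEMAND_RATE              # 10 nodes, all On-Demand
mixed = hourly_cost(core_nodes=4, task_nodes=6)  # same 10 nodes, 6 on Spot
```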
Lower cost with Spot and Reserved Instances
• Meet SLAs at a predictable cost; exceed SLAs at a lower cost

Using advanced Spot: instance fleets
• Master node, core instance fleet, task instance fleet
• Choose instance types offered as both Spot and On-Demand
• Launches in the optimal Availability Zone based on capacity and price
• Spot Block support
Lower cost with Auto Scaling

Customer examples
DataXu: 180 TB of log data per day (architecture diagram)
• A CDN and a real-time bidding / retargeting platform feed events into Kinesis
• Attribution & ML write to S3; ETL (Spark SQL) is orchestrated with Data Pipeline
• Reporting and data visualization sit on top
FINRA: migrating from on-premises to AWS
• Ecosystem of tools and services, including Amazon Athena
• Petabytes of data generated on-premises, brought to AWS, and stored in S3
• Thousands of analytical queries performed on EMR and Amazon Redshift
• Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing
• Flexible interactive queries, predefined queries, and surveillance analytics
• Data management: data movement, data registration, version management (Amazon S3 and web applications, serving analysts and regulators)
• FINRA saved 60% by moving to HBase on EMR: lower cost and higher scale than on-premises
Summary: choose the right tool for your use case
• Storage: S3 (EMRFS), HDFS
• Cluster resource management: YARN
• Batch: MapReduce
• Interactive: Tez
• In-memory: Spark
• Streaming: Flink
• Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop; HBase/Phoenix; Presto; Glue; Amazon Redshift

• Low-latency SQL → Athena, Presto, or Amazon Redshift
• Data warehouse / reporting → Spark, Hive, Glue, or Amazon Redshift
• Management and monitoring → EMR console or Ganglia metrics
• HDFS → S3
• Notebooks → Zeppelin or Jupyter (via bootstrap action)
• Query console → Athena, Redshift Spectrum, or Hue
• Security → Ranger (CloudFormation template), HiveServer2, or IAM roles
Summary: Athena
• Compress
• Convert to a columnar format
• Use partitioning

Summary: Amazon EMR
• DO NOT LIFT AND SHIFT
• Separate storage and compute with S3
• Ephemeral clusters
• Spot fleet instances
• Auto scaling
Thank you
dyue@amazon.com
aws.amazon.com/emr
blogs.aws.amazon.com/bigdata