Getting Started with Your First Big Data Project on AWS (Ilho Kim) - AWS Webinar Series 2015
TRANSCRIPT
Advanced Webinar Series | Session 8 | Thursday, July 9, 2015 | 2:00 PM
http://aws.amazon.com/ko
Getting Started with Your First Big Data Project on AWS
Ilho Kim (김일호), Solutions Architect
What you will hear in this webinar:
This session introduces how to build simpler, faster big data analytics services using the various data analysis tools AWS provides, such as AWS Elastic MapReduce, Amazon Redshift, and Amazon Kinesis.
Agenda
• AWS Big data building blocks
• AWS Big data platform
• Log data collection & storage
• Introducing Amazon Kinesis
• Data Analytics & Computation
• Collaboration & sharing
• Netflix Use-case
AWS Big data building blocks (brief)
Use the right tools
Amazon S3
Amazon Kinesis
Amazon DynamoDB
Amazon Redshift
Amazon Elastic MapReduce
Amazon S3: store anything; object storage; scalable; 99.999999999% durability
Amazon Kinesis: real-time processing; high throughput; elastic; easy to use; EMR, S3, Redshift, DynamoDB integrations
Amazon DynamoDB: NoSQL database; seamless scalability; zero admin; single-digit millisecond latency
Amazon Redshift: relational data warehouse; massively parallel; petabyte scale; fully managed; $1,000/TB/year
Amazon Elastic MapReduce: Hadoop/HDFS clusters; Hive, Pig, Impala, HBase; easy to use; fully managed; on-demand and spot pricing; tight integration with S3, DynamoDB, and Kinesis
HDFS
Amazon Redshift
Amazon RDS
Amazon S3
Amazon DynamoDB
Amazon EMR
Amazon Kinesis
AWS Data Pipeline
Data management / Hadoop ecosystem analytical tools
Data Sources
AWS Data Pipeline
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon DynamoDB
Amazon RDS
Amazon Redshift
AWS Direct Connect
AWS Storage Gateway
AWS Import/ Export
Amazon Glacier
Amazon S3
Amazon Kinesis
Amazon EMR
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon EC2 Amazon EMR Amazon Kinesis
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon Redshift
Amazon DynamoDB
Amazon RDS
Amazon S3
Amazon EC2
Amazon EMR
Amazon CloudFront
AWS CloudFormation
AWS Data Pipeline
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
The right tools. At the right scale. At the right time.
AWS Big data platform
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collection of Data
Sources → Aggregation tool → Data sink
• Sources: web servers, application servers, connected devices, mobile phones, etc.
• Aggregation tool: a scalable method to collect and aggregate, such as Flume, Kafka, Kinesis, or a queue
• Data sink: a reliable and durable destination (or destinations)
Types of data ingest:
• Transactional: database reads/writes (database)
• File: click-stream logs (cloud storage)
• Stream: click-stream logs (stream storage)
Run your own log collector
Your application (running on Amazon EC2) ships logs to Amazon S3, DynamoDB, or any other data store.
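Rolling your own collector usually comes down to buffering log lines and flushing them in batches to the destination store. A minimal sketch of that batching logic in Python; the `upload` callback and the thresholds are illustrative stand-ins (a real collector would pass an S3 upload function there):

```python
import time

class LogBatcher:
    """Buffer log lines and flush when a size or age threshold is hit."""
    def __init__(self, upload, max_lines=1000, max_age_sec=60):
        self.upload = upload          # callback, e.g. an S3 PUT in a real collector
        self.max_lines = max_lines
        self.max_age_sec = max_age_sec
        self.buf = []
        self.started = time.time()

    def add(self, line):
        self.buf.append(line)
        if len(self.buf) >= self.max_lines or time.time() - self.started >= self.max_age_sec:
            self.flush()

    def flush(self):
        if self.buf:
            self.upload("\n".join(self.buf))  # one object per batch
            self.buf = []
            self.started = time.time()

batches = []                           # stands in for the remote data store
b = LogBatcher(batches.append, max_lines=3)
for i in range(7):
    b.add(f"GET /page/{i} 200")
b.flush()                              # push out the final partial batch
print(len(batches))                    # 3 batches: 3 + 3 + 1 lines
```

Batching by size and age is what keeps the number of small PUTs (and their cost) under control.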
Use a Queue
Amazon Simple Queue Service (SQS)
Amazon S3
DynamoDB
Any other data store
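The queue's job is to decouple the producers from the data store: producers enqueue and move on, while workers drain the queue into storage at their own pace. A toy sketch with Python's standard-library queue standing in for SQS (illustration only; real SQS also involves visibility timeouts and retries):

```python
import queue
import threading

q = queue.Queue()   # stands in for Amazon SQS
store = []          # stands in for S3 / DynamoDB / any other data store

def worker():
    while True:
        msg = q.get()
        if msg is None:          # sentinel: shut the worker down
            break
        store.append(msg)        # "write to the data store"
        q.task_done()

t = threading.Thread(target=worker)
t.start()

for i in range(5):               # producers just enqueue and move on
    q.put({"event": i})
q.put(None)
t.join()
print(len(store))
```

The producer never blocks on the data store, which is exactly the property the SQS tier provides in the architecture above.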
Agency Customer: Video Analytics on AWS
Elastic Load Balancer
Edge Servers on EC2
Workers on EC2
Logs Reports
HDFS Cluster
Amazon Simple Queue Service (SQS)
Amazon Simple Storage Service (S3)
Amazon Elastic MapReduce
Use a tool like Flume, Kafka, Honu, etc.
Flume running on EC2 ships data to Amazon S3, HDFS, or any other data store.
Choice of tools
• (+) Pros / (-) Cons • (+) Flexibility: Customers select the most appropriate software and underlying
infrastructure • (+) Control: Software and hardware can be tuned to meet specific business and
scenario needs. • (-) Ongoing Operational Complexity: Deploy, and manage an end-to-end system • (-) Infrastructure planning and maintenance: Managing a reliable, scalable
infrastructure • (-) Developer/ IT staff overhead: Developers, Devops and IT staff time and
energy expended • (-) Unsupported Software: deprecated and/ pre-version 1 open source software • Future – Need for to stream data for real time
Stream Storage
Database
Cloud Storage
Why Stream Storage?
• Convert multiple streams into fewer persistent sequential streams
• Sequential streams are easier to process
Amazon Kinesis or Kafka
(Diagram: Producers 1 through N write ordered records into Amazon Kinesis or Kafka, distributed across Shard/Partition 1 and Shard/Partition 2.)
Amazon Kinesis or Kafka
Why Stream Storage?
• Decouple producers and consumers
• Buffer
• Preserve client ordering
• Streaming MapReduce
• Consumer replay / reprocess
(Diagram: Producers 1 through N write ordered records across Shard/Partition 1 and Shard/Partition 2; Consumer 1 counts red = 4 and violet = 4, Consumer 2 counts blue = 4 and green = 4.)
Introducing Amazon Kinesis
Managed Service for Real-Time Processing of Big Data
(Diagram: Data sources send records to the AWS endpoint; the stream's shards (Shard 1 … Shard N) are replicated across three Availability Zones; App.1 [Aggregate & De-Duplicate], App.2 [Metric Extraction], App.3 [Sliding Window Analysis], and App.4 [Machine Learning] consume the stream and write to S3, DynamoDB, Redshift, and EMR.)
Kinesis Architecture
(Diagram: millions of sources producing 100s of terabytes per hour send data to a front end that handles authentication and authorization; durable, highly consistent storage replicates data across three data centers (Availability Zones); the ordered stream of events supports multiple readers, which aggregate and archive to S3, drive real-time dashboards and alarms, run machine learning algorithms or sliding-window analytics, and feed aggregate analysis in Hadoop or a data warehouse.)
Inexpensive: $0.028 per million PUTs
Putting data into Kinesis: Managed Service for Ingesting Fast Moving Data
• Streams are made of Shards
– A Kinesis Stream is composed of multiple Shards
– Each Shard ingests up to 1 MB/sec of data, and up to 1,000 TPS
– Each Shard emits up to 2 MB/sec of data
– All data is stored for 24 hours
– You scale Kinesis streams by adding or removing Shards
• Simple PUT interface to store data in Kinesis
– Producers use a PUT call to store data in a Stream
– A Partition Key is used to distribute the PUTs across Shards
– A unique Sequence # is returned to the Producer upon a successful PUT call
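Those per-shard limits make first-cut capacity planning a small arithmetic exercise: take the maximum of the shard counts implied by ingress bandwidth, record rate, and egress bandwidth. A hypothetical helper, not part of any AWS SDK:

```python
import math

def shards_needed(ingress_mb_per_sec, records_per_sec, egress_mb_per_sec=0):
    """Minimum shard count from the documented per-shard limits:
    1 MB/s and 1,000 records/s in, 2 MB/s out per shard."""
    by_ingress = math.ceil(ingress_mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000.0)
    by_egress = math.ceil(egress_mb_per_sec / 2.0)
    return max(1, by_ingress, by_records, by_egress)

# 5 MB/s of ~2 KB records, read by two consumers (10 MB/s out)
print(shards_needed(5, 2500, 10))  # 5
```

Whichever limit binds first determines the shard count; here ingress and egress both demand 5 shards while the record rate alone would need only 3.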
(Diagram: many producers PUT records into a Kinesis stream composed of Shards 1 through n.)
POST / HTTP/1.1
Host: kinesis.<region>.<domain>
x-amz-Date: <Date>
Authorization: AWS4-HMAC-SHA256 Credential=<Credential>, SignedHeaders=content-type;date;host;user-agent;x-amz-date;x-amz-target;x-amzn-requestid, Signature=<Signature>
User-Agent: <UserAgentString>
Content-Type: application/x-amz-json-1.1
Content-Length: <PayloadSizeBytes>
Connection: Keep-Alive
X-Amz-Target: Kinesis_20131202.PutRecord
{
"StreamName": "exampleStreamName",
"Data": "XzxkYXRhPl8x",
"PartitionKey": "partitionKey"
}
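In practice you would send this request through an AWS SDK rather than hand-signing HTTP, but the JSON body itself is easy to assemble: the `Data` field is just the record payload, base64-encoded. A small sketch that reproduces the body shown above:

```python
import base64
import json

def build_put_record(stream_name, data_bytes, partition_key):
    """Assemble the PutRecord JSON body; the Data field must be base64-encoded."""
    return json.dumps({
        "StreamName": stream_name,
        "Data": base64.b64encode(data_bytes).decode("ascii"),
        "PartitionKey": partition_key,
    })

body = build_put_record("exampleStreamName", b"_<data>_1", "partitionKey")
print(body)  # the Data field encodes to "XzxkYXRhPl8x", as in the request above
```

The AWS SDKs take care of the base64 encoding and the Signature Version 4 signing shown in the headers for you.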
(Diagram: KCL Workers 1 through n, spread across EC2 instances, read from Shards 1 through n of a Kinesis stream.)
Building Kinesis Apps
Client library for fault-tolerant, at-least-once, real-time processing
• Key streaming application attributes:
• Be distributed, to handle multiple shards
• Be fault tolerant, to handle failures in hardware or software
• Scale up and down as the number of shards increases or decreases
• Kinesis Client Library (KCL) helps with distributed processing:
• Automatically starts a Kinesis Worker for each shard
• Simplifies reading from the stream by abstracting individual shards
• Increases/decreases Kinesis Workers as the # of shards changes
• Checkpoints to keep track of a Worker's location in the stream
• Restarts Workers if they fail
• Use the KCL with Auto Scaling Groups
• Auto Scaling policies will restart EC2 instances if they fail
• Automatically add EC2 instances when load increases
• KCL will redistribute Workers to use the new EC2 instances
OR
• Use the Get APIs for raw reads of Kinesis data streams
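The checkpointing idea is the heart of the KCL's fault tolerance: a worker records the sequence number of the last record it processed, so a replacement worker resumes from the checkpoint instead of the beginning. An in-memory simulation of that behavior (not the real KCL API, which keeps checkpoints in a DynamoDB table):

```python
checkpoints = {}   # shard_id -> last processed sequence number
seen = []          # records actually handled, for demonstration

def process_shard(shard_id, records, fail_at=None):
    """Process records in order, checkpointing each one; optionally 'crash'."""
    start = checkpoints.get(shard_id, -1) + 1   # resume just past the checkpoint
    for seq in range(start, len(records)):
        if seq == fail_at:
            return "crashed"                    # simulated worker failure
        seen.append(records[seq])
        checkpoints[shard_id] = seq             # checkpoint after processing
    return "done"

records = ["r0", "r1", "r2", "r3", "r4"]
process_shard("shard-1", records, fail_at=3)    # first worker dies mid-stream
process_shard("shard-1", records)               # replacement resumes at r3
print(seen)  # every record handled exactly once
```

Checkpointing after processing (rather than before) is what yields the at-least-once guarantee: a crash between processing and checkpointing replays that record.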
Amazon Kinesis: Key Developer Benefits
Easy administration: a managed service for real-time streaming data collection, processing, and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.
Real-time performance: perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.
High throughput, elastic: seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.
S3, EMR, Storm, Redshift, & DynamoDB integration: reliably collect, process, and transform all of your data in real time and deliver it to the AWS data stores of your choice, with connectors for S3, Redshift, and DynamoDB.
Build real-time applications: client libraries enable developers to design and operate real-time streaming data processing applications.
Low cost: cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only for what you use.
Customers using Amazon Kinesis
Mobile/Social Gaming:
• Deliver continuous, real-time game insight data from 100s of game servers
• Before: custom-built solutions that were operationally complex to manage and not scalable
• Pain points: delays in critical business data delivery; developer burden in building a reliable, scalable platform for real-time data ingestion/processing; slow-down of real-time customer insights
Digital Advertising Tech:
• Generate real-time metrics and KPIs on online ad performance for advertisers/publishers
• Before: a store-and-forward fleet of log servers and a Hadoop-based processing pipeline
• Pain points: lost data in the store/forward layer; operational burden in managing a reliable, scalable platform for real-time data ingestion/processing; batch-driven "real-time" customer insights
With Kinesis: accelerate time to market of elastic, real-time applications while minimizing operational overhead, and generate the freshest analytics on advertiser performance to optimize marketing spend and increase responsiveness to clients.
Digital Ad Tech Metering with Kinesis
• Continuous ad metrics extraction
• Incremental ad statistics computation
• Metering record archive
• Ad analytics dashboard
Collection of Data
Sources (web servers, application servers, connected devices, mobile phones, etc.) → aggregation tool (a scalable method to collect and aggregate: Flume, Kafka, Kinesis, or a queue) → data sink (a reliable and durable destination or destinations)
Cloud Database & Storage
Cloud Database and Storage Tier Anti-pattern
App/Web Tier
Client Tier
RDBMS
Database & Storage Tier = All-in-one?
Cloud Database and Storage Tier — Use the Right Tool for the Job!
App/Web Tier
Client Tier
Data Tier
Database & Storage Tier
Search
Hadoop/HDFS
Cache
Blob Store
SQL NoSQL
App/Web Tier
Client Tier
Database & Storage Tier
Amazon RDS
Amazon DynamoDB
Amazon ElastiCache
Amazon S3
Amazon Glacier
Amazon CloudSearch
HDFS on Amazon EMR
Cloud Database and Storage Tier — Use the Right Tool for the Job!
What Database and Storage Should I Use?
• Data structure
• Query complexity
• Data characteristics: hot, warm, cold
Data Structure and Query Types vs Storage Technology
• Structured, simple query: NoSQL (Amazon DynamoDB), cache (Amazon ElastiCache)
• Structured, complex query: SQL (Amazon RDS), search (Amazon CloudSearch)
• Unstructured, no query: cloud storage (Amazon S3, Amazon Glacier)
• Unstructured, custom query: Hadoop/HDFS (Amazon Elastic MapReduce)
(Chart axes: data structure complexity vs. query structure complexity)
What is the Temperature of Your Data?
Hot to cold:
• Request rate: high to low
• Cost/GB: high to low
• Latency: low to high
• Data volume: low to high
(Chart: Amazon ElastiCache and Amazon DynamoDB at the hot end; Amazon RDS, Amazon CloudSearch, and HDFS in the middle; Amazon S3 and Amazon Glacier at the cold end; structure ranges from high to low.)
What Data Store Should I Use?
(Columns: Amazon ElastiCache | Amazon DynamoDB | Amazon RDS | Amazon CloudSearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier)
Average latency: ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~size) | hrs
Data volume: GB | GB–TB (no limit) | GB–TB (3 TB max) | GB–TB | GB–PB (~nodes) | GB–PB (no limit) | GB–PB (no limit)
Item size: B–KB | KB (64 KB max) | KB (~row size) | KB (1 MB max) | MB–GB | KB–GB (5 TB max) | GB (40 TB max)
Request rate: Very high | Very high | High | High | Low–very high | Low–very high (no limit) | Very low (no limit)
Storage cost ($/GB/month): $$ | ¢¢ | ¢¢ | $ | ¢ | ¢ | ¢
Durability: Low–moderate | Very high | High | High | High | Very high | Very high
(Hot data on the left, warm data in the middle, cold data on the right.)
Decouple your storage and analysis engine:
1. Single version of truth
2. Choice of multiple analytics tools
3. Parallel execution by different teams
4. Lower cost
Learning from Netflix
S3 as a “single source of truth”
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
(Diagram: choose the collection/storage layer depending upon design: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, or Kinesis.)
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Process
• Answering questions about data
• Questions:
  – Analytics: think SQL/data warehouse
  – Classification: think sentiment analysis
  – Prediction: think page-view prediction
  – Etc.
Processing Frameworks
Generally come in a few major types:
• Batch processing
• Stream processing
• Interactive query
Batch Processing
• Take a large amount of cold data and ask questions
• Takes minutes or hours to get answers back
Example: generating hourly, daily, or weekly reports
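At its core, the hourly-report example above is a group-by over cold data. A toy batch job over parsed click-stream tuples, with plain Python standing in for what would typically be a Hive or Redshift aggregation:

```python
from collections import Counter
from datetime import datetime

# (timestamp, url) click-stream tuples, as a batch job would read them from S3
events = [
    ("2015-07-09T02:10:00", "/home"),
    ("2015-07-09T02:40:00", "/home"),
    ("2015-07-09T03:05:00", "/checkout"),
    ("2015-07-09T03:59:00", "/home"),
]

# Hourly report: page views per (hour, url) bucket
report = Counter(
    (datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:00"), url)
    for ts, url in events
)
for (hour, url), views in sorted(report.items()):
    print(hour, url, views)
```

The equivalent Hive query would be a `GROUP BY hour, url` over the same log table; the shape of the computation is identical.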
Stream Processing (AKA Real Time)
• Take a small amount of hot data and ask questions
• Takes a short amount of time to get your answer back
Example: 1-minute metrics
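A 1-minute metric can be computed by folding each event into its minute bucket as it arrives, so a window's answer is ready moments after the window closes. A minimal tumbling-window sketch (pure Python, standing in for a Kinesis consumer or Storm bolt):

```python
from collections import defaultdict

# Tumbling 1-minute windows over a stream of (epoch_seconds, count) events:
# each event is folded into its minute bucket on arrival.
windows = defaultdict(int)  # minute index -> count

def on_event(epoch_seconds, value=1):
    windows[epoch_seconds // 60] += value

for t in [0, 10, 59, 60, 61, 125]:   # simulated arrival times (seconds)
    on_event(t)

print(dict(windows))  # {0: 3, 1: 2, 2: 1}
```

This is the batch group-by turned inside out: instead of scanning all the data at report time, the aggregate is maintained incrementally as events stream in.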
Processing Tools
• Batch processing/analytics
  – Amazon Redshift
  – Amazon EMR (Hive/Tez, Pig, Spark, Impala, Presto, …)
• Stream processing
  – Apache Spark Streaming
  – Apache Storm (+ Trident)
  – Amazon Kinesis client and connector library
Amplab Big Data Benchmark: scan query, aggregate query, join query
https://amplab.cs.berkeley.edu/benchmark/
What Batch Processing Technology Should I Use?
(Columns: Redshift | Impala | Presto | Spark | Hive)
Query latency: Low | Low | Low | Low–medium | Medium–high
Durability: High | High | High | High | High
Data volume: 1.6 PB max | ~nodes | ~nodes | ~nodes | ~nodes
Managed: Yes | EMR bootstrap | EMR bootstrap | EMR bootstrap | Yes (EMR)
Storage: Native | HDFS | HDFS/S3 | HDFS/S3 | HDFS/S3
# of BI tools: High | Medium | High | Low | High
(Query latency: low is better)
What Stream Processing Technology Should I Use?
(Columns: Spark Streaming | Apache Storm + Trident | Kinesis Client Library)
Scale/throughput: ~nodes | ~nodes | ~nodes
Data volume: ~nodes | ~nodes | ~nodes
Manageability: Yes (EMR bootstrap) | Do it yourself | EC2 + Auto Scaling
Fault tolerance: Built-in | Built-in | KCL checkpointing
Programming languages: Java, Python, Scala | Java, Scala, Clojure | Java, Python
Hadoop-based Analysis
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR)
Your choice of tools on Hadoop/EMR
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR)
Hadoop-based Analysis
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR, running tools such as Spark and Shark or Cloudera Impala)
Hadoop is good for:
1. Ad hoc query analysis
2. Large unstructured data sets
3. Machine learning and advanced analytics
4. Schema-less data
SQL-based Low-Latency Analytics on Structured Data
SQL-based processing
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon Redshift, a petabyte-scale columnar data warehouse)
SQL-based processing for unstructured data
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR as a pre-processing framework → Amazon Redshift, a petabyte-scale columnar data warehouse)
Your choice of BI tools on the cloud
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR (pre-processing framework) → Amazon Redshift)
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collaboration and Sharing Insights
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift)
Sharing results and visualizations
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift → web app server and visualization tools)
Sharing results and visualizations at scale
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift → web app servers and visualization tools, scaled out)
Sharing results and visualizations
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift → business intelligence tools)
Geospatial Visualizations
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR (with GIS tools on Hadoop) and Amazon Redshift → business intelligence tools, GIS tools, and visualization tools)
Rinse and Repeat
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift → visualization tools, business intelligence tools, GIS tools on Hadoop, and GIS tools, orchestrated by AWS Data Pipeline)
The complete architecture
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift → visualization tools, business intelligence tools, GIS tools on Hadoop, and GIS tools, orchestrated by AWS Data Pipeline)
Reference: BDT403 Next Generation Big Data Platform @ Netflix
Architecture
Big Data
• 10+ PB DW on S3
• 1.2 PB read daily
• 100 TB written daily
• ~200 billion events daily
(Diagram: cloud apps send event data through Suro and Ursula to Amazon S3 every 15 min; Cassandra SSTables are processed by Aegisthus into dimension data on Amazon S3 daily; data pipelines run on top.)
@2013: Amazon S3 at the center, surrounded by storage, compute, service, and tools layers
@2014: Amazon S3 "v2.0", with the same storage, compute, service, and tools layers
Online Self-Study and Labs
A variety of online course materials and hands-on labs help you learn the basics of AWS and how to put it to use.
Instructor-Led Training
Learn how to build highly available, cost-efficient, and secure applications on the AWS cloud in classes led by expert AWS instructors. A range of in-person courses on architecture design and implementation is available.
AWS Certification
Certification exams let you validate your cloud expertise and experience and demonstrate your professional credentials.
http://aws.amazon.com/ko/training
A Variety of Training Programs
Thank you for joining the AWS Fundamentals Webinar Series! We hope this webinar helped answer your questions. Please share your feedback on today's webinar in the survey that follows.
[email protected]
http://twitter.com/AWSKorea
http://facebook.com/AmazonWebServices.ko
http://youtube.com/user/AWSKorea
http://slideshare.net/AWSKorea