AWS On-Site Seminar (Guro, Gasan, Pangyo) - How to Use Big Data on AWS (김일호...
TRANSCRIPT
Big Data on AWS
김일호, Solutions Architect
09-Nov-2016
Agenda
• AWS Big data building blocks
• AWS Big data platform
– Log data collection & storage
– Introducing Amazon Kinesis
– Data Analytics & Computation
– Collaboration & sharing
AWS Big data building blocks (brief)
Use the right tools
Amazon S3
Amazon Kinesis
Amazon DynamoDB
Amazon Redshift
Amazon Elastic MapReduce
Amazon S3
• Store anything
• Object storage
• Scalable
• 99.999999999% durability
Amazon Kinesis
• Real-time processing
• High throughput; elastic
• Easy to use
• EMR, S3, Redshift, DynamoDB integrations
Amazon DynamoDB
• NoSQL database
• Seamless scalability
• Zero admin
• Single digit millisecond latency
Amazon Redshift
• Relational data warehouse
• Massively parallel
• Petabyte scale
• Fully managed
• $1,000/TB/Year
Amazon Elastic MapReduce
• Hadoop/HDFS clusters
• Hive, Pig, Impala, HBase
• Easy to use; fully managed
• On-demand and spot pricing
• Tight integration with S3, DynamoDB, and Kinesis
[Diagram: Data management (HDFS, Amazon Redshift, Amazon RDS, Amazon S3, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, AWS Data Pipeline) and Hadoop ecosystem analytical tools]
[Diagram: Data Sources and AWS Data Pipeline across the four stages: Generation, Collection & storage, Analytics & computation, Collaboration & sharing]
[Diagram, same four stages: Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Direct Connect, AWS Storage Gateway, AWS Import/Export, Amazon Glacier, S3, Amazon Kinesis, Amazon EMR]
[Diagram, same four stages: Amazon EC2, Amazon EMR, Amazon Kinesis]
[Diagram, same four stages: Amazon Redshift, Amazon DynamoDB, Amazon RDS, S3, Amazon EC2, Amazon EMR, Amazon CloudFront, AWS CloudFormation, AWS Data Pipeline]
The right tools. At the right scale. At the right time.
AWS Big data platform
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collection of Data
• Sources: web servers, application servers, connected devices, mobile phones, etc.
• Aggregation tool: scalable method to collect and aggregate (Flume, Kafka, Kinesis, queue)
• Data sink: reliable and durable destination or destinations
Types of Data Ingest
• Transactional (Database)
– Database reads/writes
• File (Cloud Storage)
– Click-stream logs
• Stream (Stream Storage)
– Click-stream logs
Run your own log collector
[Diagram: your application (Amazon EC2) sends logs to Amazon S3, DynamoDB, or any other data store]
Use a Queue
[Diagram: Amazon Simple Queue Service (SQS) in front of Amazon S3, DynamoDB, or any other data store]
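The queue pattern on this slide can be sketched without any AWS dependency. The example below is a minimal illustration, not SQS itself: Python's standard `queue.Queue` stands in for the SQS queue, a plain list stands in for the data store, and `run_pipeline` and the upper-casing "processing" step are hypothetical names chosen for the sketch.

```python
import queue
import threading


def run_pipeline(log_lines, num_workers=2):
    """Push log lines through a queue to workers that write to a store."""
    buffer = queue.Queue()        # stand-in for an SQS queue (assumption)
    store = []                    # stand-in for S3, DynamoDB, or another store
    store_lock = threading.Lock()

    def worker():
        while True:
            line = buffer.get()
            if line is None:      # sentinel: no more work for this worker
                return
            with store_lock:
                store.append(line.upper())  # hypothetical processing step

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for line in log_lines:        # the application only talks to the queue
        buffer.put(line)
    for _ in threads:             # one sentinel per worker
        buffer.put(None)
    for t in threads:
        t.join()
    return sorted(store)
```

The producer never calls the store directly, so workers can be added or removed without touching the application.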
Agency Customer: Video Analytics on AWS
Elastic Load Balancer
Edge Servers on EC2
Workers on EC2
Logs Reports
HDFS Cluster
Amazon Simple Queue Service (SQS)
Amazon Simple Storage Service (S3)
Amazon Elastic MapReduce
Use a tool like Flume, Kafka, Honu, etc.
[Diagram: Flume running on EC2 delivers to Amazon S3, HDFS, or any other data store]
Stream Storage
Database
Cloud Storage
Why Stream Storage?
• Convert multiple streams into fewer persistent sequential streams
• Sequential streams are easier to process
[Diagram: Producers 1 to N write numbered records (1, 2, 3, 4) into Amazon Kinesis or Kafka, where they are interleaved into Shard or Partition 1 and Shard or Partition 2]
Why Stream Storage?
• Decouple producers and consumers
• Buffer
• Preserve client ordering
• Streaming MapReduce
• Consumer replay / reprocess
[Diagram: Producers 1 to N write into Amazon Kinesis or Kafka (Shard or Partition 1 and 2). Consumer 1 reports Count of Red = 4 and Count of Violet = 4; Consumer 2 reports Count of Blue = 4 and Count of Green = 4]
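The ordering and replay properties listed above can be illustrated with a small in-memory sketch. It assumes a single shard modeled as an append-only list; `ShardLog` and `count_colors` are hypothetical names, and the colors mirror the counts shown in the diagram rather than any real data.

```python
from collections import Counter


class ShardLog:
    """Append-only record log; a stand-in for one shard or partition."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1   # position acts as a sequence number

    def read_from(self, position):
        return self._records[position:]


def count_colors(log, position=0):
    """A consumer: count records by color, starting at a chosen position."""
    return Counter(log.read_from(position))


log = ShardLog()
for _ in range(4):
    for color in ("red", "violet"):
        log.append(color)

first_pass = count_colors(log)            # consumer reads from the start
replay = count_colors(log)                # re-reading yields the same records
tail = count_colors(log, position=6)      # or resume from a later position
```

Because the log keeps records after they are read, each consumer chooses its own starting position, which is what makes replay and reprocessing possible.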
Introducing Amazon Kinesis
Managed Service for Real-Time Processing of Big Data
[Diagram: data sources send records through an AWS endpoint into a stream of Shard 1, Shard 2, ... Shard N, with three Availability Zones shown. Consuming applications: App. 1 [Aggregate & De-Duplicate], App. 2 [Metric Extraction], App. 3 [Sliding Window Analysis], App. 4 [Machine Learning]. Destinations: S3, DynamoDB, Redshift, EMR]
Kinesis Architecture
[Diagram: Amazon Web Services with three AZs]
• Millions of sources producing 100s of terabytes per hour
• Front end: authentication, authorization
• Durable, highly consistent storage replicates data across three data centers (availability zones)
• Ordered stream of events supports multiple readers
• Real-time dashboards and alarms
• Machine learning algorithms or sliding window analytics
• Aggregate analysis in Hadoop or a data warehouse
• Aggregate and archive to S3
Inexpensive: $0.028 per million puts
Putting data into Kinesis
Managed Service for Ingesting Fast Moving Data
• Streams are made of Shards
– A Kinesis Stream is composed of multiple Shards
– Each Shard ingests up to 1 MB/sec of data, and up to 1000 TPS
– Each Shard emits up to 2 MB/sec of data
– All data is stored for 24 hours
– You scale Kinesis streams by adding or removing Shards
• Simple PUT interface to store data in Kinesis
– Producers use a PUT call to store data in a Stream
– A Partition Key is used to distribute the PUTs across Shards
– A unique Sequence # is returned to the Producer upon a successful PUT call
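The per-shard limits above translate into a simple sizing calculation. The sketch below uses only the figures quoted on these slides (1 MB/sec and 1000 TPS in, 2 MB/sec out, $0.028 per million puts), which reflect the time of the talk and may have changed since. The function names are hypothetical, and two simplifications are assumed: 1 MB = 1024 KB, and every consumer reads the full stream at the ingest rate.

```python
import math

INGEST_MB_PER_SHARD = 1.0        # per slide: each shard ingests up to 1 MB/sec
INGEST_TPS_PER_SHARD = 1000      # per slide: and up to 1000 puts/sec
EMIT_MB_PER_SHARD = 2.0          # per slide: each shard emits up to 2 MB/sec
PRICE_PER_MILLION_PUTS = 0.028   # per slide, in USD


def shards_needed(records_per_sec, avg_record_kb, num_consumers=1):
    """Smallest shard count that satisfies all three per-shard limits."""
    ingest_mb = records_per_sec * avg_record_kb / 1024
    emit_mb = ingest_mb * num_consumers   # assumption: each consumer reads all
    return max(
        1,
        math.ceil(ingest_mb / INGEST_MB_PER_SHARD),
        math.ceil(records_per_sec / INGEST_TPS_PER_SHARD),
        math.ceil(emit_mb / EMIT_MB_PER_SHARD),
    )


def put_cost_usd(total_puts):
    """Put charges only, at the per-million price quoted on the slide."""
    return total_puts / 1_000_000 * PRICE_PER_MILLION_PUTS
```

For example, 5000 records/sec at 2 KB each is bound by ingest bandwidth with one consumer, but by read bandwidth once three consumers read the same stream.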
[Diagram: multiple Producers put records into Kinesis Shards 1, 2, 3, 4, ... n]
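How a partition key spreads PUTs across shards can be sketched as follows. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains it. The sketch assumes the hash space is split evenly across shards, which is a simplification; `shard_for_key` is a hypothetical helper, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128   # MD5 produces a 128-bit value


def shard_for_key(partition_key, num_shards):
    """Map a partition key to a shard index, assuming equal hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // num_shards
    # The last shard absorbs the remainder when the split is not exact.
    return min(hash_value // range_size, num_shards - 1)
```

The same key always lands on the same shard, which is what preserves per-key ordering; a key with very high traffic therefore concentrates load on one shard.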
[Diagram: Kinesis Shards 1, 2, 3, 4, ... n are read by KCL Workers 1 to n running on EC2 instances]
Building Kinesis Apps
Client library for fault-tolerant, at-least-once, real-time processing
• Key streaming application attributes:
– Be distributed, to handle multiple shards
– Be fault tolerant, to handle failures in hardware or software
– Scale up and down as the number of shards increases or decreases
• Kinesis Client Library (KCL) helps with distributed processing:
– Automatically starts a Kinesis Worker for each shard
– Simplifies reading from the stream by abstracting individual shards
– Increases / decreases Kinesis Workers as # of shards changes
– Checkpoints to keep track of a Worker’s location in the stream
– Restarts Workers if they fail
• Use the KCL with Auto Scaling Groups
– Automatically add EC2 instances when load increases
– KCL redistributes Workers to use the new EC2 instances
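Checkpointing and restart together give at-least-once processing, which the following sketch illustrates. It is not the KCL API: `process_shard` is a hypothetical function, a dictionary stands in for the checkpoint store, and the crash is simulated with the `fail_at` parameter.

```python
def process_shard(records, checkpoints, shard_id, handler,
                  checkpoint_every=3, fail_at=None):
    """Process records from the last checkpoint; optionally simulate a crash."""
    start = checkpoints.get(shard_id, 0)
    for position in range(start, len(records)):
        if fail_at is not None and position == fail_at:
            raise RuntimeError("simulated worker failure")
        handler(records[position])
        if (position + 1) % checkpoint_every == 0:
            checkpoints[shard_id] = position + 1   # next position to read
    checkpoints[shard_id] = len(records)


records = list(range(10))
checkpoints = {}
seen = []

try:
    process_shard(records, checkpoints, "shard-1", seen.append, fail_at=5)
except RuntimeError:
    pass   # a replacement worker takes over the shard

process_shard(records, checkpoints, "shard-1", seen.append)
```

The first worker checkpoints after record 2 and fails before record 5, so the replacement resumes at record 3 and handles records 3 and 4 a second time. No record is lost, but handlers should tolerate duplicates.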
Amazon Kinesis: Key Developer Benefits
Easy Administration
Managed service for real-time streaming data collection, processing and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.
Real-time Performance
Perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.
High Throughput. Elastic
Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.
S3, EMR, Storm, Redshift, & DynamoDB Integration
Reliably collect, process, and transform all of your data in real-time & deliver to AWS data stores of choice, with Connectors for S3, Redshift, and DynamoDB.
Build Real-time Applications
Client libraries that enable developers to design and operate real-time streaming data processing applications.
Low Cost
Cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only for what you use.
Customers using Amazon Kinesis
Mobile/Social Gaming
• Deliver continuous/real-time delivery of game insight data by 100’s of game servers
• Custom-built solutions operationally complex to manage, & not scalable
• Delay with critical business data delivery
• Developer burden in building reliable, scalable platform for real-time data ingestion/processing
• Slow-down of real-time customer insights
• Accelerate time to market of elastic, real-time applications while minimizing operational overhead
Digital Advertising Tech.
• Generate real-time metrics, KPIs for online ad performance for advertisers/publishers
• Store + Forward fleet of log servers, and Hadoop based processing pipeline
• Lost data with Store/Forward layer
• Operational burden in managing reliable, scalable platform for real-time data ingestion/processing
• Batch-driven real-time customer insights
• Generate freshest analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients
Under NDA
Gaming Analytics with Amazon Kinesis
Digital Ad. Tech Metering with Kinesis
Continuous Ad Metrics Extraction
Incremental Ad. Statistics Computation
Metering Record Archive
Ad Analytics Dashboard
Amazon Kinesis Firehose
Cloud Database & Storage
Cloud Database and Storage Tier Anti-pattern
App/Web Tier
Client Tier
Database & Storage Tier = All-in-one?
Cloud Database and Storage Tier: Use the Right Tool for the Job!
App/Web Tier
Client Tier
Data Tier
Database & Storage Tier: Search, Hadoop/HDFS, Cache, Blob Store, SQL, NoSQL
Database & Storage Tier on AWS: Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon S3, Amazon Glacier, Amazon CloudSearch, HDFS on Amazon EMR
What Database and Storage Should I Use?
• Data structure
• Query complexity
• Data characteristics: hot, warm, cold
Data Structure and Query Types vs Storage Technology
• Structured, simple query: NoSQL (Amazon DynamoDB), Cache (Amazon ElastiCache)
• Structured, complex query: SQL (Amazon RDS), Search (Amazon CloudSearch)
• Unstructured, no query: Cloud Storage (Amazon S3, Amazon Glacier)
• Unstructured, custom query: Hadoop/HDFS (Amazon Elastic MapReduce)
Axes: Data Structure Complexity, Query Structure Complexity
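The four quadrants above can be written down as a lookup table. This is only a restatement of the slide in code form: the dictionary keys and the `suggest_stores` helper are hypothetical names, and a real choice would also weigh the data temperature discussed on the following slides.

```python
# Keys are (data structure, query type) as named on the slide.
STORE_BY_SHAPE = {
    ("structured", "simple"): ["Amazon DynamoDB", "Amazon ElastiCache"],
    ("structured", "complex"): ["Amazon RDS", "Amazon CloudSearch"],
    ("unstructured", "none"): ["Amazon S3", "Amazon Glacier"],
    ("unstructured", "custom"): ["Amazon Elastic MapReduce"],
}


def suggest_stores(structure, query):
    """Return the services the slide lists for this quadrant."""
    key = (structure.lower(), query.lower())
    if key not in STORE_BY_SHAPE:
        raise ValueError("no quadrant for %r" % (key,))
    return STORE_BY_SHAPE[key]
```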
What is the Temperature of Your Data?
[Diagram: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, and Amazon Glacier placed along these axes: Request rate (High to Low), Cost/GB (High to Low), Latency (Low to High), Data volume (Low to High), Structure (Low to High)]
What Data Store Should I Use?
Attribute | Amazon ElastiCache | Amazon DynamoDB | Amazon RDS | Amazon CloudSearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier
Average latency | ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~ size) | hrs
Data volume | GB | GB–TBs (no limit) | GB–TB (3 TB max) | GB–TB | GB–PB (~ nodes) | GB–PB (no limit) | GB–PB (no limit)
Item size | B–KB | KB (64 KB max) | KB (~ row size) | KB (1 MB max) | MB–GB | KB–GB (5 TB max) | GB (40 TB max)
Request rate | Very High | Very High | High | High | Low – Very High | Low – Very High (no limit) | Very Low (no limit)
Storage cost $/GB/month | $$ | ¢¢ | ¢¢ | $ | ¢ | ¢ | ¢
Durability | Low – Moderate | Very High | High | High | High | Very High | Very High
Scale, left to right: Hot Data, Warm Data, Cold Data
Decouple your storage and analysis engine
1. Single Version of Truth
2. Choice of multiple analytics tools
3. Parallel execution from different teams
4. Lower cost
S3 as a “single source of truth”
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
S3
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NoSQL store
Kinesis
Choose depending upon design
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Process
• Answering questions about data
• Questions
– Analytics: Think SQL/data warehouse
– Classification: Think sentiment analysis
– Prediction: Think page-views prediction
– Etc.
Processing Frameworks
Generally come in these major types:
• Batch processing
• Stream processing
• Interactive query
Batch Processing
• Take large amount of cold data and ask questions
• Takes minutes or hours to get answers back
Example: Generating hourly, daily, weekly reports
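A batch job of this kind reads a complete, already-stored data set and produces one report. The sketch below assumes a hypothetical log format of `<ISO timestamp> <path>` per line; `hourly_report` is an illustrative name, and a real job would read from S3 or HDFS rather than from a Python list.

```python
from collections import Counter
from datetime import datetime


def hourly_report(log_lines):
    """Count requests per hour from lines like '2016-11-09T10:15:00 /path'."""
    counts = Counter()
    for line in log_lines:
        timestamp, _path = line.split(" ", 1)
        parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")
        counts[parsed.strftime("%Y-%m-%d %H:00")] += 1
    return dict(sorted(counts.items()))
```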
Process
Stream Processing (AKA Real Time)
• Take small amount of hot data and ask questions
• Takes short amount of time to get your answer back
Example: 1min metrics
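One-minute metrics are typically computed with a tumbling window: events are grouped into fixed, non-overlapping intervals, and a result is emitted as soon as an interval closes. The sketch below assumes events arrive in timestamp order as `(epoch_seconds, value)` pairs; `one_minute_metrics` is a hypothetical name, and late or out-of-order events are not handled.

```python
def one_minute_metrics(events, window_seconds=60):
    """Yield (window_start, count, total) each time a window closes."""
    current_start = None
    count = 0
    total = 0.0
    for ts, value in events:
        window_start = ts - (ts % window_seconds)
        if current_start is None:
            current_start = window_start
        if window_start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, count, total
            current_start, count, total = window_start, 0, 0.0
        count += 1
        total += value
    if current_start is not None:
        yield current_start, count, total   # flush the last open window
```

Unlike the batch case, only the state of the current window is held in memory, so answers arrive while the data is still hot.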
Processing Tools
• Batch processing/analytics
– Amazon Redshift
– Amazon EMR: Hive/Tez, Pig, Spark, Impala, Presto, ...
• Stream processing
– Apache Spark Streaming
– Apache Storm (+ Trident)
– Amazon Kinesis client and connector library
AMPLab Big Data Benchmark
[Charts: Scan query, Aggregate query, Join query]
https://amplab.cs.berkeley.edu/benchmark/
What Batch Processing Technology Should I Use?
Attribute | Redshift | Impala | Presto | Spark | Hive
Query latency (low is better) | Low | Low | Low | Low – Medium | Medium – High
Durability | High | High | High | High | High
Data volume | 1.6 PB max | ~ Nodes | ~ Nodes | ~ Nodes | ~ Nodes
Managed | Yes | EMR bootstrap | EMR bootstrap | EMR bootstrap | Yes (EMR)
Storage | Native | HDFS | HDFS/S3 | HDFS/S3 | HDFS/S3
# of BI tools | High | Medium | High | Low | High
What Stream Processing Technology Should I Use?
Attribute | Spark Streaming | Apache Storm + Trident | Kinesis Client Library
Scale/Throughput | ~ Nodes | ~ Nodes | ~ Nodes
Data volume | ~ Nodes | ~ Nodes | ~ Nodes
Manageability | Yes (EMR bootstrap) | Do it yourself | EC2 + Auto Scaling
Fault tolerance | Built-in | Built-in | KCL checkpointing
Programming languages | Java, Python, Scala | Java, Scala, Clojure | Java, Python
Amazon Kinesis Analytics
Hadoop based Analysis
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, and log aggregation tools feed Amazon EMR]
Your choice of tools on Hadoop/EMR
[Diagram: the same sources feed Amazon EMR]
Hadoop based Analysis
[Diagram: the same sources feed Amazon EMR, with Spark and Shark and Cloudera Impala]
Hadoop is good for
1. Ad hoc query analysis
2. Large unstructured data sets
3. Machine learning and advanced analytics
4. Schema-less
SQL based Low Latency Analytics on structured data
SQL based processing
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, and log aggregation tools feed Amazon Redshift, a petabyte-scale columnar data warehouse]
SQL based processing for unstructured data
[Diagram: the same sources feed Amazon EMR as a pre-processing framework, then Amazon Redshift, a petabyte-scale columnar data warehouse]
Your choice of BI Tools on the cloud
[Diagram: the same sources feed Amazon EMR as a pre-processing framework, then Amazon Redshift]
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collaboration and Sharing insights
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, and log aggregation tools feed Amazon EMR and Amazon Redshift]
Sharing results and visualizations
[Diagram: the same pipeline, with a web app server and visualization tools]
Sharing results and visualizations and scale
[Diagram: the same pipeline, with a web app server and visualization tools]
Sharing results and visualizations
[Diagram: the same pipeline, with Business Intelligence tools]
Geospatial Visualizations
[Diagram: the same pipeline, with Business Intelligence tools, GIS tools on Hadoop, GIS tools, and visualization tools]
Rinse and Repeat
[Diagram: the same pipeline and tools, with AWS Data Pipeline]
The complete architecture
[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools, Amazon EMR, Amazon Redshift, visualization tools, Business Intelligence tools, GIS tools on Hadoop, GIS tools, AWS Data Pipeline]
Expanding analytics architecture
Adding Amazon Kinesis Analytics, Amazon Machine Learning, and Amazon Elasticsearch Service
[Diagram services: Log Generator, Amazon Simple Storage Service, Amazon Glacier, Amazon Redshift, Amazon Elastic MapReduce, Amazon DynamoDB, Amazon Machine Learning, Amazon Kinesis, Other Apps]
[Diagram labels: Data Lake, Archive, Data Warehouse, Semi-structured, NoSQL, Predictive Models, Streaming]
Creating summary tables from log table
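Creating a summary table from a log table is usually a single `CREATE TABLE ... AS SELECT` statement. The sketch below runs that pattern against an in-memory SQLite database so it can be tried locally; SQLite stands in for the data warehouse, and the table names, column names, and `build_summary` helper are all hypothetical.

```python
import sqlite3


def build_summary(rows):
    """Load (ts, page, bytes) rows, then derive a per-page summary table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE access_log (ts TEXT, page TEXT, bytes INTEGER)")
    conn.executemany("INSERT INTO access_log VALUES (?, ?, ?)", rows)
    # The summary is materialized once so dashboards need not rescan the log.
    conn.execute(
        """
        CREATE TABLE page_summary AS
        SELECT page, COUNT(*) AS hits, SUM(bytes) AS total_bytes
        FROM access_log
        GROUP BY page
        """
    )
    result = conn.execute(
        "SELECT page, hits, total_bytes FROM page_summary ORDER BY page"
    ).fetchall()
    conn.close()
    return result
```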
Amazon Elasticsearch Service
AWS Lambda
Amazon Kinesis Analytics