AWS On-Site Seminar (Guro, Gasan, Pangyo) - How to Use Big Data on AWS (김일호...

Big Data on AWS

김일호, Solutions Architect

09-Nov-2016

Agenda

• AWS Big data building blocks

• AWS Big data platform

– Log data collection & storage

– Introducing Amazon Kinesis

– Data Analytics & Computation

– Collaboration & sharing

AWS Big data building blocks (brief)

Use the right tools

Amazon S3

Amazon Kinesis

Amazon DynamoDB

Amazon Redshift

Amazon Elastic MapReduce

Amazon S3

• Store anything
• Object storage
• Scalable
• 99.999999999% durability

Amazon Kinesis

• Real-time processing
• High throughput; elastic
• Easy to use
• EMR, S3, Redshift, DynamoDB integrations

Amazon DynamoDB

• NoSQL database
• Seamless scalability
• Zero admin
• Single-digit millisecond latency

Amazon Redshift

• Relational data warehouse
• Massively parallel
• Petabyte scale
• Fully managed
• $1,000/TB/year

Amazon Elastic MapReduce

• Hadoop/HDFS clusters
• Hive, Pig, Impala, HBase
• Easy to use; fully managed
• On-demand and spot pricing
• Tight integration with S3, DynamoDB, and Kinesis

[Diagram: Data Sources; HDFS, Amazon Redshift, Amazon RDS, Amazon S3, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, AWS Data Pipeline; labels: Data management, Hadoop Ecosystem, analytical tools]

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

[Diagram: Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Direct Connect, AWS Storage Gateway, AWS Import/Export, Amazon Glacier, Amazon S3, Amazon Kinesis, Amazon EMR]

[Diagram: Amazon EC2, Amazon EMR, Amazon Kinesis]

[Diagram: Amazon Redshift, Amazon DynamoDB, Amazon RDS, Amazon S3, Amazon EC2, Amazon EMR, Amazon CloudFront, AWS CloudFormation, AWS Data Pipeline]




The right tools. At the right scale. At the right time.

AWS Big data platform





Collection of Data

• Sources: web servers, application servers, connected devices, mobile phones, etc.
• Aggregation tool: scalable method to collect and aggregate (Flume, Kafka, Kinesis, queue)
• Data sink: reliable and durable destination or destinations

Types of Data Ingest

• Transactional
– Database reads/writes
– Data sink: Database
• File
– Click-stream logs
– Data sink: Cloud Storage
• Stream
– Click-stream logs
– Data sink: Stream Storage

Run your own log collector

[Diagram: your application; log collector on Amazon EC2; Amazon S3, DynamoDB, any other data store]

Use a Queue

[Diagram: Amazon Simple Queue Service (SQS); Amazon S3, DynamoDB, any other data store]

Agency Customer: Video Analytics on AWS

[Diagram: Elastic Load Balancer; Edge Servers on EC2; Workers on EC2; Logs; Reports; HDFS Cluster; Amazon Simple Queue Service (SQS); Amazon Simple Storage Service (S3); Amazon Elastic MapReduce]

Use a tool like Flume, Kafka, Honu, etc.

[Diagram: Flume running on EC2; Amazon S3, HDFS, any other data store]

Stream Storage

Database

Cloud Storage


Why Stream Storage?

• Convert multiple streams into fewer persistent sequential streams
• Sequential streams are easier to process

[Diagram: Producers 1 to N each send records 1, 2, 3, 4 to Amazon Kinesis or Kafka, where a shard or partition holds them as a single sequence]


Why Stream Storage?

• Decouple producers and consumers
• Buffer
• Preserve client ordering
• Streaming MapReduce
• Consumer replay / reprocess

[Diagram: Producers 1 to N each send records 1, 2, 3, 4 to shards or partitions in Amazon Kinesis or Kafka. Consumer 1: count of red = 4, count of violet = 4. Consumer 2: count of blue = 4, count of green = 4.]
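The points on this slide (ordering per producer, decoupled consumers, replay) can be illustrated with a toy in-memory log. This is only a sketch: `StreamLog`, its methods, and the rule that pins each producer to one shard are hypothetical and are not the Kinesis or Kafka API.

```python
from collections import defaultdict


class StreamLog:
    """Toy stand-in for a sharded stream (hypothetical, not a Kinesis or Kafka client)."""

    def __init__(self, num_shards):
        # One append-only list per shard; list order is arrival order.
        self.shards = [[] for _ in range(num_shards)]

    def put(self, producer_id, record):
        # Assumption: each producer is pinned to one shard, so its records stay in order.
        shard_id = producer_id % len(self.shards)
        self.shards[shard_id].append((producer_id, record))
        return shard_id, len(self.shards[shard_id]) - 1  # (shard, offset)

    def read(self, shard_id, start_offset=0):
        # Reading never removes data, so any consumer can replay from any offset.
        return self.shards[shard_id][start_offset:]


def count_by_producer(records):
    """Example consumer: count the records seen per producer."""
    counts = defaultdict(int)
    for producer_id, _ in records:
        counts[producer_id] += 1
    return dict(counts)


# Four producers each send records 1..4, interleaved, into two shards.
log = StreamLog(num_shards=2)
for seq in (1, 2, 3, 4):
    for producer_id in (0, 1, 2, 3):
        log.put(producer_id, seq)

shard0 = log.read(0)
shard1 = log.read(1)
```

Each shard ends up holding two producers' records, and a consumer reading one shard counts four records per producer, as in the diagram. A second consumer can call `read` on the same shard again, from any offset, without affecting the first.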

Introducing Amazon Kinesis

Introducing Amazon Kinesis: Managed Service for Real-Time Processing of Big Data

[Diagram: Data Sources; AWS Endpoint; Shard 1, Shard 2, ... Shard N across three Availability Zones; App. 1 [Aggregate & De-Duplicate]; App. 2 [Metric Extraction]; App. 3 [Sliding Window Analysis]; App. 4 [Machine Learning]; S3, DynamoDB, Redshift, EMR]

Kinesis Architecture

• Millions of sources producing 100s of terabytes per hour
• Front end: authentication and authorization
• Durable, highly consistent storage replicates data across three data centers (Availability Zones)
• Ordered stream of events supports multiple readers
• Real-time dashboards and alarms
• Machine learning algorithms or sliding window analytics
• Aggregate and archive to S3
• Aggregate analysis in Hadoop or a data warehouse
• Inexpensive: $0.028 per million puts
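As a rough worked example of the put price quoted above, the arithmetic can be sketched as follows. The function name and the traffic figure of 1,000 puts per second are illustrative assumptions. Only the per-put figure from the slide is counted, so any other charges are outside this sketch.

```python
PRICE_PER_MILLION_PUTS_USD = 0.028  # figure quoted on the slide


def put_cost_usd(puts_per_second, seconds):
    """Cost of PUT calls alone at the slide's quoted rate."""
    total_puts = puts_per_second * seconds
    return total_puts / 1_000_000 * PRICE_PER_MILLION_PUTS_USD


# Assumed workload: a steady 1,000 puts per second for one day.
daily = put_cost_usd(puts_per_second=1000, seconds=24 * 60 * 60)
print(f"{daily:.2f} USD per day")
```

At that assumed rate the stream receives 86.4 million puts per day, which comes to about 2.42 USD per day for puts.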

Putting data into Kinesis
Managed Service for Ingesting Fast Moving Data

• Streams are made of Shards
– A Kinesis Stream is composed of multiple Shards
– Each Shard ingests up to 1 MB/sec of data, and up to 1000 TPS
– Each Shard emits up to 2 MB/sec of data
– All data is stored for 24 hours
– You scale Kinesis streams by adding or removing Shards
• Simple PUT interface to store data in Kinesis
– Producers use a PUT call to store data in a Stream
– A Partition Key is used to distribute the PUTs across Shards
– A unique Sequence # is returned to the Producer upon a successful PUT call
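The per-shard limits and the partition-key routing above can be sketched in a few lines. This is a simplified model, not a Kinesis client. Kinesis hashes partition keys with MD5 into a 128-bit space, but the equal split of that space across shards, and the helper names `shards_needed` and `shard_for_key`, are assumptions made for illustration.

```python
import hashlib
import math

# Per-shard limits quoted on the slide.
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000
EGRESS_MB_PER_SEC = 2.0

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shards_needed(in_mb_per_sec, in_records_per_sec, out_mb_per_sec):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(in_mb_per_sec / INGEST_MB_PER_SEC),
        math.ceil(in_records_per_sec / INGEST_RECORDS_PER_SEC),
        math.ceil(out_mb_per_sec / EGRESS_MB_PER_SEC),
    )


def shard_for_key(partition_key, num_shards):
    """Map a partition key to a shard index.

    Assumption: the hash space is split into equal contiguous ranges,
    one per shard.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * num_shards // HASH_SPACE


# Assumed workload: 5 MB/sec in, 3,500 records/sec in, 12 MB/sec read out.
required = shards_needed(5, 3500, 12)
```

For the assumed workload the read side is the bottleneck (12 MB/sec at 2 MB/sec per shard), so `required` is 6. Because the shard index depends only on the key's hash, records with the same partition key always land on the same shard.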

[Diagram: Producers PUT records into a Kinesis stream (Shard 1, Shard 2, Shard 3, Shard 4, ... Shard n); KCL Workers 1 to n, running on EC2 instances, read from the shards]

Building Kinesis Apps
Client library for fault-tolerant, at-least-once, real-time processing

• Key streaming application attributes:

– Be distributed, to handle multiple shards

– Be fault tolerant, to handle failures in hardware or software

– Scale up and down as the number of shards increases or decreases

• Kinesis Client Library (KCL) helps with distributed processing:

– Automatically starts a Kinesis Worker for each shard

– Simplifies reading from the stream by abstracting individual shards

– Increases / Decreases Kinesis Workers as # of shards changes

– Checkpoints to keep track of a Worker’s location in the stream

– Restarts Workers if they fail

• Use the KCL with Auto Scaling Groups

– Automatically add EC2 instances when load increases

– KCL redistributes Workers to use the new EC2 instances
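The checkpoint-and-restart behaviour described above, and why it gives at-least-once rather than exactly-once processing, can be shown with a small simulation. This is a toy model and not the KCL API: `assign_shards`, `ShardWorker`, the round-robin policy, and checkpointing every two records are all assumptions for illustration.

```python
def assign_shards(shard_ids, instance_ids):
    """Spread one worker per shard across instances round-robin (hypothetical policy)."""
    assignment = {instance: [] for instance in instance_ids}
    for i, shard in enumerate(sorted(shard_ids)):
        assignment[instance_ids[i % len(instance_ids)]].append(shard)
    return assignment


class ShardWorker:
    """Processes one shard and checkpoints its position (toy model, not the KCL API)."""

    def __init__(self, records, checkpoints, shard_id, checkpoint_every=2):
        self.records = records
        self.checkpoints = checkpoints  # shared dict: shard_id -> next index to read
        self.shard_id = shard_id
        self.checkpoint_every = checkpoint_every
        self.processed = []

    def run(self, fail_at=None):
        # Resume from the last checkpoint, or from the start if none exists.
        start = self.checkpoints.get(self.shard_id, 0)
        for index in range(start, len(self.records)):
            if fail_at is not None and index == fail_at:
                raise RuntimeError("simulated worker crash")
            self.processed.append(self.records[index])
            if (index + 1) % self.checkpoint_every == 0:
                self.checkpoints[self.shard_id] = index + 1
        self.checkpoints[self.shard_id] = len(self.records)


records = ["a", "b", "c", "d", "e"]
checkpoints = {}

first = ShardWorker(records, checkpoints, "shard-1")
try:
    first.run(fail_at=3)  # crashes after processing "c", which was not checkpointed
except RuntimeError:
    pass

replacement = ShardWorker(records, checkpoints, "shard-1")
replacement.run()  # restarts from the last checkpoint (index 2)

one_instance = assign_shards(["s1", "s2", "s3", "s4"], ["i-a"])
two_instances = assign_shards(["s1", "s2", "s3", "s4"], ["i-a", "i-b"])
```

The first worker processes a, b, c and the replacement processes c, d, e. Record c is handled twice and nothing is skipped, which is the at-least-once guarantee, so downstream processing should tolerate duplicates. Adding a second instance moves half of the shard workers onto it.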


Amazon Kinesis: Key Developer Benefits

Easy Administration
Managed service for real-time streaming data collection, processing and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.

Real-time Performance
Perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.

High Throughput. Elastic
Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.

S3, EMR, Storm, Redshift, & DynamoDB Integration
Reliably collect, process, and transform all of your data in real-time & deliver to AWS data stores of choice, with Connectors for S3, Redshift, and DynamoDB.

Build Real-time Applications
Client libraries that enable developers to design and operate real-time streaming data processing applications.

Low Cost
Cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only for what you use.

Customers using Amazon Kinesis

Mobile/Social Gaming
• Deliver continuous, real-time game insight data from 100s of game servers
• Custom-built solutions are operationally complex to manage and not scalable
– Delay with critical business data delivery
– Developer burden in building a reliable, scalable platform for real-time data ingestion/processing
– Slow-down of real-time customer insights
• Accelerate time to market of elastic, real-time applications while minimizing operational overhead

Digital Advertising Tech.
• Generate real-time metrics and KPIs for online ad performance for advertisers/publishers
• Store-and-forward fleet of log servers, and Hadoop-based processing pipeline
– Lost data with the store-and-forward layer
– Operational burden in managing a reliable, scalable platform for real-time data ingestion/processing
– Batch-driven real-time customer insights
• Generate the freshest analytics on advertiser performance to optimize marketing spend and increase responsiveness to clients

Under NDA

Gaming Analytics with Amazon Kinesis

Digital Ad. Tech Metering with Kinesis

Continuous Ad Metrics Extraction

Incremental Ad. Statistics Computation

Metering Record Archive

Ad Analytics Dashboard

Amazon Kinesis Firehose


Cloud Database & Storage

Cloud Database and Storage Tier Anti-pattern

App/Web Tier

Client Tier

Database & Storage Tier = All-in-one?

Cloud Database and Storage Tier: Use the Right Tool for the Job!

[Diagram: Client Tier; App/Web Tier; Data Tier]

Database & Storage Tier

• SQL: Amazon RDS
• NoSQL: Amazon DynamoDB
• Cache: Amazon ElastiCache
• Blob store: Amazon S3, Amazon Glacier
• Search: Amazon CloudSearch
• Hadoop/HDFS: HDFS on Amazon EMR

What Database and Storage Should I Use?

• Data structure
• Query complexity
• Data characteristics: hot, warm, cold

Data Structure and Query Types vs Storage Technology

• Structured, simple query: NoSQL (Amazon DynamoDB), Cache (Amazon ElastiCache)
• Structured, complex query: SQL (Amazon RDS), Search (Amazon CloudSearch)
• Unstructured, no query: Cloud Storage (Amazon S3, Amazon Glacier)
• Unstructured, custom query: Hadoop/HDFS (Amazon Elastic MapReduce)

Axes: Data Structure Complexity; Query Structure Complexity

What is the Temperature of Your Data?

[Chart: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, Amazon Glacier; axis: Structure (Low to High)]

• Request rate: High to Low
• Cost/GB: High to Low
• Latency: Low to High
• Data volume: Low to High

What Data Store Should I Use?

Store | Average latency | Data volume | Item size | Request rate | Storage cost ($/GB/month) | Durability
Amazon ElastiCache | ms | GB | B-KB | Very High | $$ | Low - Moderate
Amazon DynamoDB | ms | GB–TBs (no limit) | KB (64 KB max) | Very High | ¢¢ | Very High
Amazon RDS | ms, sec | GB–TB (3 TB max) | KB (~row size) | High | ¢¢ | High
Amazon CloudSearch | ms, sec | GB–TB | KB (1 MB max) | High | $ | High
Amazon EMR (HDFS) | sec, min, hrs | GB–PB (~nodes) | MB-GB | Low – Very High | ¢ | High
Amazon S3 | ms, sec, min (~size) | GB–PB (no limit) | KB-GB (5 TB max) | Low – Very High (no limit) | ¢ | Very High
Amazon Glacier | hrs | GB–PB (no limit) | GB (40 TB max) | Very Low (no limit) | ¢ | Very High

Rows run from hot data (top) through warm data to cold data (bottom).

Decouple your storage and analysis engine

1. Single Version of Truth
2. Choice of multiple analytics Tools
3. Parallel execution from different teams
4. Lower cost

S3 as a “single source of truth”

Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html

S3

Choose depending upon design: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, Kinesis




Process

• Answering questions about data

• Questions
– Analytics: Think SQL/data warehouse
– Classification: Think sentiment analysis
– Prediction: Think page-views prediction
– Etc.

Processing Frameworks

Generally come in three major types:
• Batch processing
• Stream processing
• Interactive query

Batch Processing

• Take a large amount of cold data and ask questions

• Takes minutes or hours to get answers back

Example: Generating hourly, daily, weekly reports

Process

Stream Processing (AKA Real Time)

• Take a small amount of hot data and ask questions

• Takes a short amount of time to get your answer back

Example: 1-minute metrics
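A 1-minute metric of this kind amounts to counting events in fixed, non-overlapping windows. The sketch below assumes a hypothetical event format of `(epoch_seconds, key)` pairs and counts over an in-memory list. A real stream processor would update and emit the windows incrementally as records arrive.

```python
from collections import Counter


def one_minute_counts(events):
    """Count events per (window_start, key) for (epoch_seconds, key) pairs."""
    counts = Counter()
    for timestamp, key in events:
        window_start = timestamp - (timestamp % 60)  # floor to the minute
        counts[(window_start, key)] += 1
    return counts


# Hypothetical click-stream events: (seconds since start, event type).
events = [
    (0, "page_view"),
    (15, "page_view"),
    (59, "click"),
    (60, "page_view"),
    (119, "click"),
    (120, "click"),
]
metrics = one_minute_counts(events)
```

Here the first minute holds two page views and one click, the second minute one of each, and the event at second 120 opens a third window.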

Processing Tools

• Batch processing/analytics
– Amazon Redshift
– Amazon EMR: Hive/Tez, Pig, Spark, Impala, Presto, ...
• Stream processing
– Apache Spark Streaming
– Apache Storm (+ Trident)
– Amazon Kinesis client and connector library

AMPLab Big Data Benchmark

[Charts: scan query, aggregate query, join query]

https://amplab.cs.berkeley.edu/benchmark/

What Batch Processing Technology Should I Use?

Attribute | Redshift | Impala | Presto | Spark | Hive
Query Latency (low is better) | Low | Low | Low | Low - Medium | Medium - High
Durability | High | High | High | High | High
Data Volume | 1.6 PB max | ~Nodes | ~Nodes | ~Nodes | ~Nodes
Managed | Yes | EMR bootstrap | EMR bootstrap | EMR bootstrap | Yes (EMR)
Storage | Native | HDFS | HDFS/S3 | HDFS/S3 | HDFS/S3
# of BI Tools | High | Medium | High | Low | High

What Stream Processing Technology Should I Use?

Attribute | Spark Streaming | Apache Storm + Trident | Kinesis Client Library
Scale/Throughput | ~Nodes | ~Nodes | ~Nodes
Data Volume | ~Nodes | ~Nodes | ~Nodes
Manageability | Yes (EMR bootstrap) | Do it yourself | EC2 + Auto Scaling
Fault Tolerance | Built-in | Built-in | KCL checkpointing
Programming languages | Java, Python, Scala | Java, Scala, Clojure | Java, Python

Amazon Kinesis Analytics

Hadoop based Analysis

[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR]

Your choice of tools on Hadoop/EMR

[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Spark and Shark; Cloudera Impala]

Hadoop is good for

1. Ad hoc query analysis
2. Large unstructured data sets
3. Machine learning and advanced analytics
4. Schema-less

SQL based Low Latency Analytics on structured data

SQL based processing

[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon Redshift (petabyte-scale columnar data warehouse)]

SQL based processing for unstructured data

[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR (pre-processing framework); Amazon Redshift (petabyte-scale columnar data warehouse)]

Your choice of BI Tools on the cloud

[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR (pre-processing framework); Amazon Redshift]




Collaboration and Sharing insights

[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift]

Sharing results and visualizations

[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift; web app server; visualization tools]

Sharing results and visualizations and scale

[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift; web app server; visualization tools]

Sharing results and visualizations

[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift; business intelligence tools]

Geospatial Visualizations

[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift; business intelligence tools; GIS tools on Hadoop; GIS tools; visualization tools]

Rinse and Repeat

[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift; visualization tools; GIS tools on Hadoop; GIS tools; AWS Data Pipeline]

The complete architecture

[Diagram: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools; Amazon EMR; Amazon Redshift; visualization tools; GIS tools on Hadoop; GIS tools; AWS Data Pipeline]

Expanding analytics architecture

Adding Amazon Kinesis Analytics, Amazon Machine Learning, and Amazon Elasticsearch Service

[Diagram: log generator; Amazon Kinesis (streaming); Amazon Kinesis Analytics; AWS Lambda; Amazon Elasticsearch Service; Amazon Simple Storage Service (data lake); Amazon Glacier (archive); Amazon Redshift (data warehouse); Amazon Elastic MapReduce (semi-structured); Amazon DynamoDB (NoSQL); Amazon Machine Learning (predictive models); other apps]

Creating summary tables from log table