AWS Summit Seoul 2015 – Big Data and Real-time Streaming Analytics on the AWS Cloud


Upload: amazon-web-services-korea, posted 15-Jul-2015

TRANSCRIPT

Page 1: AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석

SEOUL

© 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Page 2:

Real-time Big Data and Streaming Analytics

김일호 – AWS Solutions Architect

Page 3:

Agenda

• Batch Processing: Amazon Elastic MapReduce (EMR)

• Real-time Processing: Amazon Kinesis

• Cost-saving Tips

Page 4:

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Page 5: (repeats the pipeline from Page 4: Generation, Collection & storage, Analytics & computation, Collaboration & sharing)

Page 6:

Batch processing

Amazon Elastic MapReduce (EMR)

Page 7:

Why Amazon EMR?

Easy to use: launch a cluster in minutes
Low cost: pay an hourly rate
Elastic: easily add or remove capacity
Reliable: spend less time monitoring
Secure: manage firewalls
Flexible: control the cluster

Page 8:

Easy to deploy

AWS Management Console or Command Line

Or use the Amazon EMR API with your favorite SDK.
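The deck mentions launching via the API with your favorite SDK. A minimal sketch with the AWS SDK for Python (boto3) might look like this; the log bucket, cluster name, and instance counts are illustrative placeholders, not values from the talk:

```python
# Sketch: assemble the parameters for EMR's RunJobFlow API call.
# All names below (bucket, cluster name, sizes) are placeholders.

def build_cluster_request(name, master_type="m3.xlarge",
                          core_type="m3.xlarge", core_count=4):
    """Return the parameter dict that boto3's run_job_flow accepts."""
    return {
        "Name": name,
        "LogUri": "s3://my-bucket/emr-logs/",      # placeholder bucket
        "Instances": {
            "MasterInstanceType": master_type,
            "SlaveInstanceType": core_type,
            "InstanceCount": 1 + core_count,       # master + core nodes
            "KeepJobFlowAliveWhenNoSteps": False,  # shut down when steps finish
        },
    }

request = build_cluster_request("nightly-hive-job", core_count=9)

# With credentials configured, the actual call would be:
#   import boto3
#   emr = boto3.client("emr", region_name="ap-northeast-2")
#   emr.run_job_flow(**request)
```

Setting `KeepJobFlowAliveWhenNoSteps` to `False` matches the transient-cluster pattern the deck recommends later: pay for compute only while jobs run.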

Page 9:

Easy to monitor and debug

Integrated with Amazon CloudWatch

Monitor cluster, node, and I/O metrics; debug from the console.

Page 10:

Hue

Amazon S3 and the Hadoop Distributed File System (HDFS)

Page 11:

Hue

Query Editor

Page 12:

Hue

Job Browser

Page 13:

Choose your instance types. Try different configurations to find your optimal architecture.

CPU: c3 family, cc1.4xlarge, cc2.8xlarge
Memory: m2 family, r3 family
Disk/IO: d2 family, i2 family
General: m1 family, m3 family

Typical workloads: batch processing, machine learning, Spark and interactive, large HDFS

Page 14:

Resizable clusters

Easy to add and remove compute capacity on your cluster. Match compute demands with cluster sizing.

Page 15:

Easy to use Spot Instances

Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing.
On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity.

Meet SLA at predictable cost; exceed SLA at lower cost.

Page 16:

Use bootstrap actions to install applications…

https://github.com/awslabs/emr-bootstrap-actions

Page 17:

…or to configure Hadoop

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop

--keyword-config-file (Merge values in new config to existing)

--keyword-key-value (Override values provided)

Configuration file   Keyword   File-name shortcut   Key-value shortcut
core-site.xml        core      C                    c
hdfs-site.xml        hdfs      H                    h
mapred-site.xml      mapred    M                    m
yarn-site.xml        yarn      Y                    y

Page 18:

Amazon EMR integration with Amazon Kinesis

Read data directly into Hive, Apache Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams; no intermediate data persistence required.

A simple way to introduce real-time sources into batch-oriented systems, with multi-application support and automatic checkpointing.

Page 19:

Amazon EMR: Leveraging Amazon S3

Page 20:

Amazon S3 as your persistent data store

• Amazon S3

– Designed for 99.999999999% durability

– Separate compute and storage

• Resize and shut down Amazon EMR clusters with no data loss

• Point multiple Amazon EMR clusters at same data in Amazon S3

Page 21:

EMRFS makes it easier to leverage Amazon S3

• Better performance and error handling options

• Transparent to applications – just read/write to “s3://”

• Consistent view

– For consistent list and read-after-write for new puts

• Support for Amazon S3 server-side and client-side encryption

• Faster listing using EMRFS metadata

Page 22:

EMRFS support for Amazon S3 client-side encryption

[Diagram] Amazon S3 encryption clients read and write client-side encrypted objects in Amazon S3; EMRFS is enabled for Amazon S3 client-side encryption; keys come from a key vendor (AWS KMS or your custom key vendor).

Page 23:

Amazon S3 EMRFS metadata in Amazon DynamoDB

• List and read-after-write consistency
• Faster list operations

Fast listing of Amazon S3 objects using EMRFS metadata:

Number of objects   Without consistent views   With consistent views
1,000,000           147.72                     29.70
100,000             12.70                      3.69

*Tested using a single-node cluster with an m3.xlarge instance.

Page 24:

Optimize to leverage HDFS

• Iterative workloads – If you’re processing the same dataset more than once

• Disk I/O intensive workloads

Persist data on Amazon S3 and use S3DistCp to copy to HDFS for processing.
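The S3-to-HDFS copy described above is usually submitted as an extra cluster step. A hedged sketch of assembling such a step for the EMR API; the paths are placeholders, and the jar location is the conventional one on EMR nodes of that era, an assumption rather than something stated in the deck:

```python
# Sketch: build an EMR step definition that runs S3DistCp to copy data
# from Amazon S3 into HDFS before an I/O-intensive job. Paths are placeholders.

def s3distcp_step(src, dest):
    """Return an EMR step dict invoking S3DistCp with --src/--dest."""
    return {
        "Name": "S3DistCp: S3 -> HDFS",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            # Assumed on-cluster jar path (2015-era EMR AMIs)
            "Jar": "/home/hadoop/lib/emr-s3distcp-1.0.jar",
            "Args": ["--src", src, "--dest", dest],
        },
    }

step = s3distcp_step("s3://my-bucket/input/", "hdfs:///data/input/")
# Submitted via emr.add_job_flow_steps(JobFlowId=..., Steps=[step])
```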

Page 25:

Amazon EMR: Design patterns

Page 26:

Amazon EMR example #1: Batch processing

GBs of logs pushed to Amazon S3 hourly. A daily Amazon EMR cluster uses Hive to process the data; input and output are stored in Amazon S3.

250 Amazon EMR jobs per day, processing 30 TB of data

http://aws.amazon.com/solutions/case-studies/yelp/

Page 27:

Amazon EMR example #2: Long-running cluster

Data pushed to Amazon S3. A daily Amazon EMR cluster Extracts, Transforms, and Loads (ETL) the data into a database. A 24/7 Amazon EMR cluster running HBase holds the last 2 years' worth of data. A front-end service uses the HBase cluster to power a dashboard with high concurrency.

Page 28:

Amazon EMR example #3: Interactive query

TBs of logs sent daily. Logs stored in Amazon S3. An Amazon EMR cluster uses Presto for ad hoc analysis of the entire log set.

Interactive query using Presto on a multi-petabyte warehouse

http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html

Page 29:

Real-time Processing

Amazon Kinesis

Page 30:

Real-time analytics

Real-time ingestion

• Highly scalable

• Durable

• Elastic

• Re-playable reads

Continuous processing

• Load-balancing incoming streams

• Fault-tolerance, check-pointing and replay

• Elastic

• Enables multiple apps to process in parallel

Continuous data flow
Low end-to-end latency
Continuous, real-time workloads

Page 31:

Data ingestion

Page 32:

Global top 10

example.com

Starting simple...

Page 33:

Global top 10

Distributing the workload…

example.com

Page 34:

Global top 10

Local top 10

Local top 10

Local top 10

Or using an elastic data broker…

example.com

Page 35:

Amazon Kinesis – managed stream

[Diagram] example.com sends data records to an Amazon Kinesis stream. Each data record carries a partition key and a sequence number (14, 17, 18, 21, 23) and lands on a shard; a worker consumes the stream to compute "My top 10" and the global top 10.

Page 36:

Amazon Kinesis – common data broker

[Diagram] Data sources in multiple Availability Zones put records to the AWS endpoint, which distributes them across Shard 1, Shard 2, … Shard N. Multiple applications consume the same stream in parallel: App. 1 (data archive), App. 2 (metric extraction), App. 3 (sliding-window analysis), and App. 4 (machine learning), feeding Amazon S3, Amazon DynamoDB, Amazon Redshift, and Amazon EMR.

Page 37:

Amazon Kinesis – stream and shards

• Stream: a named entity to capture and store data
• Shards: the unit of capacity
  – Put: 1 MB/sec or 1,000 TPS per shard
  – Get: 2 MB/sec or 5 TPS per shard
• Scale by adding or removing shards
• Replay within a 24-hour window
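Under the hood, Kinesis routes each record by taking the 128-bit MD5 hash of its partition key and placing it on the shard whose hash-key range contains that value. A small sketch of the mapping; the shard ranges below are illustrative, an even two-way split rather than anything from the deck:

```python
import hashlib

# Sketch: how a Kinesis partition key maps to a shard. The service hashes
# the key with MD5 (128 bits) and routes to the shard whose hash-key range
# contains the result. Ranges here are an illustrative even split.

def hash_key(partition_key: str) -> int:
    """128-bit integer MD5 hash of the partition key."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)

def shard_for(partition_key, shard_ranges):
    """shard_ranges: list of (start, end) pairs covering 0 .. 2**128 - 1."""
    h = hash_key(partition_key)
    for i, (start, end) in enumerate(shard_ranges):
        if start <= h <= end:
            return i
    raise ValueError("hash outside all shard ranges")

# Two shards splitting the hash space evenly:
HALF = 2**127
ranges = [(0, HALF - 1), (HALF, 2**128 - 1)]
print(shard_for("user-42", ranges))  # 0 or 1, depending on the MD5 hash
```

Records with the same partition key always land on the same shard, which is what preserves per-key ordering when you scale the stream.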

Page 38:

How to size your Amazon Kinesis stream

Consider 2 producers, each producing 2 KB records at 500 TPS:
2 KB * 500 TPS = 1000 KB/s per producer

Minimum of 2 shards for an ingress of 2 MB/s; 2 applications can read with an egress of 4 MB/s.
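The sizing arithmetic above can be captured in a small helper: shard count is driven by ingress (1 MB/s or 1,000 records/s per shard) and by total read throughput (2 MB/s per shard, shared by every consuming application). A sketch:

```python
import math

# Sketch: minimum shard count for a Kinesis stream, using the per-shard
# limits from the deck: 1 MB/s or 1000 puts/s in, 2 MB/s out (shared by
# all consumers of the stream).

def min_shards(ingress_mb_s, ingress_tps, num_consumers):
    by_ingress_mb = math.ceil(ingress_mb_s / 1.0)       # 1 MB/s in per shard
    by_ingress_tps = math.ceil(ingress_tps / 1000.0)    # 1000 puts/s per shard
    by_egress = math.ceil(num_consumers * ingress_mb_s / 2.0)  # 2 MB/s out per shard
    return max(by_ingress_mb, by_ingress_tps, by_egress)

# 2 producers x 2 KB x 500 TPS = 2 MB/s ingress, 1000 TPS, 2 consumers:
print(min_shards(ingress_mb_s=2.0, ingress_tps=1000, num_consumers=2))  # 2
```

With a third consuming application the egress term becomes ceil(3 × 2 / 2) = 3, which reproduces the "add another shard" answer on the next slide.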

Page 39:

How to size your Amazon Kinesis stream

Now consider 3 consuming applications, each processing the data. Simple! Add another shard to the stream to spread the load (2 KB * 500 TPS = 1000 KB/s per producer, now across 3 shards).

Page 40:

Amazon Kinesis – distributed streams

• From batch to continuous processing

• Scale UP or DOWN without losing sequencing

• Workers can replay records for up to 24 hours

• Scale up to GB/sec without losing durability

– Records stored across multiple Availability Zones

• Run multiple parallel Amazon Kinesis applications

Page 41:

Data processing

Page 42:

Pattern for real-time analytics…

[Diagram] Data streams feed three processing speeds: batch, micro-batch, and real time. Batch analysis runs on Hadoop and a data warehouse; streaming analytics run on Spark Streaming, Apache Storm, or the Amazon KCL; a data archive retains the raw stream. Outputs drive notifications & alerts, dashboards/visualizations, APIs, and deep learning.

Page 43:

Real-time analytics

• Streaming
  – Event-based response within seconds; for example, detecting whether a transaction is fraudulent
• Micro-batch
  – Operational insights within minutes; for example, monitoring transactions from different regions

Kinesis Client Library

Page 44:

Amazon Kinesis Client Library (Amazon KCL)

• Distributed to handle multiple shards
• Fault tolerant
• Elastically adjusts to shard count
• Helps with distributed processing

[Diagram] An Amazon Kinesis stream fanned out to KCL workers on multiple Amazon EC2 instances.

Page 45:

Amazon KCL design components

• Worker: The processing unit that maps to each application instance

• Record processor: The processing unit that processes data from a shard of an Amazon Kinesis stream

• Check-pointer: Keeps track of the records that have already been processed in a given shard

Amazon KCL restarts the processing of the shard at the last-known processed record if a worker fails
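The checkpoint-and-restart behavior can be illustrated with a toy in-memory checkpointer. The real Amazon KCL persists checkpoints in DynamoDB and leases shards to workers; this sketch only mimics the bookkeeping:

```python
# Sketch: checkpoint-and-resume bookkeeping, in the spirit of the KCL's
# check-pointer component. The real library stores this state in DynamoDB;
# here it is just a dict keyed by shard.

class Checkpointer:
    def __init__(self):
        self.last_done = {}                 # shard_id -> last sequence processed

    def checkpoint(self, shard_id, seq):
        self.last_done[shard_id] = seq

    def resume_after(self, shard_id):
        """Where a replacement worker should resume: just past the checkpoint."""
        return self.last_done.get(shard_id, -1) + 1

cp = Checkpointer()
records = list(range(10))                   # sequence numbers 0..9 on one shard

# A worker processes records 0..5, checkpointing as it goes, then "fails".
for seq in records[:6]:
    cp.checkpoint("shard-0", seq)

# The replacement worker picks up at the last-known processed record + 1.
print(cp.resume_after("shard-0"))  # 6
```

Checkpointing after every record minimizes re-processing on failure at the cost of more checkpoint writes; real applications typically checkpoint in batches.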

Page 46:

Amazon Kinesis Connector Library

• Amazon S3

– Archival of data

• Amazon Redshift

– Micro-batching loads

• Amazon DynamoDB

– Real-time Counters

• Elasticsearch

– Search and Index

[Diagram] Amazon Kinesis connectors emit to S3, DynamoDB, and Amazon Redshift.

Page 47:

EMR integration with Amazon Kinesis

Read data directly into Hive, Pig, Streaming, and Cascading from Amazon Kinesis: real-time sources into batch-oriented systems, with multi-application support & check-pointing.

Page 48:

Spark Streaming – basic concepts

• Higher-level abstraction called Discretized Streams (DStreams)
• Represented as sequences of Resilient Distributed Datasets (RDDs)

[Diagram] Receiver → Messages → DStream (RDD@T1, RDD@T2)

http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

Page 49:

Apache Storm: Basic concepts

• Streams: Unbounded sequence of tuples

• Spout: Source of stream

• Bolts: Processes that consume input streams and emit new streams

• Topologies: Network of spouts and bolts

https://github.com/awslabs/kinesis-storm-spout

Page 50:

Putting it together…

[Diagram] Producer → Amazon Kinesis. Consumers built on the Amazon KCL feed three paths: an app client for real-time processing, DynamoDB and Amazon Redshift (with BI tools) for micro-batch, and S3 with EMR for batch.

Page 51:

Ref.: AWS re:Invent 2014 session BDT310

Page 52:

Cost-saving tips

• Use Amazon S3 as your persistent data store (only pay for compute when you need it!).

• Use Amazon EC2 Spot Instances (especially with task nodes) to save 80 percent or more on the Amazon EC2 cost.

• Use Amazon EC2 Reserved Instances if you have steady workloads.

• Create CloudWatch alerts to notify you if a cluster is underutilized so that you can shut it down (e.g. Mappers running == 0 for more than N hours).

• If you are spending more than $10K per month on Amazon EMR, contact your sales rep about custom pricing options.
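A rough back-of-the-envelope for the Spot tip above: core nodes stay on demand while task nodes run at a Spot discount. The prices and node counts below are hypothetical placeholders, not quoted AWS rates:

```python
# Sketch: hourly cost of a mixed on-demand + Spot EMR cluster.
# Prices are illustrative placeholders, not actual AWS rates.

def hourly_cost(core_nodes, task_nodes, on_demand_price, spot_discount=0.8):
    """Core nodes billed at on-demand price; task nodes at a Spot discount."""
    spot_price = on_demand_price * (1 - spot_discount)
    return core_nodes * on_demand_price + task_nodes * spot_price

# 4 core + 16 task nodes at a hypothetical $0.25/hr on-demand rate:
all_on_demand = hourly_cost(4, 16, 0.25, spot_discount=0.0)  # 20 * 0.25 = 5.0
with_spot = hourly_cost(4, 16, 0.25, spot_discount=0.8)      # 1.0 + 0.8 = 1.8
print(all_on_demand, with_spot)
```

Keeping core nodes on demand protects HDFS data from Spot interruption, which is why the deck pairs Spot pricing with task nodes specifically.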

Page 53:

SEOUL

© 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved