Getting Started with Your First Big Data Project on AWS (Ilho Kim) - AWS Webinar Series 2015
TRANSCRIPT
Advanced Webinar Series | Session 8 | Thursday, July 9, 2015 | 2:00 PM
http://aws.amazon.com/ko
Getting Started with Your First Big Data Project on AWS
Ilho Kim (김일호), Solutions Architect
What you will hear in this webinar:
This session introduces how to build simpler, faster big data analytics services using the various data analysis tools AWS provides, such as AWS Elastic MapReduce, Amazon Redshift, and Amazon Kinesis.
Agenda
• AWS Big data building blocks
• AWS Big data platform
• Log data collection & storage
• Introducing Amazon Kinesis
• Data Analytics & Computation
• Collaboration & sharing
• Netflix Use-case
AWS Big data building blocks (brief)
Use the right tools
Amazon S3
Amazon Kinesis
Amazon DynamoDB
Amazon Redshift
Amazon Elastic MapReduce
Amazon S3: store anything; object storage; scalable; 99.999999999% durability
Amazon Kinesis: real-time processing; high throughput; elastic; easy to use; EMR, S3, Redshift, DynamoDB integrations
Amazon DynamoDB: NoSQL database; seamless scalability; zero admin; single-digit millisecond latency
Amazon Redshift: relational data warehouse; massively parallel; petabyte scale; fully managed; $1,000/TB/year
Amazon Elastic MapReduce: Hadoop/HDFS clusters; Hive, Pig, Impala, HBase; easy to use; fully managed; on-demand and spot pricing; tight integration with S3, DynamoDB, and Kinesis
HDFS
Amazon Redshift
Amazon RDS
Amazon S3
Amazon DynamoDB
Amazon EMR
Amazon Kinesis
AWS Data Pipeline
Data management / Hadoop ecosystem analytical tools
Data Sources
AWS Data Pipeline
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon DynamoDB
Amazon RDS
Amazon Redshift
AWS Direct Connect
AWS Storage Gateway
AWS Import/ Export
Amazon Glacier
Amazon S3
Amazon Kinesis
Amazon EMR
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon EC2 Amazon EMR Amazon Kinesis
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon Redshift
Amazon DynamoDB
Amazon RDS
Amazon S3
Amazon EC2
Amazon EMR
Amazon CloudFront
AWS CloudFormation
AWS Data Pipeline
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
The right tools. At the right scale. At the right time.
AWS Big data platform
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collection of Data
Sources → Aggregation tool → Data sink
• Sources: web servers, application servers, connected devices, mobile phones, etc.
• Aggregation tool: a scalable method to collect and aggregate, such as Flume, Kafka, Kinesis, or a queue
• Data sink: a reliable and durable destination (or destinations)
Types of data ingest:
• Transactional: database reads/writes (database)
• File: click-stream logs (cloud storage)
• Stream: click-stream logs (stream storage)
Run your own log collector
Your application (running on Amazon EC2) ships logs to Amazon S3, DynamoDB, or any other data store.
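Rolling your own collector usually comes down to buffering log lines and flushing them in batches to the destination store. A minimal sketch of that batching logic in Python; the `upload` callback and the thresholds are illustrative stand-ins (a real collector would pass an S3 upload function there):

```python
import time

class LogBatcher:
    """Buffer log lines and flush when a size or age threshold is hit."""
    def __init__(self, upload, max_lines=1000, max_age_sec=60):
        self.upload = upload          # callback, e.g. an S3 PUT in a real collector
        self.max_lines = max_lines
        self.max_age_sec = max_age_sec
        self.buf = []
        self.started = time.time()

    def add(self, line):
        self.buf.append(line)
        if len(self.buf) >= self.max_lines or time.time() - self.started >= self.max_age_sec:
            self.flush()

    def flush(self):
        if self.buf:
            self.upload("\n".join(self.buf))  # one object per batch
            self.buf = []
            self.started = time.time()

batches = []                           # stands in for the remote data store
b = LogBatcher(batches.append, max_lines=3)
for i in range(7):
    b.add(f"GET /page/{i} 200")
b.flush()                              # push out the final partial batch
print(len(batches))                    # 3 batches: 3 + 3 + 1 lines
```

Batching by size and age is what keeps the number of small PUTs (and their cost) under control.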
Use a Queue
Amazon Simple Queue Service (SQS)
Amazon S3
DynamoDB
Any other data store
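The queue's job is to decouple the producers from the data store: producers enqueue and move on, while workers drain the queue into storage at their own pace. A toy sketch with Python's standard-library queue standing in for SQS (illustration only; real SQS also involves visibility timeouts and retries):

```python
import queue
import threading

q = queue.Queue()   # stands in for Amazon SQS
store = []          # stands in for S3 / DynamoDB / any other data store

def worker():
    while True:
        msg = q.get()
        if msg is None:          # sentinel: shut the worker down
            break
        store.append(msg)        # "write to the data store"
        q.task_done()

t = threading.Thread(target=worker)
t.start()

for i in range(5):               # producers just enqueue and move on
    q.put({"event": i})
q.put(None)
t.join()
print(len(store))
```

The producer never blocks on the data store, which is exactly the property the SQS tier provides in the architecture above.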
Agency Customer: Video Analytics on AWS
Elastic Load Balancer
Edge Servers on EC2
Workers on EC2
Logs Reports
HDFS Cluster
Amazon Simple Queue Service (SQS)
Amazon Simple Storage Service (S3)
Amazon Elastic MapReduce
Use a tool like Flume, Kafka, Honu, etc.
Flume running on EC2 ships data to Amazon S3, HDFS, or any other data store.
Choice of tools
• (+) Pros / (-) Cons • (+) Flexibility: Customers select the most appropriate software and underlying
infrastructure • (+) Control: Software and hardware can be tuned to meet specific business and
scenario needs. • (-) Ongoing Operational Complexity: Deploy, and manage an end-to-end system • (-) Infrastructure planning and maintenance: Managing a reliable, scalable
infrastructure • (-) Developer/ IT staff overhead: Developers, Devops and IT staff time and
energy expended • (-) Unsupported Software: deprecated and/ pre-version 1 open source software • Future – Need for to stream data for real time
Stream Storage
Database
Cloud Storage
Why Stream Storage?
• Convert multiple streams into fewer persistent sequential streams
• Sequential streams are easier to process
Amazon Kinesis or Kafka
(Diagram: Producers 1 through N write ordered records into Amazon Kinesis or Kafka, distributed across Shard/Partition 1 and Shard/Partition 2.)
Amazon Kinesis or Kafka
Why Stream Storage?
• Decouple producers and consumers
• Buffer
• Preserve client ordering
• Streaming MapReduce
• Consumer replay / reprocess
(Diagram: Producers 1 through N write ordered records across Shard/Partition 1 and Shard/Partition 2; Consumer 1 counts red = 4 and violet = 4, Consumer 2 counts blue = 4 and green = 4.)
Introducing Amazon Kinesis
Managed Service for Real-Time Processing of Big Data
(Diagram: Data sources send records to the AWS endpoint; the stream's shards (Shard 1 … Shard N) are replicated across three Availability Zones; App.1 [Aggregate & De-Duplicate], App.2 [Metric Extraction], App.3 [Sliding Window Analysis], and App.4 [Machine Learning] consume the stream and write to S3, DynamoDB, Redshift, and EMR.)
Kinesis Architecture
(Diagram: millions of sources producing 100s of terabytes per hour send data to a front end that handles authentication and authorization; durable, highly consistent storage replicates data across three data centers (Availability Zones); the ordered stream of events supports multiple readers, which aggregate and archive to S3, drive real-time dashboards and alarms, run machine learning algorithms or sliding-window analytics, and feed aggregate analysis in Hadoop or a data warehouse.)
Inexpensive: $0.028 per million PUTs
Putting data into Kinesis: Managed Service for Ingesting Fast Moving Data
• Streams are made of Shards
– A Kinesis Stream is composed of multiple Shards
– Each Shard ingests up to 1 MB/sec of data, and up to 1,000 TPS
– Each Shard emits up to 2 MB/sec of data
– All data is stored for 24 hours
– You scale Kinesis streams by adding or removing Shards
• Simple PUT interface to store data in Kinesis
– Producers use a PUT call to store data in a Stream
– A Partition Key is used to distribute the PUTs across Shards
– A unique Sequence # is returned to the Producer upon a successful PUT call
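Those per-shard limits make first-cut capacity planning a small arithmetic exercise: take the maximum of the shard counts implied by ingress bandwidth, record rate, and egress bandwidth. A hypothetical helper, not part of any AWS SDK:

```python
import math

def shards_needed(ingress_mb_per_sec, records_per_sec, egress_mb_per_sec=0):
    """Minimum shard count from the documented per-shard limits:
    1 MB/s and 1,000 records/s in, 2 MB/s out per shard."""
    by_ingress = math.ceil(ingress_mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000.0)
    by_egress = math.ceil(egress_mb_per_sec / 2.0)
    return max(1, by_ingress, by_records, by_egress)

# 5 MB/s of ~2 KB records, read by two consumers (10 MB/s out)
print(shards_needed(5, 2500, 10))  # 5
```

Whichever limit binds first determines the shard count; here ingress and egress both demand 5 shards while the record rate alone would need only 3.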
(Diagram: many producers PUT records into a Kinesis stream composed of Shards 1 through n.)
POST / HTTP/1.1
Host: kinesis.<region>.<domain>
x-amz-Date: <Date>
Authorization: AWS4-HMAC-SHA256 Credential=<Credential>, SignedHeaders=content-type;date;host;user-agent;x-amz-date;x-amz-target;x-amzn-requestid, Signature=<Signature>
User-Agent: <UserAgentString>
Content-Type: application/x-amz-json-1.1
Content-Length: <PayloadSizeBytes>
Connection: Keep-Alive
X-Amz-Target: Kinesis_20131202.PutRecord
{
"StreamName": "exampleStreamName",
"Data": "XzxkYXRhPl8x",
"PartitionKey": "partitionKey"
}
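In practice you would send this request through an AWS SDK rather than hand-signing HTTP, but the JSON body itself is easy to assemble: the `Data` field is just the record payload, base64-encoded. A small sketch that reproduces the body shown above:

```python
import base64
import json

def build_put_record(stream_name, data_bytes, partition_key):
    """Assemble the PutRecord JSON body; the Data field must be base64-encoded."""
    return json.dumps({
        "StreamName": stream_name,
        "Data": base64.b64encode(data_bytes).decode("ascii"),
        "PartitionKey": partition_key,
    })

body = build_put_record("exampleStreamName", b"_<data>_1", "partitionKey")
print(body)  # the Data field encodes to "XzxkYXRhPl8x", as in the request above
```

The AWS SDKs take care of the base64 encoding and the Signature Version 4 signing shown in the headers for you.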
(Diagram: KCL Workers 1 through n, spread across EC2 instances, read from Shards 1 through n of a Kinesis stream.)
Building Kinesis Apps
Client library for fault-tolerant, at-least-once, real-time processing
• Key streaming application attributes:
• Be distributed, to handle multiple shards
• Be fault tolerant, to handle failures in hardware or software
• Scale up and down as the number of shards increases or decreases
• Kinesis Client Library (KCL) helps with distributed processing:
• Automatically starts a Kinesis Worker for each shard
• Simplifies reading from the stream by abstracting individual shards
• Increases/decreases Kinesis Workers as the # of shards changes
• Checkpoints to keep track of a Worker's location in the stream
• Restarts Workers if they fail
• Use the KCL with Auto Scaling Groups
• Auto Scaling policies will restart EC2 instances if they fail
• Automatically add EC2 instances when load increases
• KCL will redistribute Workers to use the new EC2 instances
OR
• Use the Get APIs for raw reads of Kinesis data streams
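The checkpointing idea is the heart of the KCL's fault tolerance: a worker records the sequence number of the last record it processed, so a replacement worker resumes from the checkpoint instead of the beginning. An in-memory simulation of that behavior (not the real KCL API, which keeps checkpoints in a DynamoDB table):

```python
checkpoints = {}   # shard_id -> last processed sequence number
seen = []          # records actually handled, for demonstration

def process_shard(shard_id, records, fail_at=None):
    """Process records in order, checkpointing each one; optionally 'crash'."""
    start = checkpoints.get(shard_id, -1) + 1   # resume just past the checkpoint
    for seq in range(start, len(records)):
        if seq == fail_at:
            return "crashed"                    # simulated worker failure
        seen.append(records[seq])
        checkpoints[shard_id] = seq             # checkpoint after processing
    return "done"

records = ["r0", "r1", "r2", "r3", "r4"]
process_shard("shard-1", records, fail_at=3)    # first worker dies mid-stream
process_shard("shard-1", records)               # replacement resumes at r3
print(seen)  # every record handled exactly once
```

Checkpointing after processing (rather than before) is what yields the at-least-once guarantee: a crash between processing and checkpointing replays that record.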
Amazon Kinesis: Key Developer Benefits
Easy administration: a managed service for real-time streaming data collection, processing, and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.
Real-time performance: perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.
High throughput, elastic: seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.
S3, EMR, Storm, Redshift, & DynamoDB integration: reliably collect, process, and transform all of your data in real time and deliver it to the AWS data stores of your choice, with connectors for S3, Redshift, and DynamoDB.
Build real-time applications: client libraries enable developers to design and operate real-time streaming data processing applications.
Low cost: cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only for what you use.
Customers using Amazon Kinesis
Mobile/Social Gaming:
• Deliver continuous, real-time game insight data from 100s of game servers
• Before: custom-built solutions that were operationally complex to manage and not scalable
• Pain points: delays in critical business data delivery; developer burden in building a reliable, scalable platform for real-time data ingestion/processing; slow-down of real-time customer insights
Digital Advertising Tech:
• Generate real-time metrics and KPIs on online ad performance for advertisers/publishers
• Before: a store-and-forward fleet of log servers and a Hadoop-based processing pipeline
• Pain points: lost data in the store/forward layer; operational burden in managing a reliable, scalable platform for real-time data ingestion/processing; batch-driven "real-time" customer insights
With Kinesis: accelerate time to market of elastic, real-time applications while minimizing operational overhead, and generate the freshest analytics on advertiser performance to optimize marketing spend and increase responsiveness to clients.
Digital Ad Tech Metering with Kinesis
• Continuous ad metrics extraction
• Incremental ad statistics computation
• Metering record archive
• Ad analytics dashboard
Collection of Data
Sources (web servers, application servers, connected devices, mobile phones, etc.) → aggregation tool (a scalable method to collect and aggregate: Flume, Kafka, Kinesis, or a queue) → data sink (a reliable and durable destination or destinations)
Cloud Database & Storage
Cloud Database and Storage Tier Anti-pattern
App/Web Tier
Client Tier
RDBMS
Database & Storage Tier = All-in-one?
Cloud Database and Storage Tier — Use the Right Tool for the Job!
App/Web Tier
Client Tier
Data Tier
Database & Storage Tier
Search
Hadoop/HDFS
Cache
Blob Store
SQL NoSQL
App/Web Tier
Client Tier
Database & Storage Tier
Amazon RDS
Amazon DynamoDB
Amazon ElastiCache
Amazon S3
Amazon Glacier
Amazon CloudSearch
HDFS on Amazon EMR
Cloud Database and Storage Tier — Use the Right Tool for the Job!
What Database and Storage Should I Use?
• Data structure
• Query complexity
• Data characteristics: hot, warm, cold
Data Structure and Query Types vs Storage Technology
• Structured, simple query: NoSQL (Amazon DynamoDB), cache (Amazon ElastiCache)
• Structured, complex query: SQL (Amazon RDS), search (Amazon CloudSearch)
• Unstructured, no query: cloud storage (Amazon S3, Amazon Glacier)
• Unstructured, custom query: Hadoop/HDFS (Amazon Elastic MapReduce)
(Chart axes: data structure complexity vs. query structure complexity)
What is the Temperature of Your Data?
Hot to cold:
• Request rate: high to low
• Cost/GB: high to low
• Latency: low to high
• Data volume: low to high
(Chart: Amazon ElastiCache and Amazon DynamoDB at the hot end; Amazon RDS, Amazon CloudSearch, and HDFS in the middle; Amazon S3 and Amazon Glacier at the cold end; structure ranges from high to low.)
What Data Store Should I Use?
(Columns: Amazon ElastiCache | Amazon DynamoDB | Amazon RDS | Amazon CloudSearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier)
Average latency: ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~size) | hrs
Data volume: GB | GB–TB (no limit) | GB–TB (3 TB max) | GB–TB | GB–PB (~nodes) | GB–PB (no limit) | GB–PB (no limit)
Item size: B–KB | KB (64 KB max) | KB (~row size) | KB (1 MB max) | MB–GB | KB–GB (5 TB max) | GB (40 TB max)
Request rate: Very high | Very high | High | High | Low–very high | Low–very high (no limit) | Very low (no limit)
Storage cost ($/GB/month): $$ | ¢¢ | ¢¢ | $ | ¢ | ¢ | ¢
Durability: Low–moderate | Very high | High | High | High | Very high | Very high
(Hot data on the left, warm data in the middle, cold data on the right.)
Decouple your storage and analysis engine:
1. Single version of truth
2. Choice of multiple analytics tools
3. Parallel execution by different teams
4. Lower cost
Learning from Netflix
S3 as a “single source of truth”
Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
(Diagram: choose the collection/storage layer depending upon design: Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, or Kinesis.)
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Process
• Answering questions about data
• Questions:
  – Analytics: think SQL/data warehouse
  – Classification: think sentiment analysis
  – Prediction: think page-view prediction
  – Etc.
Processing Frameworks
Generally come in a few major types:
• Batch processing
• Stream processing
• Interactive query
Batch Processing
• Take a large amount of cold data and ask questions
• Takes minutes or hours to get answers back
Example: generating hourly, daily, or weekly reports
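At its core, the hourly-report example above is a group-by over cold data. A toy batch job over parsed click-stream tuples, with plain Python standing in for what would typically be a Hive or Redshift aggregation:

```python
from collections import Counter
from datetime import datetime

# (timestamp, url) click-stream tuples, as a batch job would read them from S3
events = [
    ("2015-07-09T02:10:00", "/home"),
    ("2015-07-09T02:40:00", "/home"),
    ("2015-07-09T03:05:00", "/checkout"),
    ("2015-07-09T03:59:00", "/home"),
]

# Hourly report: page views per (hour, url) bucket
report = Counter(
    (datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:00"), url)
    for ts, url in events
)
for (hour, url), views in sorted(report.items()):
    print(hour, url, views)
```

The equivalent Hive query would be a `GROUP BY hour, url` over the same log table; the shape of the computation is identical.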
Stream Processing (AKA Real Time)
• Take a small amount of hot data and ask questions
• Takes a short amount of time to get your answer back
Example: 1-minute metrics
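A 1-minute metric can be computed by folding each event into its minute bucket as it arrives, so a window's answer is ready moments after the window closes. A minimal tumbling-window sketch (pure Python, standing in for a Kinesis consumer or Storm bolt):

```python
from collections import defaultdict

# Tumbling 1-minute windows over a stream of (epoch_seconds, count) events:
# each event is folded into its minute bucket on arrival.
windows = defaultdict(int)  # minute index -> count

def on_event(epoch_seconds, value=1):
    windows[epoch_seconds // 60] += value

for t in [0, 10, 59, 60, 61, 125]:   # simulated arrival times (seconds)
    on_event(t)

print(dict(windows))  # {0: 3, 1: 2, 2: 1}
```

This is the batch group-by turned inside out: instead of scanning all the data at report time, the aggregate is maintained incrementally as events stream in.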
Processing Tools
• Batch processing/analytics
  – Amazon Redshift
  – Amazon EMR (Hive/Tez, Pig, Spark, Impala, Presto, …)
• Stream processing
  – Apache Spark Streaming
  – Apache Storm (+ Trident)
  – Amazon Kinesis client and connector library
Amplab Big Data Benchmark: scan query, aggregate query, join query
https://amplab.cs.berkeley.edu/benchmark/
What Batch Processing Technology Should I Use?
(Columns: Redshift | Impala | Presto | Spark | Hive)
Query latency: Low | Low | Low | Low–medium | Medium–high
Durability: High | High | High | High | High
Data volume: 1.6 PB max | ~nodes | ~nodes | ~nodes | ~nodes
Managed: Yes | EMR bootstrap | EMR bootstrap | EMR bootstrap | Yes (EMR)
Storage: Native | HDFS | HDFS/S3 | HDFS/S3 | HDFS/S3
# of BI tools: High | Medium | High | Low | High
(Query latency: low is better)
What Stream Processing Technology Should I Use?
(Columns: Spark Streaming | Apache Storm + Trident | Kinesis Client Library)
Scale/throughput: ~nodes | ~nodes | ~nodes
Data volume: ~nodes | ~nodes | ~nodes
Manageability: Yes (EMR bootstrap) | Do it yourself | EC2 + Auto Scaling
Fault tolerance: Built-in | Built-in | KCL checkpointing
Programming languages: Java, Python, Scala | Java, Scala, Clojure | Java, Python
Hadoop-based Analysis
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR)
Your choice of tools on Hadoop/EMR
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR)
Hadoop-based Analysis
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR, running tools such as Spark and Shark or Cloudera Impala)
Hadoop is good for:
1. Ad hoc query analysis
2. Large unstructured data sets
3. Machine learning and advanced analytics
4. Schema-less data
SQL-based Low-Latency Analytics on Structured Data
SQL-based processing
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon Redshift, a petabyte-scale columnar data warehouse)
SQL-based processing for unstructured data
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR as a pre-processing framework → Amazon Redshift, a petabyte-scale columnar data warehouse)
Your choice of BI tools on the cloud
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR (pre-processing framework) → Amazon Redshift)
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Collaboration and Sharing Insights
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift)
Sharing results and visualizations
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift → web app server and visualization tools)
Sharing results and visualizations at scale
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift → web app servers and visualization tools, scaled out)
Sharing results and visualizations
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift → business intelligence tools)
Geospatial Visualizations
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR (with GIS tools on Hadoop) and Amazon Redshift → business intelligence tools, GIS tools, and visualization tools)
Rinse and Repeat
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift → visualization tools, business intelligence tools, GIS tools on Hadoop, and GIS tools, orchestrated by AWS Data Pipeline)
The complete architecture
(Pipeline: log aggregation tools / Amazon SQS → Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift → visualization tools, business intelligence tools, GIS tools on Hadoop, and GIS tools, orchestrated by AWS Data Pipeline)
Reference: BDT403 Next Generation Big Data Platform @ Netflix
Architecture
Big Data
• 10+ PB DW on S3
• 1.2 PB read daily
• 100 TB written daily
• ~200 billion events daily
(Diagram: cloud apps send event data through Suro and Ursula to Amazon S3 every 15 min; Cassandra SSTables are processed by Aegisthus into dimension data on Amazon S3 daily; data pipelines run on top.)
@2013: Amazon S3 at the center, surrounded by storage, compute, service, and tools layers
@2014: Amazon S3 "v2.0", with the same storage, compute, service, and tools layers
Online Self-Study and Labs
A variety of online course materials and hands-on labs help you learn the basics of AWS and how to put it to use.
Instructor-Led Training
Learn how to build highly available, cost-efficient, and secure applications on the AWS cloud in classes led by expert AWS instructors. A range of in-person courses on architecture design and implementation is available.
AWS Certification
Certification exams let you validate your cloud expertise and experience and demonstrate your professional credentials.
http://aws.amazon.com/ko/training
A Variety of Training Programs
Thank you for joining the AWS Fundamentals Webinar Series! We hope this webinar helped answer your questions. Please share your feedback on today's webinar in the survey that follows.
[email protected]
http://twitter.com/AWSKorea
http://facebook.com/AmazonWebServices.ko
http://youtube.com/user/AWSKorea
http://slideshare.net/AWSKorea