AWS re:Invent re:cap - Cost Optimization - Best Practices and Architecture Design Deep Dive - Wonil Lee
TRANSCRIPT
Whether you're a startup getting to profitability or an enterprise
optimizing spend, it pays to run cost-efficient architectures on AWS.
Building on last year's popular foundation of how to reduce waste
and fine-tune your AWS spending, this session reviews a wide range
of cost planning, monitoring, and optimization strategies, featuring
real-world experience from AWS customer Adobe Systems. With the
massive growth of subscribers to Adobe's Creative Cloud, Adobe's
footprint in AWS continues to expand. We will discuss the techniques
used to optimize and manage costs, while maximizing performance
and improving resiliency.
When traditional application and operating practices are used in
cloud deployments, immediate benefits occur in speed of
deployment, automation, and transparency of costs. The next step is
a re-architecture of the application to be cloud-native, and significant
operating cost reductions can help justify this development work.
Cloud-native applications are dynamic and use ephemeral resources
that customers are only charged for when the resources are in use.
With AWS, you can reduce capital costs, lower your overall bill, and
match your expense to your usage. This session describes how to
calculate the total cost of ownership (TCO) for deploying solutions on
AWS vs. on-premises or at a colocation facility, as well as how to
address common pitfalls in building a TCO analysis. The session
presents and models customer examples.
This session is a deep dive into techniques used by successful
customers who optimized their use of AWS. Learn tricks and hear
tips you can implement right away to reduce waste, choose the most
efficient instance, and fine-tune your spending, often with improved
performance and a better end-customer experience. We showcase
innovative approaches and demonstrate easily applicable methods to
save you time and money with Amazon EC2, Amazon S3, and a host
of other services.
In this session, you learn how you can leverage AWS services
together with third-party storage appliances and gateways to
automate your backup and recovery processes so that they are not
only less complex and lightweight, but also easy to manage and
maintain. We demonstrate how to manage data flow from on-
premises systems to the cloud and how to leverage storage
gateways. You also learn best practices for quick implementation,
reducing TCO, and automating lifecycle management.
In the event of a disaster, you need to be able to recover lost data
quickly to ensure business continuity. For critical applications,
keeping your time to recover and data loss to a minimum as well as
optimizing your overall capital expense can be challenging. This
session presents AWS features and services along with Disaster
Recovery architectures that you can leverage when building highly
available and disaster resilient applications. We will provide
recommendations on how to improve your Disaster Recovery plan
and discuss example scenarios showing how to recover from a
disaster.
• Pay as you go, no up-front investments
• Low ongoing cost
• Flexible capacity
• Speed, agility, and innovation
• Focus on your business
• Go global in minutes
Strategy 1:
Do nothing
[Diagram: the AWS virtuous cycle — more customers drive more AWS usage, which requires more infrastructure; infrastructure innovation and economies of scale lower infrastructure costs, enabling reduced prices (45 price reductions since 2006); lower prices, the growing ecosystem, global footprint, new features, and new services attract more customers]
Strategy 2:
Do almost nothing
aws.amazon.com/premiumsupport/trustedadvisor/
Free with Business or Enterprise Support
Strategy 3:
Optimize Architecture
Cloud-Ready
• Run AWS like a virtual colocation (fork-lift)
• Does not optimize for on-demand (overprovisioned)
• EC2, EBS
• HAProxy on EC2
• MySQL on EC2
• Cassandra, Hadoop on EC2
• ActiveMQ/Redis/Kafka on EC2
• Chef on EC2

Cloud-Aware
• Minor modifications to improve cloud usage
• Automating servers can lower operational burden
• EC2, EBS, S3, CloudFront
• ELB, Route 53 (round-robin)
• Multi-AZ RDS + read replica
• ElastiCache Redis
• OpsWorks

Cloud-Native
• Redesign with AWS in mind (high effort)
• Embrace scalable services (reduce admin)
• Auto Scaling, self-healing
• Route 53 (LBR)
• RDS Aurora, Redshift
• DynamoDB, EMR
• SQS, SNS, Kinesis
• CloudFormation, Elastic Beanstalk

(Compared along three dimensions: development cost, scalability/availability, and management cost)
• Developer, test, and training instances
• Use simple instance start and stop
• Or tear down and build up all together
• Instances are disposable
• Automate, automate, automate:
  – AWS CloudFormation
  – Weekend/off-hours scripts
  – Use tags
[Chart: instance counts dropping over weekends (Monday–Friday cycle) and after the end of vacation season – 35% saved]
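Savings in the 35% range are plausible from simple arithmetic. A back-of-the-envelope sketch (the 12-hour weekday schedule here is an assumed example, not a figure from the session):

```python
# Back-of-the-envelope savings from stopping dev/test instances off-hours.
# Assumed schedule: instances are needed only 12 hours/day on weekdays.
hours_per_week = 24 * 7          # 168 instance-hours if left running
on_hours = 12 * 5                # 60 instance-hours actually needed

savings = 1 - on_hours / hours_per_week
print(f"off-hours scheduling saves {savings:.0%}")  # prints "off-hours scheduling saves 64%"
```

Even a looser schedule (stopping only over weekends) removes 2 of 7 days, about 29% of the bill for those instances.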
Automatic resizing of compute clusters based on demand (a CloudWatch alarm triggers the Auto Scaling policy)
Feature                  Details
Control                  Define minimum and maximum instance pool sizes and when scaling and cool-down occur.
CloudWatch integration   Use metrics gathered by Amazon CloudWatch to drive scaling.
Instance types           Run Auto Scaling with On-Demand and Spot Instances. Compatible with VPC.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name MyGroup \
  --launch-configuration-name MyConfig \
  --min-size 4 \
  --max-size 200 \
  --availability-zones us-west-2c
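The CloudWatch-driven scaling above boils down to a proportional rule: keep per-instance load near a target. The sketch below illustrates that idea only; it is not AWS's actual algorithm, and the function name is ours:

```python
import math

def desired_capacity(current, metric, target, min_size=4, max_size=200):
    """Proportional scaling: size the group so per-instance load nears the target.

    A sketch of the idea behind CloudWatch-driven scaling policies,
    not AWS's exact implementation.
    """
    wanted = math.ceil(current * metric / target)
    return max(min_size, min(max_size, wanted))  # clamp to pool bounds

# 10 instances averaging 80% CPU against a 50% target -> scale out to 16
print(desired_capacity(current=10, metric=80, target=50))  # prints 16
```

The min/max clamp mirrors the `--min-size`/`--max-size` bounds of the Auto Scaling group.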
• Cloud capacity used is maybe half of average data center capacity
• Mad scramble to add more DC capacity during launch-phase outages
• Capacity wasted on a failed launch magnifies the losses
Start
• Choose an instance that best meets your basic requirements
• Start with memory, then choose the closest number of virtual cores
• Look for peak IOPS storage requirements
Tune
• Change instance size up or down based upon monitoring
• Use CloudWatch & Trusted Advisor to assess
Roll-Out
• Run multiple instances in multiple Availability Zones
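The "start with memory, then choose the closest vCPU count" step can be expressed as a tiny selection function. The candidate list below is illustrative only, not a current AWS instance catalog:

```python
# The "memory first, then closest vCPU count" selection heuristic.
# Candidate list is illustrative, not a current AWS instance catalog.
CANDIDATES = [
    # (name, vCPU, RAM in GiB)
    ("m3.medium", 1, 3.75),
    ("m3.large", 2, 7.5),
    ("m3.xlarge", 4, 15),
    ("m3.2xlarge", 8, 30),
]

def pick_instance(need_ram_gib, want_vcpu):
    fits = [c for c in CANDIDATES if c[2] >= need_ram_gib]    # memory first
    return min(fits, key=lambda c: abs(c[1] - want_vcpu))[0]  # closest vCPUs

print(pick_instance(need_ram_gib=10, want_vcpu=2))  # prints "m3.xlarge"
```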
[Chart: On-Demand prices (N. Virginia region) plotted by vCPU (x-axis, 0–30) vs. RAM in GiB (y-axis, 0–300), with hourly prices labeling each point; general purpose, compute-optimized, memory-optimized, and storage-optimized instances shown; only latest-generation families (M3, C3) shown where applicable; GPU and micro instances not shown]
More small instances vs. fewer large instances
• 29 m3.xlarge = 29 x $0.280/hour = $8.12/hour
• 69 m3.medium = 69 x $0.070/hour = $4.83/hour
• 40% savings
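Re-deriving the comparison above from the listed hourly rates:

```python
# Re-deriving the small-vs-large comparison on the slide.
cost_large = 29 * 0.280   # 29 m3.xlarge at $0.280/hour
cost_small = 69 * 0.070   # 69 m3.medium at $0.070/hour
savings = 1 - cost_small / cost_large
print(f"${cost_large:.2f}/h vs ${cost_small:.2f}/h -> {savings:.1%} saved")
```

The raw figure is 40.5%, which the slide rounds to 40%.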
[Chart: weekly CPU load of web servers over ~50 weeks; moving to load-based scaling, scaling up/down by 70%+ with demand, yielded 50% savings]
Reactive Auto Scaling saves around 50%
[Chart: server count tracking request volume through the day; 50% savings]
Auto Scaling in the Amazon Cloud
http://techblog.netflix.com/2012/01/auto-scaling-in-amazon-cloud.html
Predictive Auto Scaling saves around 70%
[Diagram: load prediction feeds the Auto Scaling plan]
Scryer: Netflix's Predictive Auto Scaling Engine
http://goo.gl/iFefxJ
[Chart: cumulative cost of On-Demand vs. Reserved Instances over time, marking the 1-year RI and 3-year RI break-even points]
• No Upfront
You pay nothing upfront but commit to pay for the Reserved Instance over the
course of the Reserved Instance term, with discounts (typically about 30%)
when compared to On-Demand. This option is offered with a one year term
• Partial Upfront
You pay for a portion of the Reserved Instance upfront, and then pay for the
remainder over the course of the one or three year term. This option balances
the RI payments between upfront and hourly.
• All Upfront
You pay for the entire Reserved Instance term (one or three years) with one
upfront payment and get the best effective hourly price when compared to
On-Demand.
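The break-even point is simply where cumulative On-Demand spend overtakes the upfront payment. A quick sketch (the prices below are illustrative assumptions, not actual AWS rates):

```python
# Which month does an All Upfront RI overtake On-Demand?
# Upfront and On-Demand prices are illustrative assumptions.
HOURS_PER_MONTH = 730

def breakeven_month(upfront, on_demand_rate):
    month = 0
    while month * HOURS_PER_MONTH * on_demand_rate < upfront:
        month += 1
    return month

# $1,200 upfront vs $0.280/hour On-Demand
print(breakeven_month(1200.0, 0.280))  # prints 6
```

Any usage beyond the break-even month is where the RI discount accrues; steady 24x7 workloads reach it well within a one-year term.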
[Chart: example RI savings vs. On-Demand across instance types and terms, ranging from 39% to 77%]
• Can be moved between AZs
• Can be moved between
EC2-Classic and EC2-VPC platforms
• Size can be modified within the
same instance family
• Price based on supply/demand
• You choose your maximum price/hour
• Your instance is started if the Spot price is lower
• Your instance is terminated if
the Spot price is higher
• But: You did plan for fault tolerance, didn’t you?
[Chart: Spot price history vs. On-Demand at $0.24/hour; Spot prices around $0.028 (11.7% of On-Demand) and $0.026 (10.8%) – roughly 90% savings]
• Very dynamic pricing
• Opportunity to save 80–90% of cost
  – But there are risks
• Different prices per AZ
• Leverage Auto Scaling!
  – One group with Spot Instances
  – One group with On-Demand
  – Get the best of both worlds
• Coming soon: 2-minute Spot interruption warnings
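The two-group pattern can be costed out quickly. The fleet size and prices below are assumptions for illustration (the Spot price echoes the ~90% discount example above):

```python
# Blended hourly cost of the two-Auto-Scaling-group pattern: a reliable
# On-Demand baseline plus a Spot group for the rest of the fleet.
# Prices and fleet size are illustrative assumptions.
on_demand_rate, spot_rate = 0.24, 0.026
baseline, fleet = 10, 40                 # 10 On-Demand + 30 Spot

blended = baseline * on_demand_rate + (fleet - baseline) * spot_rate
all_on_demand = fleet * on_demand_rate
print(f"{1 - blended / all_on_demand:.0%} cheaper than all On-Demand")
```

If the Spot group is interrupted, the On-Demand baseline keeps the service up at reduced capacity, which is the fault tolerance the slide asks you to plan for.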
• Reduced Redundancy storage class
  – 99.99% durability vs. 99.999999999%
  – Up to 20% savings
  – For everything that is easy to reproduce
  – Use Amazon SNS lost-object notifications
• Amazon Glacier storage class
  – Same 99.999999999% durability
  – 3 to 5 hours restore time
  – Up to 64% savings
  – For archiving, long-term backups, and old data
• Use lifecycle rules
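A lifecycle rule along those lines might look as follows. The prefix and day counts are hypothetical choices; the dict follows the shape boto3's `put_bucket_lifecycle_configuration` expects:

```python
# Sketch of an S3 lifecycle configuration: transition objects under a
# prefix to Glacier after 90 days and expire them after 3 years.
# Prefix and day counts are hypothetical examples.
lifecycle = {
    "Rules": [{
        "ID": "archive-old-logs",
        "Filter": {"Prefix": "logs/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 1095},
    }]
}

print(lifecycle["Rules"][0]["Transitions"][0]["StorageClass"])  # prints GLACIER
```

With boto3 this would be applied via `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle)`.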
• Read/write capacity units (CUs) determine most of DynamoDB cost
• By optimizing CUs, you can save a lot of money
• But:
  – Need to provision enough capacity to not run into capacity errors
  – Need to prepare for peaks
  – Need to constantly monitor/adjust
• Use caching to save read capacity units
  – Local RAM caches at app server instances
  – Check out Amazon ElastiCache
• Think of strategies for optimizing CU use
  – Use multiple tables to support varied access patterns
  – Understand access patterns for time-series data
  – Compress large attribute values
• Use Amazon SQS to buffer over-capacity writes
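The local-RAM-cache idea is a read-through cache: every hit is a read capacity unit not consumed. A minimal sketch, where `fetch_from_table` is a hypothetical stand-in for a real DynamoDB GetItem call:

```python
# Read-through cache in front of DynamoDB reads. `fetch_from_table` is a
# hypothetical stand-in for a real GetItem call; each miss consumes a
# read capacity unit, each hit does not.
cache = {}
cu_used = 0

def fetch_from_table(key):
    global cu_used
    cu_used += 1                 # a miss consumes read capacity
    return f"value-for-{key}"

def cached_get(key):
    if key not in cache:
        cache[key] = fetch_from_table(key)
    return cache[key]

for k in ["a", "b", "a", "a", "b"]:   # 5 reads over 2 unique keys
    cached_get(k)
print(f"read capacity consumed: {cu_used} of 5")  # prints "read capacity consumed: 2 of 5"
```

Real deployments would add TTL-based expiry (or use ElastiCache) so stale entries are eventually refreshed.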
[Chart: DynamoDB provisioned capacity over time; caching/optimization at the EC2 app tier saved 80%, and Dynamic DynamoDB saved a further 20%, despite cache flushes, growth, and new features]
• The more you can offload, the less infrastructure you need to maintain, scale, and pay for
• Three easy ways to offload:
  – Use Amazon CloudFront
  – Introduce caching
  – Leverage existing Amazon web services
• Amazon RDS, Amazon DynamoDB, Amazon ElastiCache for Redis, or Amazon Redshift
  – Instead of running your own database
• Amazon CloudSearch
  – Instead of running your own search engine
• Amazon Elastic Transcoder
• Amazon Elastic MapReduce
• Amazon Cognito, Amazon SQS, Amazon SNS, Amazon Simple Workflow Service, Amazon SES, Amazon Kinesis, and more …
November 14, 2014 | Las Vegas
Adrian Cockcroft @adrianco, Battery Ventures
Data Center Up-Front Costs
[Timeline: ages ago – lease building, install AC etc., rack and stack, private cloud SW; only now and next month – run my stuff; the bill arrives long before the workload does]
[Chart: illustrative monthly cost over three years (months 1–12, repeated three times); with prices halving every 18 months, maybe 40% overall savings. Data shown is purely illustrative]
Older m1/m2 families
• Slower CPUs
• Higher response times
• Smaller caches (6 MB)
• Oldest m1.xlarge
  – 15 GB / 8.5 ECU / 35c = 23 ECU/$
• Old m2.xlarge
  – 17 GB / 6.5 ECU / 25c = 26 ECU/$

New m3 family
• Faster CPUs
• Lower response times
• Larger caches (20 MB)
• Java perf ratio > ECU
• New m3.xlarge
  – 15 GB / 13 ECU / 28c = 46 ECU/$
• 77% better ECU/$
• Deploy fewer instances
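Recomputing ECU per dollar from the slide's raw numbers confirms the picture, with slightly different rounding: the raw math gives 24 ECU/$ for m1.xlarge and a ~79% m3-over-m2 gain, versus the slide's rounded 23 and 77%:

```python
# ECU per dollar for the instance generations above, from the slide's
# listed ECU counts and hourly prices.
families = {
    "m1.xlarge": (8.5, 0.35),   # (ECU, $/hour)
    "m2.xlarge": (6.5, 0.25),
    "m3.xlarge": (13.0, 0.28),
}
ecu_per_dollar = {name: ecu / price for name, (ecu, price) in families.items()}
for name, value in sorted(ecu_per_dollar.items()):
    print(f"{name}: {value:.0f} ECU/$")

gain = ecu_per_dollar["m3.xlarge"] / ecu_per_dollar["m2.xlarge"] - 1
print(f"m3.xlarge vs m2.xlarge: {gain:.0%} better")
```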
Combinations
[Charts: cost remaining after each successive optimization step, indexed to a base price of 100. Steps: Base Price, Rightsized, Seasonal, Daily Scaling, Reserved, Tech Refresh, Price Cuts]

Traditional application using AWS heavy-use reservations (base price is for capacity bought up-front):
100 → 70 → 70 → 70 → 30 → 30 → 25

Cloud-native application, partially optimized, light-use reservations:
100 → 70 → 50 → 35 → 25 → 20 → 15

Cloud-native application, fully optimized, autoscaling, mixed reservation use:
100 → 50 → 25 → 12 → 8 → 6 → 4 (costs 4% of base price over three years!)
• Business logic isolation in stateless micro-services
• Immutable code with instant rollback
• Autoscaled capacity and deployment updates
• Distributed across availability zones and regions
• De-normalized single function NoSQL data stores
• See over 40 NetflixOSS projects at netflix.github.com
• Get “technical indigestion” trying to keep up with
techblog.netflix.com
AdRoll, an online advertising platform, serves 50 billion impressions a day worldwide with its global retargeting platforms.

"We spend more on snacks than we do on Amazon DynamoDB."
Valentino Volonghi, CTO, AdRoll

• Needed a high-performance, flexible platform to swiftly sync data for a worldwide audience
• Processes 50 TB of data a day
• Serves 50 billion impressions a day
• Stores 1.5 PB of data
• Worldwide deployment minimizes latency

AdRoll Uses AWS to Grow by More Than 15,000% in a Year
• Handle 150 TB/day
• <5 ms response time
• 1,000,000+ global requests/second
• 100B items
• Memcache
  ✓ Open source
  ✓ Mature
  ✓ Blazingly fast
  ✗ No strong guarantees
• Redis
  ✓ Open source
  ✗ Storage scale
  ✗ Not really distributed
  ✗ Operationally intense
• HBase (we still use this)
  ✓ Open source
  ✓ Maturing quickly
  ✓ Great scale
  ✗ Really hard to operate
• Revisiting 1 million writes per second (Netflix)
  http://techblog.netflix.com/2014/07/revisiting-1-million-writes-per-second.html
• Mix is 10% writes / 90% reads; 1M ops/sec is total capacity.

                       Cassandra                  DynamoDB                  Delta
10/90 mix, $/month     $287,064                   $131,040                  219%
50/50 mix, $/month     $287,064                   $280,800                  ~0%
10/90, 3-yr reserved   $27,075.6 ($904k upfront)  $15,736 ($504k upfront)   180%

• 10-person Cassandra ops team: $150k/month (fully loaded)
• DynamoDB ops team: none, $0
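Folding the ops-team line items into the 10/90 comparison shows that infrastructure cost alone understates the gap:

```python
# All-in monthly cost of the 10/90 workload, adding the ops-team line
# items from the slide to the infrastructure bills.
cassandra = 287_064 + 150_000   # infra + 10-person ops team, fully loaded
dynamodb = 131_040 + 0          # infra + no ops team

print(f"Cassandra ${cassandra:,}/mo vs DynamoDB ${dynamodb:,}/mo")
print(f"DynamoDB is {1 - dynamodb / cassandra:.0%} cheaper all-in")
```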
Data Collection = Batch Layer; Bidding = Speed Layer
[Architecture diagrams: data collection, data storage, global distribution, bid storage, and bidding. In each region (US East, US West, EU West), Elastic Load Balancing fronts Auto Scaling groups of EC2 instances spread across two Availability Zones. Collected data lands in Amazon S3 and Amazon Kinesis; Apache Storm consumes the streams and distributes bid data globally into DynamoDB tables, which the bidding fleets read. Versioned Auto Scaling groups (v1, v2, v3, …) behind Elastic Load Balancing enable rolling deployment updates.]
• Amazon EC2, Elastic Load Balancing, Auto Scaling
Store
• Amazon S3 + Amazon Kinesis
Global Distribution
• Apache Storm on Amazon EC2
Bid Store • DynamoDB
Bidding
• Amazon EC2, Elastic Load Balancing, Auto Scaling