AWS re:Invent - Optimizing Costs with AWS
Post on 11-May-2015
Optimizing Costs with AWS
Coburn Watson, Manager - Cloud Performance
Netflix Inc.
With more than 30 million streaming members in the United States, Canada, Latin America, the United Kingdom, Ireland and the
Nordics, Netflix, Inc. is the world's leading internet subscription service for enjoying movies and TV programs.
Source: http://ir.netflix.com
Agenda
• Rationale and High-level Methodology
• AWS resource-specific optimizations
• Performance Testing
• Results
• Q&A
Rationale and High-level Methodology
Rationale
• Applications operate at massive scale
  • Across three regions and multiple zones per region
• Service oriented architecture
  • Many moving parts (teams)
• Unconstrained deployment capabilities
  • “Freedom and Responsibility” culture
Rationale, cont.
• Improve availability
  • Avoid saturation of key resources
  • Dynamically adjust capacity to meet workload demands
• Plan for increasing workloads
  • Less focus on reducing current demand
• Maximize efficiency
  • Balance OLTP and batch demands
• “That which is measured improves”
Deployment Example
• Asgard framework enables turnkey deployment (Netflix open-sourced)
• All engineers have full access
• Real-time reservation capacity
• Unconstrained ASG size limits
Methodology
• Manual
  • Weekly usage review; leverage Netflix “AWS usage” tool
    • Identify unexpected on-demand trends
    • Review reservation use efficiency
    • Trend “cost per key event” (e.g. cost per stream event)
  • As-needed
    • Evaluate utilization and autoscaling efficiency for key services
• Automated
  • Weekly email to service teams with AWS usage trend (EC2, S3, SimpleDB)
  • Available reservations exposed in real time to engineers
  • Janitor Monkey
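The "cost per key event" trend from the manual review can be illustrated with a few lines of arithmetic. This is a minimal sketch, not the Netflix tooling: the function names and the weekly figures are made up for illustration, and it assumes every week has a non-zero event count.

```python
from typing import List, Tuple


def cost_per_event(weekly: List[Tuple[float, int]]) -> List[float]:
    """Cost per key event for each week.

    Each entry is (total_cost_usd, event_count). Assumes event_count > 0.
    """
    return [cost / events for cost, events in weekly]


def week_over_week_change(series: List[float]) -> List[float]:
    """Fractional change between consecutive weeks (0.10 means +10%)."""
    return [(cur - prev) / prev for prev, cur in zip(series, series[1:])]


# Hypothetical data: cost rises 10%, then traffic doubles at flat cost.
weeks = [(1000.0, 1_000_000), (1100.0, 1_000_000), (1100.0, 2_000_000)]
unit_costs = cost_per_event(weeks)
changes = week_over_week_change(unit_costs)
```

Trending the unit cost rather than the raw bill separates growth in demand from loss of efficiency: in the made-up data the bill never drops, yet the last week is the most efficient.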
AWS usage tool
• Pulls cost and usage information from AWS APIs
• Bird's-eye view of usage
• Near real-time data
• Open sourcing plans for tool
• Decomposes by application
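Decomposing usage by application amounts to grouping usage records by an application label and summing cost. The sketch below assumes a hypothetical record shape (`application`, `hours`, `hourly_rate`); it does not reflect the actual tool's data model or any AWS API response.

```python
from collections import defaultdict
from typing import Dict, List


def cost_by_application(records: List[dict]) -> Dict[str, float]:
    """Sum instance-hour cost per application label."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec["application"]] += rec["hours"] * rec["hourly_rate"]
    return dict(totals)


# Hypothetical usage records.
records = [
    {"application": "api", "hours": 10, "hourly_rate": 0.5},
    {"application": "api", "hours": 4, "hourly_rate": 0.5},
    {"application": "batch", "hours": 8, "hourly_rate": 0.25},
]
totals = cost_by_application(records)
```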
AWS usage automated email reports
• Weekly email to teams with 4-week cost trend on EC2, S3, and SimpleDB
Janitor Monkey
• Fully Automated
• Seeks to reduce “unintentional” resource usage due to failed cleanup
• Cleans up the following resources
  • EC2 instances
  • EBS volumes
  • EBS snapshots
  • Launch Configurations*
  • Autoscaling Groups*
  • Security Groups*
• Reduces cost and clutter(*)
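A cleanup pass of this kind can be thought of as a rule applied to an inventory of resources. The sketch below is an illustration only: the `Volume` fields and the 30-day retention rule are assumptions, not Janitor Monkey's actual rules, and it only selects candidates rather than deleting anything.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Volume:
    """Hypothetical inventory record for an EBS volume."""
    volume_id: str
    attached: bool
    last_detached: datetime


def cleanup_candidates(volumes: List[Volume], now: datetime,
                       retention: timedelta = timedelta(days=30)) -> List[str]:
    """Return IDs of volumes that are detached and older than the retention window."""
    return [
        v.volume_id
        for v in volumes
        if not v.attached and now - v.last_detached > retention
    ]


now = datetime(2012, 11, 28)
inventory = [
    Volume("vol-a", attached=True, last_detached=datetime(2012, 1, 1)),
    Volume("vol-b", attached=False, last_detached=datetime(2012, 10, 1)),
    Volume("vol-c", attached=False, last_detached=datetime(2012, 11, 20)),
]
candidates = cleanup_candidates(inventory, now)
```

Keeping selection separate from deletion makes it possible to notify owners before anything is removed.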
AWS resource-specific optimizations:
EC2: Primary optimization goals
• Align services to relatively few instance categories
  • Fewer, larger pools to work with
  • Common classes (e.g. m2.*)
• Autoscale, autoscale, autoscale
  • Identify workload components which can utilize excess reservation capacity
• Increase per-instance utilization (CPU, IO, Memory)
• Minimize duration of ASG “overlap” during code pushes
EC2: Autoscaling - Benefits
• Improved efficiency and availability
  • Avoids setting fixed ASG “max” instance count arbitrarily high
• Optimize resource allocation for mixed workloads
  • Batch activity can consume unused capacity during OLTP off-peak periods
• Insulate services from unexpected bursts in demand
  • “Super Bowl” Effect
  • Chained services that “scale together stay together”
EC2: Autoscaling - Challenges
• Effectively consuming the unused reservation capacity provided through autoscaling
  • Problem compounded: Large services often scale up or down on the same schedule
[Figure: Unused Reservation Instance Hours* over time, 7/3/12 15:59 to 7/7/12 2:00; y-axis 0 to 2,000; annotation: “Need to use this capacity”]
* - fictitious volumes
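The quantity plotted here, unused reservation instance hours, is the gap between reserved capacity and what is actually running in each hour. A minimal sketch with made-up numbers (the reservation size and hourly counts are hypothetical):

```python
from typing import List


def unused_reservation_hours(reserved: int, running_per_hour: List[int]) -> List[int]:
    """Unused reserved instances for each hour; never negative.

    Hours where running > reserved spill into on-demand and leave nothing unused.
    """
    return [max(reserved - running, 0) for running in running_per_hour]


# Hypothetical hourly instance counts for one autoscaled service.
running = [1800, 1200, 600, 900, 2100]
unused = unused_reservation_hours(2000, running)
total_unused = sum(unused)
```

The off-peak hours with the largest gap are where batch work can be scheduled at no extra reservation cost.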
EC2: Autoscaling Methodology
• Prioritize service migration to autoscaling
  • Start with large services
  • Work downstream to dependent services
• Identify metric to leverage for scaling alarm
  • Rate-based (requests per second), or Load-based (load average)
  • More aggressive scale-up versus scale-down
  • Netflix internal metrics published directly to CloudWatch with Servo *
* Netflix OSS library
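"More aggressive scale-up versus scale-down" can be pictured as an asymmetric step rule on a rate-based metric. The thresholds and percentages below are illustrative assumptions, not values from the talk, and the function is a standalone model rather than an AWS scaling policy definition.

```python
import math


def next_capacity(current: int, rps_per_instance: float,
                  up_threshold: float = 100.0, down_threshold: float = 40.0,
                  up_pct: int = 20, down_pct: int = 5) -> int:
    """Asymmetric step scaling: add capacity quickly, remove it slowly.

    All thresholds and percentages are hypothetical defaults.
    """
    if rps_per_instance > up_threshold:
        # Scale up by up_pct of the group, at least one instance.
        return current + max(1, math.ceil(current * up_pct / 100))
    if rps_per_instance < down_threshold:
        # Scale down by down_pct, at least one instance, never below one.
        return max(1, current - max(1, math.floor(current * down_pct / 100)))
    return current
```

For a 100-instance group this adds 20 instances when overloaded but removes only 5 when underloaded, which biases the group toward availability.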
EC2: Autoscaling Methodology, cont.
• Validate with load tests
  • Avoid double-jump or thrashing conditions
  • Variable instance startup times can result in double-jump
• Autoscaling batch applications
  • Leverage “scheduled actions”
  • Maximize consumption of spare reservation capacity
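One way to see how startup time leads to a double-jump: if the alarm keeps firing while new instances are still booting, a short cooldown lets the group scale up again before the first batch has taken any load. The simulation below is a simplified model under assumed numbers (alarm evaluated every 60 seconds, roughly 300 seconds for instances to start serving); it is not how the talk's services were configured.

```python
from typing import List, Optional


def scale_up_times(alarm_times: List[int], cooldown_s: int) -> List[int]:
    """Return the alarm times that actually trigger a scale-up.

    An alarm is ignored if it arrives within cooldown_s of the last scale-up.
    """
    fired: List[int] = []
    last: Optional[int] = None
    for t in alarm_times:
        if last is None or t - last >= cooldown_s:
            fired.append(t)
            last = t
    return fired


# Alarm stays in breach for five minutes while new instances boot.
alarms = [0, 60, 120, 180, 240, 300]
short_cooldown = scale_up_times(alarms, cooldown_s=60)
long_cooldown = scale_up_times(alarms, cooldown_s=300)
```

With the short cooldown every evaluation adds capacity; with a cooldown that covers startup time the group scales once and then re-evaluates.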
EC2: Simplify Autoscaling Configuration
• Expose Autoscaling capabilities through Asgard
  • Scaling policies and scheduled actions:
EC2: Autoscaling profile examples
[Figure: three example profiles, Healthy, Thrashing and Double-Jump; y-axis = number of instances in ASG]
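A rough way to tell a thrashing profile from a healthy one in an instance-count series is to count how often the direction of change reverses. This heuristic is an assumption made for illustration, not a method described in the talk, and the sample series are invented.

```python
from typing import List


def direction_changes(counts: List[int]) -> int:
    """Count reversals between growing and shrinking in an instance-count series."""
    changes = 0
    prev_dir = 0
    for a, b in zip(counts, counts[1:]):
        d = (b > a) - (b < a)  # +1 growing, -1 shrinking, 0 flat
        if d != 0:
            if prev_dir != 0 and d != prev_dir:
                changes += 1
            prev_dir = d
    return changes


# Hypothetical profiles sampled at a fixed interval.
healthy = [10, 12, 14, 16, 16, 14, 12, 10]
thrashing = [10, 14, 10, 14, 10, 14, 10]
healthy_reversals = direction_changes(healthy)
thrashing_reversals = direction_changes(thrashing)
```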
EC2: Improve system utilization
• Once autoscaling is in place, focus on improved system utilization
  • OLTP workloads target 45-60% CPU utilization
  • Batch workloads target 80%+ to maximize throughput
• Need to be cautious
  • Some services can have network IO, or another non-CPU resource, as the primary limiting factor
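The CPU targets translate into a simple sizing estimate. The sketch below assumes CPU utilization scales linearly with load per instance, which is exactly the assumption the caution about non-CPU bottlenecks warns against, so treat the result as a starting point for a load test rather than an answer. The function name and inputs are hypothetical.

```python
import math


def instances_for_target(current_instances: int, current_cpu_pct: int,
                         target_cpu_pct: int) -> int:
    """Estimate the group size that would run at target_cpu_pct.

    Assumes total CPU demand stays constant and spreads evenly across instances.
    """
    return math.ceil(current_instances * current_cpu_pct / target_cpu_pct)


# A group of 100 instances idling at 30% CPU, resized for two OLTP targets.
at_50 = instances_for_target(100, 30, 50)
at_45 = instances_for_target(100, 30, 45)
```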
SQS: Usage and optimization
• Analytics and log processing infrastructure leverage SQS heavily
• Cost is a function of request and data transfer volume
  • Messages typically small, primary optimization is through request rate reduction
• Adopted AWS SQS API batch capabilities as they evolved
  • SQS batching allows up to 10 messages per batch
• 5B messages a day in Q1 2012
• Implemented batch send and delete capabilities mid-2012
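Since cost follows request count and a batch carries up to 10 messages, grouping messages before sending cuts requests by up to a factor of ten. A minimal sketch of the grouping and the arithmetic (the helper names are made up; in a real client each group would map to one SendMessageBatch request):

```python
from typing import Iterator, List, Sequence

MAX_BATCH = 10  # SQS limit on messages per batch request


def batches(messages: Sequence[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` messages."""
    for i in range(0, len(messages), size):
        yield list(messages[i:i + size])


def requests_needed(message_count: int, size: int = MAX_BATCH) -> int:
    """Number of batch requests needed, using integer ceiling division."""
    return (message_count + size - 1) // size


msgs = [f"event-{i}" for i in range(25)]
grouped = list(batches(msgs))
daily_requests = requests_needed(5_000_000_000)
```

At 5 billion messages a day, fully batched sends need at least 500 million requests instead of 5 billion; the same ratio applies to batched deletes.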
SQS: Request Rate Reduction
[Figure: SQS requests/day over time; annotations in order: “Adopted batch delete”, “Started batch send adoption”, “Batch capabilities adoption complete”]
S3: Buckets…
• S3 usage can take off quickly
• Basic management tactics
  • Optimize access: Reduce payload size, reduce number of accesses
  • Age data out with TTL: Deletes are free; scans to find items to delete are not
• Investigate unexpected access patterns and growth trends
  • Misconfigured archive processes
  • S3 accesses failing auth at high rates
  • Large files decomposed into multi-part upload; each “part” is an access
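The point that each part is an access can be made concrete: a multipart upload issues one request to initiate, one per part, and one to complete. The sketch below counts those requests for a given part size; the object and part sizes are illustrative choices.

```python
MIB = 1024 * 1024


def multipart_requests(size_bytes: int, part_size_bytes: int) -> int:
    """Requests for one multipart upload: initiate + one per part + complete."""
    parts = (size_bytes + part_size_bytes - 1) // part_size_bytes
    return parts + 2


one_gib = 1024 * MIB
small_parts = multipart_requests(one_gib, 5 * MIB)   # 205 parts
large_parts = multipart_requests(one_gib, 64 * MIB)  # 16 parts
```

For the same 1 GiB object, 5 MiB parts cost 207 requests and 64 MiB parts cost 18, so part size is worth checking when request counts grow unexpectedly.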
S3: Logs…
• Can quickly become a primary consumer of S3 capacity
• Reduce volume and access rate
  • Provide platform libraries with desired behavior
  • Push logs at infrequent intervals and set appropriate expiry tags
• For “mined” log data find alternate streamlined repositories
  • Netflix streams data through Chukwa and into Hive for reporting purposes
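Pushing logs at infrequent intervals usually means buffering lines locally and uploading one larger object per interval. The class below is a hypothetical sketch of that behavior, with the upload function and the clock injected so it does not depend on any S3 client; it is not the Netflix platform library.

```python
from typing import Callable, List, Optional


class LogBuffer:
    """Buffer log lines and upload them as one payload per interval."""

    def __init__(self, upload: Callable[[str], None], interval_s: int) -> None:
        self.upload = upload          # hypothetical uploader, e.g. one S3 PUT
        self.interval_s = interval_s
        self.lines: List[str] = []
        self.last_flush: Optional[int] = None

    def append(self, line: str, now: int) -> None:
        """Add a line; flush if the interval has elapsed since the last flush."""
        if self.last_flush is None:
            self.last_flush = now
        self.lines.append(line)
        if now - self.last_flush >= self.interval_s:
            self.flush(now)

    def flush(self, now: int) -> None:
        """Upload buffered lines as a single payload and reset the buffer."""
        if self.lines:
            self.upload("\n".join(self.lines))
        self.lines = []
        self.last_flush = now


uploads: List[str] = []
buf = LogBuffer(uploads.append, interval_s=300)
for line, t in [("a", 0), ("b", 100), ("c", 300), ("d", 350)]:
    buf.append(line, t)
```

Four log lines result in a single upload here, with the last line still buffered for the next interval.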
Performance Testing
• Load tests in test environment
  • Primarily used to evaluate ASG size requirements and characterize service resource profile
  • Up to production scale infrastructure
  • Leverage homegrown Jenkins + JMeter load test framework
• “Squeeze tests”
  • Primarily used to identify per-instance capacity
  • Distribute traffic in production across multiple ASGs
  • Reduce size of one ASG to evaluate impact of increased request rate on both performance and utilization characteristics
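The arithmetic behind a squeeze test is straightforward if traffic is split evenly across the ASGs regardless of their size; that even split is an assumption made for this sketch, as are the traffic figures.

```python
from typing import List


def per_instance_rps(total_rps: float, asg_sizes: List[int], index: int) -> float:
    """Request rate seen by each instance in one ASG.

    Assumes every ASG receives an equal share of total traffic.
    """
    share = total_rps / len(asg_sizes)
    return share / asg_sizes[index]


# Three ASGs of 20 instances share 6,000 requests per second.
baseline = per_instance_rps(6000, [20, 20, 20], 0)
# Squeeze the first ASG down to 10 instances.
squeezed = per_instance_rps(6000, [10, 20, 20], 0)
```

Halving the squeezed ASG doubles the per-instance rate for that group only, so latency and utilization at the higher rate can be observed while the other groups keep serving at the normal rate.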
Results: Efficiency improvements…validated
• 2x the customer traffic, same amount of AWS as 10 months ago
• Optimized EC2: fewer, larger pools of instance types
• Batch activity leverages unused reservation capacity
• Engineering velocity remains unconstrained by capacity management
Netflix Open Source - @NetflixOSS on Github
Open Source Projects - @NetflixOSS on Github
Legend: Github / Techblog; Apache Contributions; Techblog Post Only; Coming Soon
• Priam - Cassandra as a Service
• Astyanax - Cassandra client for Java
• CassJMeter - Cassandra test suite
• Cassandra Multi-region EC2 datastore support
• Aegisthus - Hadoop ETL for Cassandra
• Explorers
• Governator - Library lifecycle and dependency injection
• Odin - Workflow orchestration
• Blitz4j - Async logging
• Exhibitor - Zookeeper as a Service
• Curator - Zookeeper Patterns
• EVCache - Memcached as a Service
• Eureka / Discovery - Service Directory
• Archaius - Dynamic Properties Service
• Edda - Queryable config history
• Server-side latency/error injection
• REST Client + mid-tier LB
• Configuration REST endpoints
• Servo and Autoscaling Scripts
• Honu - Log4j streaming to Hadoop
• Circuit Breaker - Hystrix - Robust service pattern
• Asgard - AutoScaleGroup based AWS console
• Chaos Monkey - Robustness verification
• Latency Monkey
• Janitor Monkey
• Bakeries and AMI
• Build dynaslaves
Netflix at 2012 re:Invent
Date/Time        Presenter            Topic
Wed 8:30-10:00   Reed Hastings        Keynote with Andy Jassy
Wed 1:00-1:45    Coburn Watson        Optimizing Costs with AWS
Wed 2:05-2:55    Kevin McEntee        Netflix’s Transcoding Transformation
Wed 3:25-4:15    Neil Hunt / Yury I.  Netflix: Embracing the Cloud
Wed 4:30-5:20    Adrian Cockcroft     High Availability Architecture at Netflix
Thu 10:30-11:20  Jeremy Edberg        Rainmakers – Operating Clouds
Thu 11:35-12:25  Kurt Brown           Data Science with Elastic Map Reduce (EMR)
Thu 11:35-12:25  Jason Chan           Security Panel: Learn from CISOs working with AWS
Thu 3:00-3:50    Adrian Cockcroft     Compute & Networking Masters Customer Panel
Thu 3:00-3:50    Ruslan M./Gregg U.   Optimizing Your Cassandra Database on AWS
Thu 4:05-4:55    Ariel Tseitlin       Intro to Chaos Monkey and the Simian Army
We are sincerely eager to hear your FEEDBACK on this presentation and on re:Invent.
Please fill out an evaluation form when you have a chance.
Contact: cwatson@netflix.com