#surgeconf scaling twitter to go after the fail whale

Scaling Twitter To Go After the Fail Whale Jonathan Reichhold - Twitter Engineering

Upload: jonathan-reichhold

Post on 15-Jan-2015


DESCRIPTION

Originally known for a "fail whale" that occurred frequently on the site, Twitter has changed significantly to make sure we are available, without a blip, no matter what is happening around the world. This goal felt unattainable three years ago, when the 2010 World Cup put Twitter squarely in the center of a real-time, global conversation. The influx of Tweets (from every shot on goal, penalty kick, and yellow or red card) repeatedly took its toll and made Twitter unavailable for short periods of time. Engineering worked through the nights during this time, desperately trying to find and implement order-of-magnitude efficiency gains. Unfortunately, those gains were quickly swamped by Twitter’s rapid growth, and engineering had started to run out of low-hanging fruit to fix.

After that experience, we determined we needed to step back and re-architect the site to support the continued growth of Twitter and to keep it running smoothly. Since then we’ve worked hard to make sure that the service is resilient to the world’s impulses. We’re now able to withstand events like Castle in the Sky viewings, the Super Bowl, and the global New Year’s Eve celebration.

This re-architecture has not only made the service more resilient when traffic spikes to record highs, but also provides a more flexible platform on which to build more features faster, including synchronizing direct messages across devices, Twitter cards that allow Tweets to become richer and contain more content, and a rich search experience that includes stories and users. And more features are coming. This talk will cover some of the lessons learned and the changes made not only to grow, but also to become more resilient to world events and less fragile to whales.

TRANSCRIPT

Page 1:

Scaling Twitter To Go After the Fail Whale

Jonathan Reichhold - Twitter Engineering

Page 2:

Early Twitter....

Page 3:

2010 World Cup Challenge

•Tweet and user requests growing exponentially (good problem)

Page 4:

Load....

Page 5:

Monolithic Architecture

•Ruby on Rails

•Temporally-sharded MySQL

•Memcached

•~60 engineers
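To make "temporally-sharded" concrete, here is a minimal sketch of time-based shard selection. The shard boundaries and the `shard_for` helper are hypothetical, invented for illustration; the slides do not describe the actual scheme. What the sketch shows is the general property of sharding by creation time: all new writes land on the newest shard.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical shard boundaries (not from the talk): each shard holds rows
# created on or after its start time and before the next shard's start time.
SHARD_STARTS = [
    datetime(2009, 1, 1),
    datetime(2009, 7, 1),
    datetime(2010, 1, 1),
    datetime(2010, 6, 1),
]


def shard_for(created_at: datetime) -> int:
    """Return the index of the shard whose time range contains created_at."""
    index = bisect_right(SHARD_STARTS, created_at) - 1
    if index < 0:
        raise ValueError("timestamp predates the first shard")
    return index


# An hour of writes during a live event: every one maps to the newest shard,
# so write load concentrates there while older shards stay comparatively idle.
event_writes = [datetime(2010, 6, 11, 14, minute) for minute in range(60)]
shards_hit = {shard_for(ts) for ts in event_writes}
```

In this sketch `shards_hit` contains a single shard index, which is the kind of hot spot that makes a time-partitioned store hard to scale for write-heavy spikes.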

Page 6:

Stabilize & Understand

•Learn & make improvements

•Don’t just survive

Page 7:

Be Realistic & Ambitious

•Prioritize what can be fixed and timeframes for doing it

•Sometimes need the duct tape

•Find patterns and improvements for the long term

Page 8:

A Bad Approach

•Flip switches/branches/other until fixed

http://www.flickr.com/photos/chrism70/1144424032

Page 9:

Science

Page 10:

Step 1: Trustworthy Data

• https://blog.twitter.com/2013/observability-at-twitter

Page 11:

Step 2: Set Expectations

•Being on-call is a job, and sustained high stress will burn folks out

•Maintain calm and order

Page 12:

Post Mortems

•Improvement becomes part of process

•Stress makes system stronger not weaker

Page 13:

Teamwork

•All of this made possible by amazing team and management

•Culture

Page 14:

Capacity Planning & Forecast

•Just in time but realistic

•Figure out real buffers

Page 15:

Longer Term Changes

•Architecture changes take time and changes in organization

Page 16:

Improve Efficiency

•Rails/Ruby -> Scala & JVM

•200-300 RPS -> 10,000-20,000

•Single process per request -> Finagle
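The move away from one process per request can be illustrated with a small analogy. This is not Finagle (which is a Scala/JVM RPC library); it is a sketch using Python's `asyncio` to show the same general idea, that a single process can hold many requests in flight while each waits on a backend. The `handle_request` and `serve` names and the 10 ms simulated backend wait are assumptions made for the example.

```python
import asyncio

in_flight = 0        # requests currently being handled
peak_in_flight = 0   # highest concurrency observed in this one process


async def handle_request(request_id: int) -> str:
    """Handle one request; the sleep stands in for waiting on a backend."""
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)  # yields control so other requests can proceed
    in_flight -= 1
    return f"response-{request_id}"


async def serve(request_count: int) -> list:
    """Run all requests concurrently inside a single process and event loop."""
    return await asyncio.gather(
        *(handle_request(i) for i in range(request_count))
    )


responses = asyncio.run(serve(100))
```

With a process-per-request model, those 100 overlapping requests would need 100 worker processes; here they share one, which is one reason per-host throughput can rise sharply when request handling becomes asynchronous.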

Page 17:

Service Orientation

•Make changes at interface boundary, not in single monolith

•Team interactions simplified

•Core nouns and verbs

Page 18:

Move out of public cloud

•Flexibility and latency demand it at some point

•Hard problem

•Datacenter as failure domain

•Mesos

Page 19:

Dynamic Configuration

•Update routes and compare live vs dark/new

•Quickly adjust to issues

•Faster and less fragile deploys
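One way to read "compare live vs dark/new" is shadow traffic: the live route answers the user while the same request is also sent to the new route and the two results are compared. The sketch below assumes that reading. The `live_route`, `dark_route`, and `serve_with_dark_compare` names are hypothetical, and the dark route is deliberately wrong for odd user IDs so the comparison has something to catch.

```python
from typing import Callable, Dict, List

Response = Dict[str, object]


def live_route(user_id: int) -> Response:
    """Hypothetical current production code path."""
    return {"user_id": user_id, "timeline": [3, 2, 1]}


def dark_route(user_id: int) -> Response:
    """Hypothetical new code path; drops a Tweet for odd user ids on purpose."""
    timeline = [3, 2, 1] if user_id % 2 == 0 else [3, 1]
    return {"user_id": user_id, "timeline": timeline}


def serve_with_dark_compare(
    user_id: int,
    live: Callable[[int], Response],
    dark: Callable[[int], Response],
    mismatches: List[int],
) -> Response:
    """Always return the live response; record requests where dark disagrees."""
    live_response = live(user_id)
    try:
        if dark(user_id) != live_response:
            mismatches.append(user_id)
    except Exception:
        # A failure in the dark path must never affect the user-facing result.
        mismatches.append(user_id)
    return live_response


mismatches: List[int] = []
results = [
    serve_with_dark_compare(uid, live_route, dark_route, mismatches)
    for uid in range(4)
]
```

Users only ever see `live_route` output, while `mismatches` accumulates the requests on which the new path disagreed, giving evidence about the new route before any traffic is switched to it.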

Page 20:

Improve storage

•Gizzard for MySQL

•Improve Memcached

•Storage as a service

•Snowflake IDs
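The Snowflake bullet can be sketched as follows. The bit layout used here (a millisecond timestamp, then 10 worker bits, then a 12-bit sequence) and the custom epoch follow the publicly released Snowflake project as best understood; treat them, and the `IdGenerator` class itself, as assumptions rather than a description of Twitter's production service.

```python
import time

EPOCH_MS = 1288834974657   # custom epoch used by the open-source Snowflake
WORKER_BITS = 10           # which machine generated the id
SEQUENCE_BITS = 12         # counter for ids issued within one millisecond
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class IdGenerator:
    """Illustrative generator of roughly time-ordered 64-bit ids."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock = clock
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        now = self._clock()
        if now < self._last_ms:
            raise RuntimeError("clock moved backwards")
        if now == self._last_ms:
            self._sequence = (self._sequence + 1) & MAX_SEQUENCE
            if self._sequence == 0:
                # Sequence exhausted for this millisecond: wait for the next.
                while now <= self._last_ms:
                    now = self._clock()
        else:
            self._sequence = 0
        self._last_ms = now
        return (
            ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
            | (self.worker_id << SEQUENCE_BITS)
            | self._sequence
        )


generator = IdGenerator(worker_id=1)
ids = [generator.next_id() for _ in range(1000)]
```

Because the timestamp occupies the high bits, ids from one generator sort by creation time, and no coordination between workers is needed as long as each has a distinct `worker_id`.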

Page 21:

Development Speed

•Startups live and die by development speed

•Make it easier to ship but contain damage

Page 22:

Conclusion

•Fail whale is now an endangered species

•Went from event-driven spikes to pushing continuous reliability improvements where events became trivial

Page 23:

Tweet Spikes Today

•New Tweets per second (TPS) record: 143,199 TPS (August 2 at 7:21:50 PDT; August 3 at 11:21:50 JST)

•Typical day: more than 500 million Tweets sent; average 5,700 TPS

• https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
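The figures on this slide can be checked with a few lines of arithmetic. The constants come from the slide; the variable names are only for illustration, and the slide's 5,700 TPS average is presumably a rounded figure.

```python
TWEETS_PER_DAY = 500_000_000        # "more than 500 million" per the slide
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
QUOTED_AVERAGE_TPS = 5_700          # average quoted on the slide
PEAK_TPS = 143_199                  # record quoted on the slide

# 500 million Tweets spread evenly over a day is roughly 5,787 per second,
# in line with the quoted average of about 5,700.
computed_average_tps = TWEETS_PER_DAY / SECONDS_PER_DAY

# The record second carried roughly 25 times the quoted average load.
peak_to_average = PEAK_TPS / QUOTED_AVERAGE_TPS
```

A peak about 25 times the average is the gap that capacity buffers have to cover when a single event drives traffic.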

Page 24:

Final Thoughts

•Marathon not a sprint. Maintain systems and yourself

•We are hiring to make the system even better

Page 25:

Endangered: Fail Whale

Jonathan Reichhold

@jreichhold

Page 26:

Questions?

•https://blog.twitter.com/2013/new-tweets-per-second-record-and-how

•https://blog.twitter.com/2013/observability-at-twitter