making sites reliable (как сделать систему надежной) (Павел...
Post on 16-Jun-2015
351 Views
Preview:
TRANSCRIPT
Making Sites ReliableHuman processes to increase reliability
Andrey Tatarinov, Pavel Uvarov
Site Reliability Engineering, Google
Engineering aspects and roles
● Software Development (SWE)○ Designing○ Coding
● Provisioning (Long term preventing of outages) (SWE+SRE)○ Capacity planning○ Automation○ Operations feedback
● Operations (Short term preventing of outages) (SRE)○ Manual response (oncall)○ Site (system) administration
Spreading knowledge
● Mad genius problem● Sociopath problem● Introducing new team member● Shuffling teams● Low "bus factor"
● Medium and large teams 30+ people● Startups have other problems
Sweet spot (or somewhere around)
● TLDR: Speak more to teammates ● Peer-to-peer review in every process● Data driven decision making● Group decision making● Priorities
○ Widespread knowledge○ Predictable quality○ Leveling extremes
● Key point
○ Human processes that scale○ Principles are omnipresent○ Not educational or disciplinary, but part of day-to-day life
Design review
● Each significant change triggers
● Document that captures○ Problem statement○ Requirements○ Proposed solution○ Costs and benefits
■ Development■ Resources
○ Risk factors and solutions
● Design review meeting○ Different experts: PM,
SWEs, SREs○ Different priorities: new
features, architecture sanity, stability
● Result: ○ Widespread knowledge○ Balanced compromise
with new functionality, sanity and reliability
Code review
● Before commit, not after● Style guide
○ "code is good when you can't tell who wrote it"○ Readability
● Peer-to-peer○ No individual ownership○ Each change should be reviewed by other engineer with expertise
in this area● Result
○ Widespread knowledge○ Reliable changes○ Easy to read and modify, consistent code
Knowledge externalization
● Oncall engineer wakes SWE up at 2am ● External knowledge database● Long-term memory
○ Playbook● Short-term memory
○ Alert history○ Hand-off
Manual response
● No L0, L1, L2 operations levels● Anyone with basic knowledge can react on 80~90% of alerts
○ Knowledge is included○ Human is an intellegent executor
■ Prevents feedback loops■ Can identify anomalies■ Feedback for provisioning and automation
● Weekly/Monthly/Quarterly oncall review○ More people are aware○ Top issues identified○ Automation/rearchitecturing planned
● To get into production every service has to comply with● Checklist derived from experience
○ Continuous builds/testing○ Load testing○ Capacity planning○ General health and user-facing monitoring/alerting○ Identified potential issues
■ Monitoring and failover scenarios○ Playbook entries
● Result: service which behaves predictably; any SRE can react on outage
Productionization
Post-mortem
● Each large outage triggers post-mortem and p-m meeting● Document with
○ Impact○ Timeline○ Root cause○ Action items to prevent root cause from happening
● Post-mortem review● Result
○ Widespread knowledge○ This class of root causes is likely not to happen again
Site Reliability Engineering
Reliability
● The ability of a person or system to perform and maintain its functions in routine circumstances, as well as hostile or unexpected circumstances.
Wikipedia ● Unexpected situations cannot be handled by a computer program.
Otherwise it's not unexpected.Common Logic
We will talk about human processes
Site
● Distributed complicated system○ Many machines, switchers, routers, racks, etc○ Lots of data○ Many services○ But not so many people (machines:admins > 4000:1)
● Everything breaks all the time○ Hardware...
■ Fan stopped, bad memory, disk died, ...○ Software...
■ Server not running, wrong version, slow response, ...○ Network...
Site Reliability
● Suppose we have the software to run on the site● It must just work● How to insure it?
○ Engineer the production environment for reliability■ Automate whatever possible
○ Engineer systems and tools that increase reliability■ Monitoring and alerting■ Databases that keep track of machines, tasks and even traffic
○ Handle unexpected situations manually■ Oncall
Automate...
● If a person doesn't make any desicions he/she can be replaced with a script which is much more scalable and reliable
● Automate○ failovers○ load balancing○ response to failures○ routine repairs (reinstalling, draining)
● Determine physical configuration automatically● Expect the unexpected
○ rare events are completely normal
Preventing outages
● In reality the software is always being upgraded● External world is changing too● Preventing problems.
○ Capacity planning & management○ Review design docs○ Prelaunch (onboarding) review○ Adapt production environment to (external) changes
Oncall
● Monitoring○ Every program/server is universally monitored
● Alerting○ Alert rules○ Alert escalation○ Pager
● Oncall "memory"○ Playbook (runbook)○ Oncall hand-off○ Tracking of issues
● Oncall review weekly/monthly -> feedback● Shifts
Monitoring and Alerting
alert rules
variables history
monitoring variables
services
alert (page)primary oncall
secondary oncall
the team
escalation
monitoring system
Preserving oncall "memory"
● Long term "memory"○ Playbook (useful hints to handle alerts)
● Short term "memory"○ Hand-off (description of the shift for the next oncall)○ Tracking recent issues
Carrying the pager. Shifts Are Geographically Distributed
SWE vs SRE
SWE: I want to believe!
Software Engineers are very devout believers. They believe that: ● Their software does not contain bugs● The datacenters and machines are always "up" and gratis● The network has infinite capacity● The speed of light does not apply to their system● Disks never fail and seek times are close to zero msec.● Configuration files do not contain (syntax) errors
SREs are cynics!
Site reliability engineers know that:● All software sucks● Everything always fails, preferably in the most inconvenient order● Complexity is the enemy of reliability● The laws of physics are real, even inside a virtual machine● All processes that contain a manual component are highly failure
prone● Traffic forecasts are bogus
Ouch
Questions?
Thanks!
top related