Evolution of the Operations Team at Spotify / Lev Popov (Spotify)
Operations Engineering Evolution at Spotify
Lev Popov
Site Reliability Engineer
@nabamx
Who am I?
Lev Popov
• Service Reliability Engineer at Spotify
• Joined Spotify in 2014
• Previously: QIK – Skype – Microsoft
Background in services and networks operations
What is Spotify?
Some Numbers
• Over 60 million MAU (monthly active users)
• Over 15 million paying subscribers
• Over 30 million tracks
• Over 1.5 billion playlists
• Over 20,000 songs added per day
Capacity We Own
• 4 Data Centers
• Over 7000 bare metal servers
• Many different services
• Pushing an average of 35GBps to the Internet
• 24/7/365
But let's talk about operations
In the beginning was the…
[Diagram: Service, Dev owner, Ops owner]
Operations team
• On-call
• Monitoring
• Build systems
• Backups
• DB
• Networks
• …
Operations Team in 2011
Thin group of 5 people
• Over 10 million users
• Over 2 million paying subscribers
• 12 Countries
• Over 15 million tracks
• Over 400 million playlists
• 3 datacenters
• Over 1300 servers
Operations Team Now
?
• Over 60 million users
• Over 15 million paying subscribers
• 58 Countries
• Over 30 million tracks
• Over 1.5 billion playlists
• 4 datacenters
• Over 5000 servers
Operations Team Now
No team
• Over 60 million users
• Over 15 million paying subscribers
• 58 Countries
• Over 30 million tracks
• Over 1.5 billion playlists
• 4 datacenters
• Over 5000 servers
Spotify Engineering Culture
How We Scale
• Service-oriented architecture: separate services for separate features
• UNIX way: small, simple programs doing one thing well
• KISS principle: simple applications are easier to scale
How Spotify Works
Scaling Agile
• Squad is similar to a scrum team
• Designed to feel like a small startup
• Self-organizing teams
• Autonomy to decide their own way of working
[Diagram: Service, Dev owner]
Can we scale that?
[Diagram: Service, Dev owner, Ops owner]
Operations team
• On-call
• Monitoring
• Build systems
• Backups
• DB
• Networks
• …
Ops in Squads
Ops in Squads Background
Impossible to scale a central operations team
• Understaffed
• Difficult to find generalists
We believe that operations has to sit close to development
Our bet on autonomy
• Break dependencies
• End-to-end responsibility
Timeline
[Timeline figure: 2008, Early 2011, Mid 2012, Sep 2013. Labels: Dev, Backend Infrastructure, I/O, Operations, SRE, Internal IT, Operations in Squads]
Infrastructure Operations
[Diagram: feature squads in a product area; IO Tribe (networks, conf mgmt, containers); enable + support]
Ops in Squads Expectations
Wait, wait, but what if…
Core SRE
• Major Incidents
• Scalability Issues
• Systems Design Problems
• Teaching Best Practices in General
[Diagram: squads, Core SRE, IO Tribe]
Incident Management
[Diagram: Incident, Postmortem, Remediation; Incident Manager, On-Call; Everybody involved in an incident]
Postmortems
• Plan for post-mortems
• Keep it close in time
• Record the project details
• Involve everyone
• Get it in writing
• Record successes as well as failures
• It's not for punishment
• Create an action plan
• Make it available
On-call follows the sun
[Diagram: on-call shared between Stockholm and New York. Times marked: 07 CET / 01 EST and 19 CET / 13 EST. Levels: L0 Stockholm / New York, L1 SA Product Owners, L2 SA Lead]
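The follow-the-sun handover could be sketched roughly as below. This is only an illustration, not Spotify's tooling: the function name `on_call_site` is hypothetical, and the split (Stockholm covering 07:00 to 19:00 CET, New York covering the other twelve hours) is an assumption read off the times on the slide. Daylight saving time is ignored.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets matching the slide labels (CET = UTC+1, EST = UTC-5).
# Assumption: daylight saving time is ignored to keep the sketch small.
CET = timezone(timedelta(hours=1), "CET")
EST = timezone(timedelta(hours=-5), "EST")


def on_call_site(moment: datetime) -> str:
    """Return the site assumed to hold L0 on-call at `moment`.

    Assumption (not stated in the talk): Stockholm covers 07:00-19:00 CET,
    New York covers the remaining hours (13:00-01:00 EST).
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour_cet = moment.astimezone(CET).hour
    return "Stockholm" if 7 <= hour_cet < 19 else "New York"


# Escalation levels as labelled on the slide (L0 is whichever site is on call).
ESCALATION = ["L0: on-call site", "L1: SA Product Owners", "L2: SA Lead"]

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(f"L0 on-call right now: {on_call_site(now)}")
```

For example, `on_call_site(datetime(2015, 3, 2, 12, 0, tzinfo=CET))` falls inside the assumed Stockholm window, while 20:00 EST on the same day falls to New York.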
Areas of Improvement
• The expectations we place on squads are sometimes unclear
• Communication between feature teams and infrastructure teams
• It’s hard to measure the success of ops in squads
• Abandoned services and other ownership issues
Thank you.