Эволюция службы эксплуатации «spotify» / Лев Попов (spotify)

Operations Engineering Evolution at SpotifyLev PopovSite Reliability Engineer@nabamx

Who am I?

Lev Popov Service Reliability Engineer in Spotify Joined Spotify in 2014 Previous QIK – Skype – Microsoft

Background in services and networks operations

What is Spotify?

Some Numbers

• Over 60 million MAU (monthly active users)• Over 15 million paying subscribers• Over 30 million tracks• Over 1.5 billion playlists• Over 20.000 songs added per day

Capacity We Own

• 4 Data Centers• Over 7000 bare metal servers• Many different services• Pushing an average of 35GBps to the Internet• 24/7/365

But let's talk about operations

Service

Service

Service

Service

Dev owner

In the beginning was the…Dev owner

Ops owner

Dev owner

Ops owner

Operations team

Dev owner

On-callMonitoring

Build systems

BackupsDBNetworks…

Operations Team in 2011

Thin group of 5 people

• Over 10 million users• Over 2 million paying subscribers• 12 Countries• Over 15 million tracks• Over 400 million playlists• 3 datacenters• Over 1300 servers

Operations Team Now

?• Over 60 million users• Over 15 million paying

subscribers• 58 Countries• Over 30 million tracks• Over 1.5 billion playlists• 4 datacenters• Over 5000 servers

Operations Team Now

No team• Over 60 million users• Over 15 million paying

subscribers• 58 Countries• Over 30 million tracks• Over 1.5 billion playlists• 4 datacenters• Over 5000 servers

Spotify Engineering Culture

How We Scale

• Service oriented architectureSeparate services for separate features

• UNIX waySmall simple programs doing one thing well

• KISS principleSimple applications are easier to scale

How Spotify Works

Scaling Agile

• Squad is similar to a scrum team

• Designed to feel like a small startup

• Self organizing teams• Autonomy to decide

their own way of working

Scaling Agile

ServiceDev owner

Service

Can we scale that?

Service

Dev owner

Ops owner

Service

Dev owner

Ops owner

Operations team

Dev owner

On-callMonitoring

Build systems

BackupsDBNetworks…

Ops in Squads

Ops in Squads Background

Impossible to scale a central operations team• Understaffed• Difficult to find generalists

We believe that operation has to sit close to development

Our bet for autonomy• Break dependencies• End to end responsibility

Timeline

DevDev

Backend InfrastructureI/O

Operations

SRE

Internal IT

Operations in Squads

2008 Early 2011 Mid 2012 Sep 2013

Infrastructure Operations

featuresquad

featuresquad

featuresquad

featuresquad

IOTribe

networksconf mgmt containers

featuresquad

enable + support

product area

Ops in SquadsExpectations

Wait, wait, but what if…

squad

Core SRE

Core SRE

IOTribe

Major Incidents Scalability IssuesSystems Design Problems

Teaching Best Practices in General

squad squad squadsquad

Incident Management

Incident Management

Incident Postmortem

Remediation

Incident ManagerOn-Call

Everybody involved in an incident

Postmortems

• Plan for post-mortems• Keep it close in time• Record the project details• Involve everyone• Get it in writing• Record successes as well as failures• It's not for punishment• Create an action plan• Make it available

On-call follows the sun

StockholmNew York

StockholmNew York

StockholmNew YorkL0

SA Product OwnersL1

SA LeadL2

19 CET

01 EST

19 CET

01 EST

07 CET 07 CET

13 EST13 EST

19 CET

13 EST

Areas of Improvement

Areas of Improvement

• The expectations we place on squads are sometimes unclear

• Communication between feature teams and infrastructure teams

• It’s hard to measure ops in squads success

• Abandoned services and other ownership issues

Thank you.

@[email protected]