Principles of Chaos Engineering

Chaos Engineering Hamburg Marvin Hoffmann | Computer Scientist 15.12.2015

Upload: hmarvin

Post on 13-Apr-2017

TRANSCRIPT

Page 1: Principles of Chaos Engineering

Chaos Engineering Hamburg

Marvin Hoffmann | Computer Scientist 15.12.2015

Page 2: Principles of Chaos Engineering

Agenda

1. AWS Basics and Intro

2. Evolution of Chaos Testing

3. Tooling

4. Chaos Engineering

Page 3: Principles of Chaos Engineering

AWS Basics

Regions: Europe West (Ireland), US East (N. Virginia)

Availability Zones (AZs)

Instances

Page 4: Principles of Chaos Engineering

Chaos? - What do we mean?

Page 5: Principles of Chaos Engineering

“A way to improve availability is to install proven hardware and software, and then leave it alone.”

Jim Gray, “Why Do Computers Stop and What Can Be Done About It?”

Page 6: Principles of Chaos Engineering

Be reliable!

• Systems need to be reliable

• Nuclear weapon arsenal, heart rate monitoring, World of Warcraft servers, streaming business

• Third-party dependencies (software and hardware)

Page 7: Principles of Chaos Engineering

DynamoDB Outage US-East

• “… there was a brief network disruption that impacted a portion of DynamoDB’s storage servers.”

• 2:19am until 7:10am PDT

• “There are several other AWS services that use DynamoDB that experienced problems during the event.”

• SQS, EC2 auto scaling, CloudWatch

Source: https://aws.amazon.com/message/5467D2/

Page 8: Principles of Chaos Engineering

It’s not always the infrastructure

• Deployments themselves may cause issues

• Unpredicted behaviour after a change has been rolled out

• Issues during rollback

• Change in client / user behaviour

Page 9: Principles of Chaos Engineering

Evolution of Chaos Testing

Page 10: Principles of Chaos Engineering

Do the simplest thing first

• Prepare for your machines to die

• “Cattle, not pets” (Adrian Cockcroft)

• Resilience through redundancy

• Stateless machines

Page 11: Principles of Chaos Engineering

Deal with infrastructure issues

• Latency between instances

• Packet loss

• Ports blocked

• or even outages of an entire AZ
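The network faults on this slide can be rehearsed in code before touching real infrastructure. The following is a minimal sketch, not a tool from the talk: `with_faults` and `InjectedFault` are hypothetical names, and the wrapper only simulates latency and dropped calls around an ordinary Python callable.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised when the wrapper simulates a dropped call (hypothetical name)."""


def with_faults(
    call: Callable[[], T],
    *,
    latency_s: float = 0.0,
    drop_probability: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, optionally delaying it or failing it with a given probability."""
    rng = rng or random.Random()
    # Simulate packet loss or a blocked port: the call never reaches its target.
    if rng.random() < drop_probability:
        raise InjectedFault("simulated dropped call")
    # Simulate latency between instances.
    if latency_s > 0:
        sleep(latency_s)
    return call()
```

Passing `sleep` and `rng` as parameters keeps the sketch deterministic in tests; a caller that handles `InjectedFault` with a retry or fallback is the behaviour such an experiment would try to confirm.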

Page 12: Principles of Chaos Engineering

Think big!

• Remember that DynamoDB failure?

• Outage of an entire AWS region!

• You’ll need more than one region in the first place

• Re-routing of entire traffic from one region to another

• Any region needs to be able to scale to take the load of two regions
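The capacity bullet can be made concrete with a little arithmetic. This is a sketch under simplifying assumptions that are not from the talk: every region carries the same peak load, and the traffic of the failed region is spread evenly over the survivors. The function name is hypothetical.

```python
def load_after_region_loss(regions: int, peak_load_per_region: float) -> float:
    """Load each surviving region must carry after one region is lost.

    Assumes equal load per region and even redistribution of traffic.
    """
    if regions < 2:
        raise ValueError("failover needs at least two regions")
    total_load = regions * peak_load_per_region
    return total_load / (regions - 1)
```

With two regions the survivor carries both loads, which is the case the slide describes; with three regions each survivor needs 50% headroom instead of 100%.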

Page 13: Principles of Chaos Engineering

Tooling (meet the Monkeys)

Page 14: Principles of Chaos Engineering

Chaos Monkey

Kills random instances in your account
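The idea can be illustrated without any AWS calls. The sketch below is an in-memory simulation, not the Chaos Monkey implementation: `Fleet`, `reconcile`, and `kill_random_instance` are hypothetical names, and the instance ids are made up.

```python
import random
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Fleet:
    """In-memory stand-in for a group of interchangeable, stateless instances."""

    desired: int
    instances: Set[str] = field(default_factory=set)
    _counter: int = 0

    def reconcile(self) -> None:
        """Launch replacements until the fleet is back at its desired size."""
        while len(self.instances) < self.desired:
            self._counter += 1
            self.instances.add(f"i-{self._counter:04d}")


def kill_random_instance(fleet: Fleet, rng: random.Random) -> str:
    """Terminate one instance chosen at random and return its id."""
    victim = rng.choice(sorted(fleet.instances))
    fleet.instances.remove(victim)
    return victim
```

Calling `reconcile()` after each kill mimics what an auto scaling group would do; if the service only works with one specific instance alive, the simulation makes that dependency visible.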

Page 15: Principles of Chaos Engineering

Chaos Gorilla

Kills a random AZ in your account
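Whether a service would survive such an event can be estimated up front. The following sketch assumes a hypothetical mapping from AZ name to running instance count and checks the worst case, where the most heavily populated AZ is the one that disappears. The function names are illustrative only.

```python
from typing import Dict


def capacity_after_worst_az_loss(instances_per_az: Dict[str, int]) -> int:
    """Instances left if the most heavily populated AZ disappears."""
    if not instances_per_az:
        return 0
    return sum(instances_per_az.values()) - max(instances_per_az.values())


def survives_az_loss(instances_per_az: Dict[str, int], required: int) -> bool:
    """True if the remaining instances still cover the required count."""
    return capacity_after_worst_az_loss(instances_per_az) >= required
```

An even spread such as `{"eu-west-1a": 2, "eu-west-1b": 2, "eu-west-1c": 2}` keeps four instances after any single AZ loss, while `{"eu-west-1a": 4, "eu-west-1b": 1}` can drop to one.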

Page 16: Principles of Chaos Engineering

Chaos Kong

Kills an entire AWS region in your account

Page 17: Principles of Chaos Engineering

What’s in it?

• A compilation of scripts

• Scripts mess with your AWS account

• Thus, they are very AWS specific

• If not on AWS, get inspired and build your toolset around these ideas

• Not a comprehensive toolset

Page 18: Principles of Chaos Engineering

Simian Army

• Latency Monkey

• Conformity Monkey

• Security Monkey

• Doctor Monkey

• 10-18 Monkey

Page 19: Principles of Chaos Engineering

Chaos Engineering

Page 20: Principles of Chaos Engineering

Chaos Engineering

• Systematic approach to Chaos Testing

• Started by Netflix

• Talk about it a lot to attract talent

• Many other companies doing similar things in that field

• Want to grow a community around it

Page 21: Principles of Chaos Engineering

“Experiment on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

Netflix

Page 22: Principles of Chaos Engineering

Four Principles of Chaos Engineering

Page 23: Principles of Chaos Engineering

Know your system

• Operational insight

• What is “normal”? What does a failure look like?

Page 24: Principles of Chaos Engineering

Four Principles of Chaos Engineering

1. Build a hypothesis around steady-state behaviour
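A steady-state hypothesis can be written down as a check against a baseline. This is a minimal sketch, assuming a single numeric metric (for example successful requests per second) and a relative tolerance; the function name and the 5% default are illustrative, not values from the talk.

```python
from statistics import mean
from typing import Sequence


def within_steady_state(
    baseline: Sequence[float],
    observed: Sequence[float],
    tolerance: float = 0.05,
) -> bool:
    """True if the observed mean stays within `tolerance` (relative) of the baseline mean."""
    expected = mean(baseline)
    if expected == 0:
        raise ValueError("baseline mean must be non-zero")
    deviation = abs(mean(observed) - expected) / abs(expected)
    return deviation <= tolerance
```

The baseline answers “what is normal?”; an observed series that falls outside the tolerance during an experiment is what a failure looks like in this simplified model.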

Page 25: Principles of Chaos Engineering

The “Happy Path”

• Trace through code where nothing bad happens

• Testing usually happens first on the happy path

• Bad things usually happen off the happy path

Source: https://bethtrissel.files.wordpress.com/2014/06/176869567.jpg

Page 26: Principles of Chaos Engineering

Four Principles of Chaos Engineering

1. Build a hypothesis around steady-state behaviour

2. Vary real-world events

Page 27: Principles of Chaos Engineering

Laboratory

• “Works on my machine” (or “works in stage env.”)

Source: http://www.memegasms.com/media/created/vhyfxm.jpg

Page 28: Principles of Chaos Engineering

Four Principles of Chaos Engineering

1. Build a hypothesis around steady-state behaviour

2. Vary real-world events

3. Run experiments in production

Page 29: Principles of Chaos Engineering

Four Principles of Chaos Engineering

1. Build a hypothesis around steady-state behaviour

2. Vary real-world events

3. Run experiments in production

4. Automate experiments to run continuously
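The four principles can be tied together in a small experiment runner. The sketch below is one possible shape, not Netflix tooling: `Experiment`, `run_once`, and `run_all` are hypothetical names, and skipping an experiment when the system is already unhealthy is a design assumption of this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Experiment:
    name: str
    steady_state: Callable[[], bool]  # hypothesis: True while the system looks normal
    inject: Callable[[], None]        # introduce the real-world event
    rollback: Callable[[], None]      # undo the injection


def run_once(exp: Experiment) -> str:
    """Run one experiment and report 'skipped', 'passed', or 'failed'."""
    # Assumption: do not add chaos to a system that is already unhealthy.
    if not exp.steady_state():
        return "skipped"
    exp.inject()
    try:
        return "passed" if exp.steady_state() else "failed"
    finally:
        # Always clean up, even if the steady-state check raises.
        exp.rollback()


def run_all(experiments: List[Experiment]) -> Dict[str, str]:
    """Run every experiment once and collect the results by name."""
    return {exp.name: run_once(exp) for exp in experiments}
```

Running continuously then means calling `run_all` from whatever scheduler is already in place and alerting on any `failed` result.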

Page 30: Principles of Chaos Engineering

Chaos Engineering Culture

• http://principlesofchaos.com

• More resources:

• https://github.com/Netflix/SimianArmy

• https://github.com/Netflix/atlas

• https://www.youtube.com/watch?v=vq4QZ4_YDok