our long road to….continuous improvement (devops days boston 2014)

Our long road to…. continuous improvement

Kevin Amorin

BitSight Team

15 yrs in Enterprise & Startups Enterprise Internal IT IT Virtualization Software SaaS low latency, high volume SaaS Big Data, Analytics

team: Issa Ashwash Isaac Boehman Sathya Ragavan Pavel Sadikov

Kevin

starting the journey

it’s a journey not a destination no silver bullet

“always make new mistakes” do not re-invent wheel continuous improvement

find and reduce the bottleneck (Lean)

Greenfield

first 100 days

ask questions and listen hiring (co-ops) Marketing the message

‘devops’

continuous improvement evaluate current condition & improve

evaluate and improve

(simple) value stream mapping minimal waste in: Design, Build & Sustain

find bottleneck, optimize flow, repeat

bottleneck:symptoms or cause?

improvement:simple yet difficult

infrastructure v1

what does this server do?symptom: lost time on debugging/managing products in data center

cause: organic growth, little planning

improvement: redesign infrastructure & process for naming (host+subnet+app), access (vpn), users, deployment

build v1

build is broken!symptom: lost time on broken build debugging

cause: lack of uniform build environment, committed code with minimal testing and review

improvement: centralized build & CI, branching, pull requests, additional unit tests

deployment v1

that system doesn’t have my fix.symptom: lost time debugging on wrong code

cause: lack of revisioned artifacts

improvement: central artifact repo each package with banch, time, commit

communication v1

what new feature?symptom: lost time with misinformation

cause: siloing of information

improvement: chat, single issue tracking & process, representatives in standup & retrospective

infrastructure v2

that system is missing pip module.symptom: lost time debugging misconfigured application & systems

cause: lack of consistency of systems & applications

improvement: config management, monitoring & alerting

build v2

ran in my local system…symptom: code freeze branch would not run end to end. lost time debugging which change caused issue

cause: lack of regression/functional test

improvement: functional tests & require it to run before merge

deployment v2

db is not correct?symptom: database schema/data did not match code

cause: inconsistent process & manual steps on schema/data updates

improvement: db schema management tool + process

deployment v3new server didn’t come up right?symptom: lost time with misconfigured or failed provisioned nodes in AWS

cause: inconsistent semi-automated provisioning steps did not have the flexibility needed for a growing product line

improvement: knife-bs provisioning & deploymenthttp://github.com/CBitLabs/knife-bs

http://github.com/CBitLabs/knife-bs

provisioning research & design

Netflix Asgard: Web interface for application deployments and cloud management in Amazon Web Services (AWS)

Infochimps Ironfan: Chef orchestration layer -- your system diagram come to life

Chef Knife-ec2: plugin gives knife the ability to create, bootstrap, and manage EC2 instances.

knife-bs

cloud provisioning tool build on top of opscode/knife-ec2. Using a description of your infrastructure and stacks (in either YAML or JSON), knife-bs will build correct the stack in the correct environment and bootstrap chef.

Describe: Infrastructure & Application

environmentregion

vpc subnet

stack profile profile

profile naming, type, volumes, raid, sg, raid,

spot_price, eip, ami… run_list

knife-bs examples

knife bs server create ame1.prod \ portal -count 10 -ebs 10 -eip -ami-id=xxxx

knife bs stack create ame1.stag \ hadoop flavor=c3.xlarge -spot-price=1.00

knife bs server delete ame1.stag tt

infrastructure v3

can I grab a cluster?symptom: who is using what, what state is it in?

cause: lack of visibility of ownership & state of application

improvement: infrastructure web UI which overlays org meta-data

http://github.com/CBitLabs/atlas

http://github.com/CBitLabs/atlas

Delivery Pipeline

Lessons Learned

‘devops’ messaging analyze current state use symptom to find cause find solution that fits KISS then scale

@[email protected]

img: Marc Cluet

our long road to….continuous improvement (devops days boston 2014)

Technology