Monitoring Overhaul Plan: A Process Perspective

Posted on 09-Jan-2017


William Yeh, DevOps Summit 2016 (2016-07-06)

Monitoring

Monitoring: a Process Perspective

CRT (Current Reality Tree)

[Figure: CRT diagram with nodes #1 to #11 joined by AND connectors; surviving labels: "best practice", "Murphy exists", "DevOps", "Op"]

http://www.slideshare.net/williamyeh/devops-63711710


[Figure: subset of CRT nodes: #7, #9, #1, #2, #11, #8, #4]

Risk management

• Threats
  • avoid
  • transfer
  • mitigate

• Opportunities
  • exploit
  • enhance
  • share

👍👎

http://www.slideshare.net/williamyeh/whoscall-realtime-monitoring


[Slide sequence: the words "Process" and "Monitoring" set against each other, with question marks between them]

#5

Part 2

Efrat Goldratt-Ashlag


What to change

To what to change

How to cause the change

[Slide sequence: from CRT (Current Reality Tree) to FRT (Future Reality Tree); labels: DevOps, leverage, TOC, CCPM]

Stephen R. Covey

What gets measured gets done.

Peter Drucker

Policy

Policy, Buy-in

What to change

To what to change

How to cause the change

Adrian Cockcroft

[Diagram: CloudFront → ELB → API servers → MongoDB, monitored with Cloud Manager, CloudWatch, log in S3, StatsD, BigQuery]

[Diagram: CloudFront → ELB → API servers → MongoDB, monitored with Cloud Manager, CloudWatch, log in S3, BigQuery (StatsD no longer shown)]

http://school.soft-arch.net/blog/125009/change-viewpoint-on-lord-of-rings

Lean Change Canvas

• Commitment
• Wins/Benefits
• Urgency
• Target State
• Success Criteria
• Vision
• Communication
• Action
• Change Recipients

FYI: http://kojenchieh.pixnet.net/blog/post/442550432-firstthing_of_agile_promotion

FYI: http://leankit.com/blog/2015/02/lean-change-method/

Augmented Lean Change Canvas

Monitoring Q1 (brainstorming), 2016-Jan-06, Iteration #1

• Urgency
• Target State
• Success Criteria
• Vision
• Communication
• Action
• TO DO LIST details

What to change

To what to change

How to cause the change


[Diagram: Monitoring with four aspects: Policy, Buy-in, Flow, Tech]

Flow

• TOC
• Lean Thinking
• CCPM

Lean Thinking

• Value
• Value stream
• Flow
• Pull
• Perfection

http://school.soft-arch.net/blog/115652/devops-a-lean-perspective

“The Three Ways”

• Create fast flow of work from Dev into IT Ops.
• Shorten and amplify feedback loops.
• Create a culture that simultaneously fosters 2 things:
  1. continual experimentation and learning from failure;
  2. the understanding that repetition and practice are the prerequisite to mastery.

CCPM: Critical Chain Project Management

[Diagram, simplified version: CloudFront → ELB → API servers → DB; label: VPC]

[Diagram, simplified version, microservices: ELB → API servers → DB]

Flow overview: Incoming requests → API servers → DB servers


Flow: TOC

[Diagram: Policy, Buy-in, Flow, Tech]


Tech

Personal Preferences

• Golang
• Microservices
• Composability
• OSS ecosystem

[Slide annotations across the builds: "of server technologies", "Runtime dependency", "william Ansible", "Scalability", "Overhead"]

Node/system metrics exporter, AWS CloudWatch exporter, Blackbox exporter, Collectd exporter, Consul exporter, Graphite exporter, HAProxy exporter, InfluxDB exporter, JMX exporter, Memcached exporter, Mesos task exporter, MySQL server exporter, SNMP exporter, StatsD exporter

cAdvisor, Doorman, Etcd, Kubernetes-Mesos, Kubernetes, RobustIRC, SkyDNS, Weave Flux

Aerospike exporter, Apache exporter, BIG-IP exporter, BIND exporter, Ceph exporter, CouchDB exporter, Django exporter, Google's mtail log data extractor, Heka dashboard exporter, Heka exporter, IoT Edison exporter, Jenkins exporter, knxd exporter, Meteor JS web framework exporter, Minecraft exporter module, Mirth Connect exporter, MongoDB exporter, Munin exporter, New Relic exporter, Nginx metric library, NSQ exporter, OpenWeatherMap exporter, Passenger exporter, PgBouncer exporter, PostgreSQL exporter, PowerDNS exporter, RabbitMQ exporter, RabbitMQ Management Plugin exporter, Rancher exporter, Redis exporter, RethinkDB exporter, rTorrent exporter, scollector exporter, SMTP/Maildir MDA blackbox prober, Speedtest.net exporter, SQL query result set metrics exporter, Ubiquiti UniFi exporter, Varnish exporter, Zookeeper exporter

[Diagram: CloudFront → ELB → API servers → MongoDB, monitored with Cloud Manager, CloudWatch, log in S3, StatsD, BigQuery]

[Diagram: ELB → API servers → MongoDB, monitored with Cloud Manager, CloudWatch]

Prometheus vs Graphite/StatsD

abs(), absent(), bottomk(), ceil(), changes(), clamp_max(), clamp_min(), count_scalar(), delta(), deriv(), drop_common_labels(), exp(), floor(), histogram_quantile(), holt_winters(), increase(), irate(), label_replace(), ln(), log2(), log10(), predict_linear(), rate(), resets(), round(), scalar(), sort(), sort_desc(), sqrt(), time(), topk(), vector(), <aggregation>_over_time()

[Chart: node_cpu plotted as number over time]

[Chart: node_cpu plotted as number over time, split by the mode label]

node_cpu{mode="idle"}

node_cpu{mode="irq"}

node_cpu{instance="10.0.37.12"}

node_cpu{service="web"}

node_cpu{zone="ap-northeast-1a"}
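The label selectors above can be read as equality filters over a set of time series. The following is a minimal sketch of that idea in plain Python, not how Prometheus stores or queries data; the second instance address and the "api" service value are made up for illustration, and `select` is a hypothetical helper.

```python
# Hypothetical in-memory series; only node_cpu, mode, instance="10.0.37.12"
# and service="web" come from the slides, the rest is invented for the demo.
series = [
    {"__name__": "node_cpu", "mode": "idle", "instance": "10.0.37.12", "service": "web"},
    {"__name__": "node_cpu", "mode": "irq", "instance": "10.0.37.12", "service": "web"},
    {"__name__": "node_cpu", "mode": "idle", "instance": "10.0.37.13", "service": "api"},
]


def select(all_series, name, **matchers):
    """Return the series whose metric name and labels equal every matcher."""
    return [
        s
        for s in all_series
        if s["__name__"] == name
        and all(s.get(label) == value for label, value in matchers.items())
    ]


idle = select(series, "node_cpu", mode="idle")                    # node_cpu{mode="idle"}
web_irq = select(series, "node_cpu", mode="irq", service="web")   # two matchers combined
print(len(idle), len(web_irq))
```

Each extra matcher narrows the result, which is why one metric name such as node_cpu can serve many dashboards.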

TCP Timeout

sum(
  irate(node_netstat_TcpExt_TCPTimeWaitOverflow[1m])
) by (ec2tag_Service)

• node_netstat_TcpExt_TCPTimeWaitOverflow[1m]: counter
• irate(...): counter → gauge
• sum(...): aggregate
• by (ec2tag_Service): grouping
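To make the counter → gauge → aggregate → grouping steps concrete, here is a rough Python sketch of what an irate-style calculation followed by a sum per label does. It is an approximation written for this transcript, not Prometheus code: `irate` and `sum_by` are hypothetical helpers, and all sample values are invented.

```python
from collections import defaultdict


def irate(samples):
    """Per-second rate from the last two (timestamp_s, value) samples of a counter."""
    if len(samples) < 2:
        return None  # not enough data in the window
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    # A lower value than before is treated as a counter reset: count from zero.
    increase = v1 if v1 < v0 else v1 - v0
    return increase / (t1 - t0)


def sum_by(rates, label):
    """Sum per-series rates grouped by one label, in the spirit of sum(...) by (label)."""
    totals = defaultdict(float)
    for labels, rate in rates:
        if rate is not None:
            totals[labels[label]] += rate
    return dict(totals)


# Invented scrape data: two "web" hosts and one "api" host, sampled every 15 s.
windows = [
    ({"ec2tag_Service": "web"}, [(0, 100), (15, 130), (30, 160)]),
    ({"ec2tag_Service": "web"}, [(0, 40), (15, 40), (30, 55)]),
    ({"ec2tag_Service": "api"}, [(0, 900), (15, 905), (30, 5)]),  # reset before last sample
]
rates = [(labels, irate(samples)) for labels, samples in windows]
result = sum_by(rates, "ec2tag_Service")
print(result)
```

The ever-growing counter becomes a per-second value that can go up and down, and the grouping step reduces many hosts to one number per service.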

Memory Used

avg(
  1 - node_memory_MemFree/node_memory_MemTotal
) by (ec2tag_Service)

• 1 - node_memory_MemFree/node_memory_MemTotal: gauge
• avg(...): aggregate
• by (ec2tag_Service): grouping
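Because the memory metrics are gauges, no rate step is needed: the ratio is computed per host and then averaged per service. A small sketch under that reading, with invented byte counts and a hypothetical helper name:

```python
from collections import defaultdict
from statistics import mean

# Invented gauge readings (bytes) for three hosts; "db" is a made-up service value.
hosts = [
    {"ec2tag_Service": "web", "MemFree": 2_000, "MemTotal": 8_000},
    {"ec2tag_Service": "web", "MemFree": 6_000, "MemTotal": 8_000},
    {"ec2tag_Service": "db", "MemFree": 1_000, "MemTotal": 16_000},
]


def memory_used_by_service(readings):
    """Average of (1 - MemFree/MemTotal) per service."""
    groups = defaultdict(list)
    for host in readings:
        used_ratio = 1 - host["MemFree"] / host["MemTotal"]
        groups[host["ec2tag_Service"]].append(used_ratio)
    return {service: mean(ratios) for service, ratios in groups.items()}


used = memory_used_by_service(hosts)
print(used)
```

Averaging hides outliers, so a max per service would be a reasonable alternative when a single full host matters more than the typical one.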

CPU Utilization

100 - (
  avg by (ec2tag_Service) (
    irate(node_cpu{job="node", mode="idle"}[1m])
  ) * 100
)

• node_cpu{job="node", mode="idle"}[1m]: counter
• irate(...): counter → gauge
• avg by (ec2tag_Service) (...): aggregate
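The arithmetic behind this query is: take the idle-seconds counter, turn it into the fraction of each second spent idle, average that across CPUs, and subtract from 100. The sketch below walks through those steps with invented numbers; it assumes the counter holds cumulative idle seconds per CPU, and `cpu_utilization_percent` is a hypothetical helper.

```python
def cpu_utilization_percent(idle_windows):
    """idle_windows: one ((t0, idle0), (t1, idle1)) pair per CPU, times in seconds."""
    idle_fractions = []
    for (t0, idle0), (t1, idle1) in idle_windows:
        # Idle seconds gained per wall-clock second, a value between 0 and 1.
        idle_fractions.append((idle1 - idle0) / (t1 - t0))
    avg_idle = sum(idle_fractions) / len(idle_fractions)
    return 100 - avg_idle * 100


# Invented samples: two CPUs of one service, scraped 10 s apart.
windows = [
    ((0, 1000.0), (10, 1009.0)),  # idle for 9 of 10 s
    ((0, 2000.0), (10, 2006.0)),  # idle for 6 of 10 s
]
usage = cpu_utilization_percent(windows)
print(round(usage, 1))
```

With these numbers the average idle fraction is 0.75, so utilization comes out at roughly 25 percent.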

Latency

avg(
  request_time_summary
) by (ec2tag_Service, quantile)

• request_time_summary: summary
• avg(...): aggregate
• by (ec2tag_Service, quantile): grouping

Customized metrics with Fluentd plugin for Prometheus

Conclusion

[Figure: subset of CRT nodes: #7, #9, #1, #2, #11, #8, #4]

• Policy
• Buy-in
• Flow
• Tech

???

Issue tracking

