Prometheus on AWS

Upload: mitsuhiro-tanda

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Prometheus on AWS

Prometheus on AWS

Page 2: Prometheus on AWS

About me

• Mitsuhiro Tanda
• Infrastructure Engineer @ GREE
• Use Prometheus on AWS (1 year)
• Grafana committer
• @mtanda

Page 3: Prometheus on AWS

Features

• multi-dimensional data model
• flexible query language
• pull model over HTTP
• service discovery
• Prometheus values reliability

Page 4: Prometheus on AWS

AWS Monitoring Problems

• Instance lifecycle is short
• Instances are launched/terminated by ASG
• Instance workload is not the same across AZs, …

Page 5: Prometheus on AWS

Why we use Prometheus

• multi-dimensional data model & flexible query language
– aggregate metrics by Role/AZ, and compare the results
– detect instances whose workload differs from the others in the same Role
• pull model over HTTP & service discovery
– specify monitoring targets by Role, ...
– easily adapt to an increase in monitoring targets

Page 6: Prometheus on AWS

multi-dimensional data model

• record instance metadata to labels

key                          value
instance_id                  i-1234abcd
instance_type                ec2, rds, elasticache, elb, …
instance_model               t2.large, m4.large, c4.large, r3.large, …
region                       ap-northeast-1, us-east-1, …
availability_zone            ap-northeast-1a, ap-northeast-1c, …
role (instance tag)          web, db, …
environment (instance tag)   production, staging, …

Page 7: Prometheus on AWS

avg(cpu) by (availability_zone)

Page 8: Prometheus on AWS

cpu{role="web"}

Page 9: Prometheus on AWS

avg(cpu) by (role)
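
The three queries above can be read as plain operations on labeled samples. The following is a minimal Python sketch of what they compute, not how Prometheus evaluates them internally. The sample data, instance IDs, and the helper names select and avg_by are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical instant samples of a "cpu" metric: (labels, value) pairs.
samples = [
    ({"instance_id": "i-0001", "role": "web", "availability_zone": "ap-northeast-1a"}, 40.0),
    ({"instance_id": "i-0002", "role": "web", "availability_zone": "ap-northeast-1c"}, 60.0),
    ({"instance_id": "i-0003", "role": "db", "availability_zone": "ap-northeast-1a"}, 20.0),
    ({"instance_id": "i-0004", "role": "db", "availability_zone": "ap-northeast-1c"}, 20.0),
]


def select(samples, **matchers):
    """Keep samples whose labels equal every matcher, like cpu{role="web"}."""
    return [
        (labels, value)
        for labels, value in samples
        if all(labels.get(name) == want for name, want in matchers.items())
    ]


def avg_by(samples, label):
    """Average values grouped by one label, like avg(cpu) by (role)."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[labels.get(label, "")].append(value)
    return {key: mean(values) for key, values in groups.items()}


print(avg_by(samples, "availability_zone"))
print(len(select(samples, role="web")))
print(avg_by(samples, "role"))
```

Because role and availability_zone are recorded as labels, the same samples can be grouped either way without changing what the exporter emits.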

Page 10: Prometheus on AWS

Service Discovery

• auto-detects monitoring targets
• Prometheus provides several SD mechanisms
– ec2_sd, consul_sd, kubernetes_sd, file_sd
• (a fundamental feature for a pull architecture)

Page 11: Prometheus on AWS

ec2_sd

• detect monitoring targets with the ec2:DescribeInstances API
• specify monitoring targets by AZ, Instance Tags, ...
• example setting for specifying Web Role targets

- job_name: 'job_name'
  ec2_sd_configs:
    - region: ap-northeast-1
      port: 9100
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Role]
      regex: web.*
      action: keep
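
The keep rule in the setting above can be sketched in Python as follows. This only illustrates the matching logic; it is not Prometheus code. The target list and the function name keep are hypothetical. The sketch relies on two documented relabeling behaviors: the values of source_labels are joined with ";" by default, and the regex is anchored at both ends, so web.* matches "web" and "web-api" but not "my-web".

```python
import re


def keep(targets, source_labels, regex, separator=";"):
    """Return only the targets whose joined source label values fully match regex."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # Missing labels are treated as empty strings.
        value = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(value):  # anchored match, like a relabel regex
            kept.append(labels)
    return kept


# Hypothetical discovered targets, as ec2_sd might label them.
discovered = [
    {"__address__": "10.0.0.1:9100", "__meta_ec2_tag_Role": "web"},
    {"__address__": "10.0.0.2:9100", "__meta_ec2_tag_Role": "web-api"},
    {"__address__": "10.0.0.3:9100", "__meta_ec2_tag_Role": "db"},
    {"__address__": "10.0.0.4:9100", "__meta_ec2_tag_Role": "my-web"},
]

web_targets = keep(discovered, ["__meta_ec2_tag_Role"], r"web.*")
print([t["__address__"] for t in web_targets])
```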

Page 12: Prometheus on AWS

How we deploy settings

[Diagram. Labels: edit, pack, upload, deploy; Prometheus (for web) monitors Role=web; Prometheus (for db) monitors Role=db]

This logo belongs to the Jenkins project (https://jenkins.io/).

Page 13: Prometheus on AWS

CloudWatch support

• We store CloudWatch metrics in Prometheus
• We don't use cloudwatch_exporter, because it depends on Java
• We built an in-house CloudWatch exporter with aws-sdk-go
• Recording timestamps causes some problems
– CloudWatch metrics are emitted with a delay of several minutes
– Prometheus treats those metrics as stale and drops them
– I gave up recording timestamps for some metrics
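
The timestamp problem can be illustrated with a small Python sketch. It assumes a 5-minute lookback window, which I believe was the default staleness setting in Prometheus at the time; treat that value, and the function name visible_at, as assumptions. A sample is only returned by a query if its recorded timestamp falls inside the window before the evaluation time, so a metric that arrives several minutes late with its original timestamp can already be too old.

```python
STALENESS_DELTA_S = 5 * 60  # assumed lookback window, in seconds


def visible_at(sample_ts, eval_ts, staleness_delta=STALENESS_DELTA_S):
    """Return True if a sample recorded at sample_ts is still fresh at eval_ts."""
    age = eval_ts - sample_ts
    return 0 <= age <= staleness_delta


now = 1_000_000  # hypothetical evaluation time (Unix seconds)

# Sample exposed with its original CloudWatch timestamp, 7 minutes old.
print(visible_at(now - 7 * 60, now))  # False: treated as stale

# Same sample exposed without a timestamp, so the scrape time is used.
print(visible_at(now - 15, now))  # True: fresh
```

Dropping the timestamp keeps the metric visible, at the cost of the value being attributed to the scrape time rather than the time CloudWatch measured it.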

Page 14: Prometheus on AWS

Instance Spec we use

• We use t2.micro to t2.medium instances
• We use gp2 EBS volumes of 50-100GB
• For 50-100 monitoring targets, t2.medium is enough
• I recommend t2.small or larger
– t2.micro does not have enough memory
– storage.local.memory-chunks needs to be changed
• Sudden load increases can be handled by burst
– t2 instance burst
– EBS (gp2) burst

Page 15: Prometheus on AWS

Disk write workload

Page 16: Prometheus on AWS

Disk usage

• calculated per monitored instance
• We have 150-300 metrics per instance
• scrape interval is 15 seconds
• Disk usage is approximately 200MB per month
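
As a rough cross-check of these figures, the Python sketch below works out how many samples one instance produces per month and what 200MB implies per sample. It assumes a 30-day month, 1MB = 10^6 bytes, and one time series per metric; those assumptions and the function names are mine, not from the slides.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumed 30-day month
DISK_BYTES = 200 * 10**6               # 200MB, assuming 1MB = 10^6 bytes


def samples_per_month(series, scrape_interval_s):
    """Number of samples one instance produces in a month."""
    return series * SECONDS_PER_MONTH // scrape_interval_s


def bytes_per_sample(disk_bytes, series, scrape_interval_s):
    """Average on-disk bytes per sample implied by the observed disk usage."""
    return disk_bytes / samples_per_month(series, scrape_interval_s)


for series in (150, 300):
    total = samples_per_month(series, 15)
    per_sample = bytes_per_sample(DISK_BYTES, series, 15)
    print(f"{series} series: {total} samples/month, {per_sample:.1f} bytes/sample")
```

Under these assumptions the reported usage works out to a few bytes per sample, roughly 4 to 8 depending on where in the 150-300 range an instance falls.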

Page 17: Prometheus on AWS

Long term metrics storage

• Prometheus doesn't support summarizing metrics the way rrdtool does
• The data size becomes large if you set a long retention period
• The default retention period is 15 days
• Prometheus is not designed for long-term metrics storage
• To store metrics for the long term
– Use remote storage (e.g. Graphite)
– Launch another Prometheus for long-term storage, and store summarized metrics data there (we created a metrics-summarizing exporter)
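
The slides do not show how the summarizing exporter works. As an illustration of the general idea only, the Python sketch below averages raw samples into fixed time windows, similar in spirit to rrdtool consolidation. The function name downsample, the window size, and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, window_s):
    """Average (timestamp, value) samples into fixed windows of window_s seconds.

    Returns one (window_start, average) pair per non-empty window, in time order.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % window_s].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical raw samples scraped every 15 seconds.
raw = [(0, 1.0), (15, 3.0), (30, 2.0), (3600, 10.0), (3615, 20.0)]

# One averaged point per hour is far cheaper to retain than 240 raw points.
print(downsample(raw, 3600))
```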

Page 18: Prometheus on AWS

After using it for 1 year

• daily operation
– Prometheus workload is very stable
– almost no operation is required
• upgrading Prometheus
– the configuration file needs to be changed when its format changes
– breaking changes will keep coming until version 1.0
• supporting new middleware as monitoring targets
– create an exporter for each middleware
– thanks to Prometheus's powerful query language, exporters can be kept very simple