TRANSCRIPT
Prometheus on AWS
About me
• Mitsuhiro Tanda
• Infrastructure Engineer @ GREE
• Have used Prometheus on AWS for 1 year
• Grafana committer
• @mtanda
Features
• multi-dimensional data model
• flexible query language
• pull model over HTTP
• service discovery
• Prometheus values reliability
AWS Monitoring Problems
• Instance lifecycles are short
• Instances are launched/terminated by ASG
• Instance workload is not the same among AZs, …
Why we use Prometheus
• multi-dimensional data model & flexible query language
– aggregate metrics by Role/AZ, and compare the results
– detect instances whose workload differs from the rest of the Role
• pull model over HTTP & service discovery
– specify monitoring targets by Role, ...
– easily adapt to increases in monitoring targets
multi-dimensional data model
• record instance metadata as labels

instance_id: i-1234abcd
instance_type: ec2, rds, elasticache, elb, …
instance_model: t2.large, m4.large, c4.large, r3.large, …
region: ap-northeast-1, us-east-1, …
availability_zone: ap-northeast-1a, ap-northeast-1c, …
role (instance tag): web, db, …
environment (instance tag): production, staging, …
avg(cpu) by (availability_zone)
cpu{role="web"}
avg(cpu) by (role)
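To detect an instance whose workload differs from the rest of its role, the per-role average can serve as a baseline. A hypothetical query in this spirit (the 1.5x threshold and metric name are illustrative, not from the talk):

```
cpu{role="web"} > 1.5 * scalar(avg(cpu{role="web"}))
```

This returns only the web instances whose CPU is more than 1.5 times the role average.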
Service Discovery
• auto-detect monitoring targets
• Prometheus provides several SD mechanisms
– ec2_sd, consul_sd, kubernetes_sd, file_sd
• (a fundamental feature for the pull architecture)
ec2_sd
• detects monitoring targets via the ec2:DescribeInstances API
• specify monitoring targets by AZ, instance tags, ...
• example setting for targeting the Web role
- job_name: 'job_name'
  ec2_sd_configs:
    - region: ap-northeast-1
      port: 9100
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Role]
      regex: web.*
      action: keep
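The same relabel_configs section can also copy EC2 metadata into the labels shown earlier. A sketch of such a mapping, assuming the standard ec2_sd meta labels (__meta_ec2_availability_zone, __meta_ec2_instance_id, __meta_ec2_tag_Role):

```
relabel_configs:
  # keep only instances whose Role tag matches web.*
  - source_labels: [__meta_ec2_tag_Role]
    regex: web.*
    action: keep
  # copy EC2 metadata into regular labels (default action is replace)
  - source_labels: [__meta_ec2_tag_Role]
    target_label: role
  - source_labels: [__meta_ec2_availability_zone]
    target_label: availability_zone
  - source_labels: [__meta_ec2_instance_id]
    target_label: instance_id
```

With this, every scraped series carries role, availability_zone, and instance_id, which is what makes the aggregation queries above possible.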
How we deploy settings

[Diagram: configuration is edited, packed, uploaded, and deployed through a Jenkins pipeline to per-role Prometheus servers — Prometheus (for web) scraping Role=web instances, and Prometheus (for db) scraping Role=db instances.]

This logo belongs to the Jenkins project (https://jenkins.io/).
CloudWatch support
• We store CloudWatch metrics in Prometheus
• We don't use cloudwatch_exporter, because it depends on Java
• We created an in-house CloudWatch exporter with aws-sdk-go
• Recording timestamps causes some problems
– CloudWatch metric emission is delayed by several minutes
– Prometheus treats the metrics as stale and drops them
– I gave up recording timestamps for some metrics
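The staleness problem can be illustrated with the Prometheus text exposition format (metric name and value are made up):

```
# without a timestamp, the scrape time is used
aws_elb_request_count{load_balancer="web"} 42
# with an explicit timestamp (milliseconds since epoch); if this lags
# several minutes behind the scrape time, Prometheus 1.x may treat the
# sample as stale and drop it
aws_elb_request_count{load_balancer="web"} 42 1459468800000
```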
Instance Spec we use
• use t2.micro - t2.medium instances
• use gp2 EBS, volume size 50-100GB
• if the number of monitoring targets is 50-100, t2.medium is enough to monitor them
• I recommend t2.small or larger
– t2.micro's memory is not enough
– you need to change storage.local.memory-chunks
• sudden load increases can be handled by bursting
– t2 instance burst
– EBS (gp2) burst
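On a memory-constrained instance the default chunk cache may not fit; a sketch of the flag adjustment for Prometheus 1.x (the value is illustrative — each chunk occupies roughly 1KiB, so 262144 chunks reserve about 256MiB):

```
prometheus -config.file=prometheus.yml \
  -storage.local.memory-chunks=262144
```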
Disk write workload
Disk usage
• calculated per monitored instance
• we have 150 - 300 metrics per instance
• scrape interval is 15 seconds
• disk usage comes to approximately 200MB per month
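These figures are roughly self-consistent; a back-of-the-envelope check (the bytes-per-sample figure is an assumed ballpark for the 1.x storage format, not from the talk):

```
300 series/instance × (86400s / 15s) = 1,728,000 samples/day
1,728,000 × 30 days ≈ 52 million samples/month
52M samples × ~4 bytes/sample ≈ 200MB per instance per month
```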
Long term metrics storage
• Prometheus doesn't support summarizing metrics like rrdtool does
• the data size becomes large if you set a long retention period
• the default retention period is 15 days
• Prometheus is not designed for long-term metrics storage
• to store metrics for the long term:
– use remote storage (e.g. Graphite)
– launch another Prometheus for long-term storage, and store summarized metrics data (we created a metrics-summarizing exporter)
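One way to produce summarized metrics for the long-term Prometheus is a recording rule; a sketch in the 1.x rules-file format (the rule name and expression are illustrative):

```
role:cpu:avg = avg(cpu) by (role)
```

The long-term server can then collect only these pre-aggregated series, for example by scraping the /federate endpoint of the short-term server, instead of storing every raw series.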
Using it for 1 year
• daily operation
– Prometheus workload is very stable
– mostly no operation required
• upgrading Prometheus
– need to change the configuration file due to format changes
– breaking changes will keep coming until version 1.0
• supporting new middleware as monitoring targets
– create an exporter for each middleware
– thanks to Prometheus's powerful query language, exporters can be very simple
Reference URL
• http://www.robustperception.io/automatically-monitoring-ec2-instances/
• http://www.robustperception.io/how-to-have-labels-for-machine-roles/
• http://www.robustperception.io/life-of-a-label/
• http://www.slideshare.net/FabianReinartz/prometheus-storage-57557499