TRANSCRIPT
Prometheus on AWS
About me
• Mitsuhiro Tanda
• Infrastructure Engineer @ GREE
• Have used Prometheus on AWS for 1 year
• Grafana committer
• @mtanda
Features
• multi-dimensional data model
• flexible query language
• pull model over HTTP
• service discovery
• Prometheus values reliability
AWS Monitoring Problems
• Instance lifecycles are short
• Instances are launched/terminated by ASG
• Instance workload is not the same among AZs, …
Why we use Prometheus
• multi-dimensional data model & flexible query language
– aggregate metrics by Role/AZ, and compare the results
– detect instances whose workload differs from the rest of the Role
• pull model over HTTP & service discovery
– specify monitoring targets by Role, ...
– easily adapt to increases in monitoring targets
multi-dimensional data model
• record instance metadata as labels

instance_id: i-1234abcd
instance_type: ec2, rds, elasticache, elb, …
instance_model: t2.large, m4.large, c4.large, r3.large, …
region: ap-northeast-1, us-east-1, …
availability_zone: ap-northeast-1a, ap-northeast-1c, …
role (instance tag): web, db, …
environment (instance tag): production, staging, …
avg(cpu) by (availability_zone)
cpu{role="web"}
avg(cpu) by (role)
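To detect an instance whose workload differs from the rest of its role, the per-role average can serve as a baseline. A hypothetical query in this spirit (the 1.5x threshold and metric name are illustrative, not from the talk):

```
cpu{role="web"} > 1.5 * scalar(avg(cpu{role="web"}))
```

This returns only the web instances whose CPU is more than 1.5 times the role average.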
Service Discovery
• auto-detect monitoring targets
• Prometheus provides several SD mechanisms
– ec2_sd, consul_sd, kubernetes_sd, file_sd
• (a fundamental feature for the pull architecture)
ec2_sd
• detects monitoring targets via the ec2:DescribeInstances API
• specify monitoring targets by AZ, instance tags, ...
• example setting for targeting the Web role
- job_name: 'job_name'
  ec2_sd_configs:
    - region: ap-northeast-1
      port: 9100
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Role]
      regex: web.*
      action: keep
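The same relabel_configs section can also copy EC2 metadata into the labels shown earlier. A sketch of such a mapping, assuming the standard ec2_sd meta labels (__meta_ec2_availability_zone, __meta_ec2_instance_id, __meta_ec2_tag_Role):

```
relabel_configs:
  # keep only instances whose Role tag matches web.*
  - source_labels: [__meta_ec2_tag_Role]
    regex: web.*
    action: keep
  # copy EC2 metadata into regular labels (default action is replace)
  - source_labels: [__meta_ec2_tag_Role]
    target_label: role
  - source_labels: [__meta_ec2_availability_zone]
    target_label: availability_zone
  - source_labels: [__meta_ec2_instance_id]
    target_label: instance_id
```

With this, every scraped series carries role, availability_zone, and instance_id, which is what makes the aggregation queries above possible.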
How we deploy settings

[Diagram: configuration is edited, packed, uploaded, and deployed through a Jenkins pipeline to per-role Prometheus servers — Prometheus (for web) scraping Role=web instances, and Prometheus (for db) scraping Role=db instances.]

This logo belongs to the Jenkins project (https://jenkins.io/).
CloudWatch support
• We store CloudWatch metrics in Prometheus
• We don't use cloudwatch_exporter, because it depends on Java
• We created an in-house CloudWatch exporter with aws-sdk-go
• Recording timestamps causes some problems
– CloudWatch metric emission is delayed by several minutes
– Prometheus treats the metrics as stale and drops them
– I gave up recording timestamps for some metrics
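The staleness problem can be illustrated with the Prometheus text exposition format (metric name and value are made up):

```
# without a timestamp, the scrape time is used
aws_elb_request_count{load_balancer="web"} 42
# with an explicit timestamp (milliseconds since epoch); if this lags
# several minutes behind the scrape time, Prometheus 1.x may treat the
# sample as stale and drop it
aws_elb_request_count{load_balancer="web"} 42 1459468800000
```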
Instance Spec we use
• use t2.micro - t2.medium instances
• use gp2 EBS, volume size 50-100GB
• if the number of monitoring targets is 50-100, t2.medium is enough to monitor them
• I recommend t2.small or larger
– t2.micro's memory is not enough
– you need to change storage.local.memory-chunks
• sudden load increases can be handled by bursting
– t2 instance burst
– EBS (gp2) burst
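On a memory-constrained instance the default chunk cache may not fit; a sketch of the flag adjustment for Prometheus 1.x (the value is illustrative — each chunk occupies roughly 1KiB, so 262144 chunks reserve about 256MiB):

```
prometheus -config.file=prometheus.yml \
  -storage.local.memory-chunks=262144
```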
Disk write workload
Disk usage
• calculated per monitored instance
• we have 150 - 300 metrics per instance
• scrape interval is 15 seconds
• disk usage comes to approximately 200MB per month
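These figures are roughly self-consistent; a back-of-the-envelope check (the bytes-per-sample figure is an assumed ballpark for the 1.x storage format, not from the talk):

```
300 series/instance × (86400s / 15s) = 1,728,000 samples/day
1,728,000 × 30 days ≈ 52 million samples/month
52M samples × ~4 bytes/sample ≈ 200MB per instance per month
```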
Long term metrics storage
• Prometheus doesn't support summarizing metrics like rrdtool does
• the data size becomes large if you set a long retention period
• the default retention period is 15 days
• Prometheus is not designed for long-term metrics storage
• to store metrics for the long term:
– use remote storage (e.g. Graphite)
– launch another Prometheus for long-term storage, and store summarized metrics data (we created a metrics-summarizing exporter)
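One way to produce summarized metrics for the long-term Prometheus is a recording rule; a sketch in the 1.x rules-file format (the rule name and expression are illustrative):

```
role:cpu:avg = avg(cpu) by (role)
```

The long-term server can then collect only these pre-aggregated series, for example by scraping the /federate endpoint of the short-term server, instead of storing every raw series.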
Using it for 1 year
• daily operation
– Prometheus workload is very stable
– mostly no operation required
• upgrading Prometheus
– need to change the configuration file due to format changes
– breaking changes will keep coming until version 1.0
• supporting new middleware as monitoring targets
– create an exporter for each middleware
– thanks to Prometheus's powerful query language, exporters can be very simple
Reference URL
• http://www.robustperception.io/automatically-monitoring-ec2-instances/
• http://www.robustperception.io/how-to-have-labels-for-machine-roles/
• http://www.robustperception.io/life-of-a-label/
• http://www.slideshare.net/FabianReinartz/prometheus-storage-57557499