Overprov: A Tool for Cluster Overprovisioning Detection
Del Bao
Problem
ad_backend cpu.idle uswest2-prod
Problem (2)
bizfeed oldgen GC count per day
Problem (3)
generic cassandra byte_percentfree
What does the tool do?
Design Goals
• save cost in the long run
• based on simple rules
• eliminate false positives
• extensible
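One way "simple rules" and "eliminate false positives" can fit together is a rule that flags a cluster only when the condition holds on every day of the window, not just on average. The sketch below assumes that interpretation; the function name `is_overprovisioned` and the threshold are hypothetical, not taken from overprov.

```python
from typing import Sequence


def is_overprovisioned(daily_values: Sequence[float], threshold: float) -> bool:
    """Hypothetical rule: flag only if every daily value clears the threshold.

    Requiring the condition on all days (rather than on the mean) trades
    some recall for fewer false positives.
    """
    if not daily_values:
        # No data: never flag, since a missing signal is not evidence.
        return False
    return all(v >= threshold for v in daily_values)


# Hypothetical daily minimum cpu.idle percentages over one week.
print(is_overprovisioned([72.0, 75.5, 70.1, 80.3, 74.0, 71.2, 73.8], threshold=60.0))  # True
print(is_overprovisioned([72.0, 75.5, 41.0, 80.3, 74.0, 71.2, 73.8], threshold=60.0))  # False
```

A single busy day (41% idle) is enough to suppress the flag, which is the conservative behavior the design goals ask for.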
Code Structure
● run()
    for cluster_name in clusters:
        # build and run one detector per cluster
        dt = detector.ClusterOverprovDetector(
            product,
            ecosystem,
            cluster_name,
            metric_list,
            start,
            stop,
            signalfx_auth_token,
        )
        dt.execute()
● metric_list
    metric_list_cass = [
        ModuleClass('overprov.analyzers.cpu_idle_analyzer', 'CpuIdleAnalyzer'),
        ModuleClass('overprov.analyzers.cass_gc_count_analyzer', 'CassGcCountAnalyzer'),
        ModuleClass('overprov.analyzers.cass_disk_free_analyzer', 'CassDiskFreeAnalyzer'),
    ]
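The metric_list entries name each analyzer by module path and class name instead of importing it directly. A minimal sketch of how such a pair could be resolved to a class with `importlib`, assuming `ModuleClass` is a plain namedtuple (its real definition is not shown in the slides); `load_analyzer` is a hypothetical helper.

```python
import importlib
from collections import namedtuple

# Assumed shape: a (module path, class name) pair, as in metric_list_cass.
ModuleClass = namedtuple("ModuleClass", ["module", "class_name"])


def load_analyzer(spec: ModuleClass) -> type:
    """Import spec.module and return the attribute named spec.class_name."""
    module = importlib.import_module(spec.module)
    return getattr(module, spec.class_name)


# Demonstrated with a standard-library class so the sketch runs anywhere;
# in overprov the spec would name e.g. overprov.analyzers.cpu_idle_analyzer.
cls = load_analyzer(ModuleClass("collections", "OrderedDict"))
print(cls.__name__)  # OrderedDict
```

Deferring the import this way lets a product's analyzer list be plain data, so adding an analyzer means adding one entry rather than touching the detector.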
You can extend it:
• create your own analyzer
• pass in your own start and stop days
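A custom analyzer might look like the sketch below, which flags a cluster whose disk stays mostly free (compare the byte_percentfree problem above). The interface, a threshold field plus an `analyze()` method taking per-host daily values, is an assumption for illustration; the real overprov analyzer base class is not shown in the slides, and `HypotheticalDiskFreeAnalyzer` is a made-up name.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HypotheticalDiskFreeAnalyzer:
    """Hypothetical analyzer: flags a cluster whose disk stays mostly free."""

    min_percent_free: float = 70.0

    def analyze(self, host_series: Dict[str, List[float]]) -> bool:
        # Summarize each host by its lowest daily percent-free value.
        per_host_min = [min(values) for values in host_series.values() if values]
        if not per_host_min:
            # No data for any host: do not flag.
            return False
        # Flag only if even the fullest host stays above the threshold.
        return min(per_host_min) >= self.min_percent_free


analyzer = HypotheticalDiskFreeAnalyzer()
print(analyzer.analyze({"host-a": [85.0, 82.5], "host-b": [78.0, 80.1]}))  # True
print(analyzer.analyze({"host-a": [85.0], "host-b": [40.0]}))  # False
```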
Assumptions
• static check, so daily/hourly resolution (e.g., a p95 aggregate) is fine.
• the cluster is nearly balanced, so taking the max/min across cluster hosts in a region represents the entire cluster
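Under these two assumptions, each host can be reduced to one percentile per period, and the cluster to the min/max of those values. The sketch below uses a nearest-rank p95 computed locally; in practice the percentile is presumably computed by the metrics backend, so the method here is an assumption, and `p95` and `cluster_extremes` are hypothetical names.

```python
from typing import Dict, List, Tuple


def p95(samples: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding float rounding at the boundary.
    rank = -(-95 * len(ordered) // 100)  # 1-based nearest rank
    return ordered[rank - 1]


def cluster_extremes(host_samples: Dict[str, List[float]]) -> Tuple[float, float]:
    """Summarize each host by its p95, then take (min, max) across hosts."""
    per_host = [p95(s) for s in host_samples.values() if s]
    return min(per_host), max(per_host)


print(cluster_extremes({"host-a": [1.0, 2.0, 3.0], "host-b": [5.0, 6.0, 7.0]}))  # (3.0, 7.0)
```

Because the cluster is assumed to be roughly balanced, the extreme host stands in for the whole cluster; for cpu.idle, for example, the minimum across hosts is the conservative value to test.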
What It’s Not
• Fleetmiser
  – instantaneous autoscaling of spot fleets for seagull clusters
  – acts on a signal at a 10-minute interval
• Paasta
  – similar to the above, but only for Paasta services
Demo
• virtualenv_run/bin/overprov -p cassandra -c ad_backend --start 60 --stop 30 -e prod -k ./api_token
• virtualenv_run/bin/overprov -p cassandra -c ad_backend --start 60 --stop 30 -e prod -k ./api_token --debug
• virtualenv_run/bin/overprov -p elasticsearch -c ads144 --start 60 --stop 30 -e prod -k ./api_token
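The demo commands suggest a command line of roughly the following shape. This argparse sketch is hypothetical: only the short flags, `--start`, `--stop`, and `--debug` appear in the slides, while the `dest` names and help text are assumptions, as is reading `--start 60 --stop 30` as "from 60 days ago to 30 days ago" (consistent with the "start, stop day" extension point).

```python
import argparse
from datetime import datetime, timedelta
from typing import Tuple


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the demo commands."""
    parser = argparse.ArgumentParser(prog="overprov")
    parser.add_argument("-p", dest="product", required=True, help="e.g. cassandra, elasticsearch")
    parser.add_argument("-c", dest="cluster", required=True, help="cluster name, e.g. ad_backend")
    parser.add_argument("-e", dest="ecosystem", required=True, help="e.g. prod")
    parser.add_argument("-k", dest="token_file", required=True, help="path to the API token file")
    parser.add_argument("--start", type=int, required=True, help="window start, in days ago (assumed)")
    parser.add_argument("--stop", type=int, required=True, help="window end, in days ago (assumed)")
    parser.add_argument("--debug", action="store_true")
    return parser


def time_window(start_days_ago: int, stop_days_ago: int, now: datetime) -> Tuple[datetime, datetime]:
    """Convert 'days ago' offsets into absolute (start, stop) datetimes."""
    if start_days_ago <= stop_days_ago:
        raise ValueError("--start must be further in the past than --stop")
    return now - timedelta(days=start_days_ago), now - timedelta(days=stop_days_ago)


args = build_parser().parse_args(
    ["-p", "cassandra", "-c", "ad_backend", "--start", "60", "--stop", "30",
     "-e", "prod", "-k", "./api_token"]
)
start, stop = time_window(args.start, args.stop, datetime(2020, 1, 31))
print(start.date(), stop.date())  # 2019-12-02 2020-01-01
```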
Questions