How logging makes a private cloud a better cloud
Dec/01/2016
Kentaro Sasaki
Global Operations Department, Rakuten, Inc.
Rakuten is …a Tokyo-based e-commerce and Internet company
Rakuten Ecosystem
The Rakuten Ecosystem and our membership database form the foundation of our business
Membership: 116.52 million members
Gross Transaction Volume: 7.6 trillion JPY
Logging Infrastructure for Private Cloud
Private Cloud at Rakuten
Timeline of Private Cloud History
Gen1 (2010): Hypervisor: Xen / OS instances: 2,000+ / Management features built from scratch
Gen2 (2012): Hypervisor: VMware ESXi / OS instances: 25,000+ / Management features built from scratch
Gen3 (2015): Hypervisor: KVM / Uses the OpenStack API
Logging Matters
Benefits
Logging enables log visualization
Makes analysis and debugging easier

From a business point of view
Shortens the time spent on troubleshooting
Leads to better customer support
Assumptions
Messages might be unmanageable
A growing volume of logs requires huge log storage

Concerns
How to handle data loss
How to parse data from different sources
Log Management
High Availability
Availability, redundancy and scalability

Maintainability
Minimal data loss and operational overhead
Huge Number of Targets
Hundreds of hypervisors (ESXi & KVM)
Tens of thousands of VMs

Covers many sorts of logs
Splunk is suited for log analytics
Need a time-series DB (InfluxDB) for performance logs
Overview of Our Logging Infrastructure
Logging Infrastructure
[Architecture diagram] Two pipelines share the same backbone:
Event logs: Fluentd -> Kafka -> Splunk, with alerting via PagerDuty
Performance logs: Fluentd / Metricbeat -> Kafka -> InfluxDB & Grafana
Important logs are also backed up to Google Cloud Storage; Cloud Foundry appears among the log sources
Event Logging Infrastructure
Event Logs in OpenStack
Huge number of log files
22 log files in a single cluster
Manage logs across all regions and availability zones

Managing unmanageable logs
CRITICAL messages are hard to manage
Need a strong analytical storage engine

Component   # Log files
Nova        8
Keystone    1
Neutron     6
Glance      2
Cinder      5
etc.        etc.
2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist...
2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder
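A hedged sketch (not our production parser) of how a log line in the OpenStack format above, "timestamp pid LEVEL component message", can be split into named fields before indexing; the pattern and function name are illustrative:

```python
import re

# Illustrative pattern for the OpenStack oslo-style log lines shown above.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<component>\S+) (?P<message>.*)"
)

def parse_openstack_log(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None
```

Structuring the fields this way is what makes level- and component-based alerting (e.g. on CRITICAL cinder messages) straightforward downstream.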
Event Logs in VMware
Almost all VMware logs
Event logs from vSphere
Warning and error logs from ESXi

SAN storage logs
Error logs from multiple vendors' SAN storage
Log storage for Event logs: Splunk
System Configuration
Splunk v6.4.x (as of Nov 2016)
Using an indexer cluster and a search head cluster

Managing huge data
150+ GB input size per day
30+ TB indexed data size
Alerting and Reporting on Splunk
OpenStack logs
26 alerts
16 dashboards for reporting

VMware logs
68 alerts
12 dashboards for reporting (e.g. visualizing the number of errors)
Useful alerting function
Integrates with PagerDuty

Strong analytical engine
Manages and analyzes almost all types of logs
Manages unmanageable logs
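For context, a minimal sketch of the JSON body shape that PagerDuty's Events API v2 accepts (POST to https://events.pagerduty.com/v2/enqueue), which is the kind of event an alert hook fires; the routing key, summary and source below are placeholders, not our actual Splunk integration:

```python
# Hedged sketch: building a "trigger" event body for PagerDuty Events API v2.
# All concrete values here are placeholders for illustration.
import json

def build_pagerduty_event(routing_key, summary, source, severity="critical"):
    """Build the JSON body for a PagerDuty Events API v2 trigger event."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }

body = json.dumps(build_pagerduty_event(
    "EXAMPLE_ROUTING_KEY",
    "CRITICAL cinder: storage volume backend API error",
    "splunk-search-head",
))
```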
Performance Logging Infrastructure
Log Collector Requirements
Handle log streams
Support various log file formats
Strong parsing engine

User-friendly agent
Minimal computational resource usage
Pluggable architecture
Log Collector: Fluentd, Metricbeat
HVs and Storage Performance logs
OpenStack host logs
Use the Fluentd exec plugin to collect nf_conntrack_count
Metricbeat v5 for cpu, mem, diskio, filesystem and network metrics
VMware HVs and SAN logs
Use an in-house custom Fluentd plugin for collection
Output to InfluxDB and analyze on Grafana
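A hedged sketch of the kind of helper Fluentd's exec plugin could invoke for the nf_conntrack_count case above: read the counter from procfs and print one JSON record per sample. The record fields and hostname are illustrative, not the in-house script itself:

```python
import json
import time

# Default procfs path for the netfilter conntrack entry counter.
CONNTRACK_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"

def read_conntrack_count(path=CONNTRACK_PATH):
    """Return the current number of conntrack entries from procfs."""
    with open(path) as f:
        return int(f.read().strip())

def emit_record(count, hostname):
    """Serialize one sample as a JSON line for Fluentd to pick up."""
    return json.dumps(
        {"time": int(time.time()), "host": hostname, "nf_conntrack_count": count}
    )
```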
VMs Performance logs from Hypervisors
#!/usr/bin/env python
import json
import libvirt

# Connect read-only to the local libvirt daemon and print one JSON
# record per running domain (VM).
conn = libvirt.openReadOnly()
for dom_id in conn.listDomainsID():
    dom = conn.lookupByID(dom_id)
    print(json.dumps({
        "uuid": dom.UUIDString(),
        "name": dom.name(),
        "id": dom.ID(),
        "vcpus": dom.vcpus()[0][3],
    }))
From KVM (OpenStack)
Use the libvirt Python bindings to build custom scripts
Generate JSON data and consume it with the in_tail plugin

From ESXi (VMware)
Get logs from vCenter
Log streaming: Kafka
Kafka Specs
Kafka v0.10.0
Runs on OpenStack, on all-SSD storage

System Configuration
100-500 partitions and 3 replicas per topic
Back up important logs to GCS
Forward to other Kafka clusters (if necessary)
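With 100-500 partitions per topic, keyed records spread deterministically across partitions while records with the same key stay ordered. A minimal sketch of the mapping idea (Kafka's real default partitioner hashes keys with murmur2; the 31-based hash and key below are only illustrative stand-ins):

```python
# Hedged sketch: mapping a record key to one of a topic's partitions.
# Kafka's default partitioner uses murmur2; this hash is a stand-in.
def choose_partition(key: bytes, num_partitions: int) -> int:
    """Deterministically map a record key to a partition index."""
    h = 0
    for b in key:
        h = (h * 31 + b) & 0x7FFFFFFF
    return h % num_partitions

# Keying by source host keeps each hypervisor's logs on one partition,
# preserving per-host ordering (hostname here is illustrative).
p = choose_partition(b"hypervisor-042", 100)
```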
Log storage for Performance logs: InfluxDB and Grafana
InfluxDB
Run InfluxDB v1.1.0 on physical servers
Post to multiple instances via Kafka and Fluentd

Grafana
72 dashboards for visualizing performance data
Access multiple InfluxDBs via a load balancer
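For reference, a hedged sketch of how one performance sample is serialized in InfluxDB 1.x line protocol ("measurement,tags fields timestamp") before being written; the measurement, tag and field names here are illustrative, not our actual schema:

```python
# Hedged sketch: InfluxDB 1.x line-protocol serialization for one point.
# Names and values below are illustrative.
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Serialize one point as 'measurement,tags fields timestamp_ns'."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol(
    "cpu_usage", {"host": "hv01"}, {"value": 42.5}, 1480550400000000000
)
# -> "cpu_usage,host=hv01 value=42.5 1480550400000000000"
```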
Fluentd - Useful Log Collector
Fluentd handles various log formats and makes logs easy to parse
Minimal resource usage

Redundant system
InfluxDB mirroring realized with Kafka and Fluentd
Minimize data loss by transporting logs through Kafka, with GCS as an additional backup
Summary
2 logging engines
Splunk for event logs, InfluxDB for performance logs

Covers all of our requirements
Easy troubleshooting, visualization, analysis and improvement
Our logging infra makes our private cloud a better cloud