How logging makes a private cloud a better cloud - OpenStack Latest Information Seminar (December 2016)

Post on 12-Jan-2017


How logging makes a private cloud a better cloud
Dec/01/2016
Kentaro Sasaki
Global Operations Department, Rakuten, Inc.

Rakuten is … a Tokyo-based e-commerce and Internet company

Rakuten Ecosystem
  The Rakuten Ecosystem and our membership database form the foundation of our business

Membership: 116.52 million people

Gross Transaction Volume: 7.6 trillion JPY

Logging Infrastructure for Private Cloud

Private Cloud at Rakuten

Timeline of Private Cloud History

2010, Gen1
  Hypervisor: Xen
  OS instances: 2,000+
  Management features built from scratch

2012, Gen2
  Hypervisor: VMware ESXi
  OS instances: 25,000+
  Management features built from scratch

2015, Gen3
  Hypervisor: KVM
  Uses the OpenStack API

Logging Matters

Benefits
  Logging enables log visualization
  Analysis and debugging become easier

From a business point of view
  Shortens the time spent on troubleshooting
  Leads to better customer support

Assumptions
  Messages might be unmanageable
  Growing log volume requires huge log storage

Concerns
  How to handle data loss
  How to parse data from different sources

Log Management

High Availability
  Availability, redundancy and scalability

Maintainability
  Minimum data loss and operational overhead

Huge number of targets
  Hundreds of hypervisors (ESXi & KVM)
  Tens of thousands of VMs

Cover many sorts of logs
  Splunk is suited to log analytics
  A time-series DB is needed for performance logs

Splunk

InfluxDB

Overview of Our Logging Infrastructure

Logging Infrastructure

Diagram (components shown): event log and performance log paths; Fluentd, Metricbeat, Kafka, Splunk & PagerDuty, InfluxDB & Grafana, Google Cloud Storage, Cloud Foundry

Event Logging Infrastructure

Event Logs in OpenStack

Huge number of log files
  22 log files in a single cluster
  Logs must be managed for every region and availability zone

Manage unmanageable logs
  CRITICAL messages are unmanageable
  A strong analytical storage engine is needed

Component   # Log files
Nova        8
Keystone    1
Neutron     6
Glance      2
Cinder      5
etc.        etc.

2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
...
2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder
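
The sample lines above follow a regular layout (date, time, process ID, level, logger name, message), which is what makes them parseable before they reach an analytics engine. The following is a minimal Python sketch of that parsing step, assuming every line follows the layout of the sample; the function names and the per-level counting are illustrative and are not taken from the talk.

```python
import re
from collections import Counter

# Layout assumed from the sample lines: date, time, PID, level, logger, message.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<pid>\d+) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>\S+) ?"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def count_by_level(lines):
    """Count parsed lines per (logger, level) pair, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["logger"], record["level"])] += 1
    return counts


# The three complete lines from the slide.
sample = [
    "2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response "
    "from the storage volume backend API: volume group cinder-volumes doesn't exist",
    "2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or "
    "unexpected response from the storage volume backend API: volume group "
    "cinder-volumes doesn't exist",
    "2013-02-25 21:05:51 17409 TRACE cinder",
]

if __name__ == "__main__":
    for key, value in sorted(count_by_level(sample).items()):
        print(key, value)
```

Counting by logger and level is only one possible aggregation; it is chosen here because the reporting slides mention visualizing the number of errors.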

Event Logs in VMware

Almost all VMware logs
  Event logs from vSphere
  Warning and error logs from ESXi

SAN storage logs
  Error logs from multi-vendor SAN storage

Log storage for Event logs: Splunk

System Configuration
  Splunk v6.4.x (as of Nov 2016)
  Uses an indexer cluster and a search head cluster

Manage huge data
  150+ GB of input per day
  30+ TB of indexed data

Charts: input size per day; indexed data size

Alerting and Reporting on Splunk

OpenStack logs
  26 alerts
  16 dashboards for reporting

VMware logs
  68 alerts
  12 dashboards for reporting (e.g. visualizing the number of errors)

Useful alerting function
  Integrates with PagerDuty

Strong analytical engine
  Manages and analyzes almost all types of logs
  Manages unmanageable logs

Performance Logging Infrastructure

Log Collector Requirements

Handle log streams
  Support for various log file formats
  Strong parsing engine

User-friendly agent
  Minimal compute resource usage
  Pluggable architecture

Log Collector: Fluentd, Metricbeat

HVs and Storage Performance logs

OpenStack host logs
  Fluentd exec plugin for getting nf_conntrack_count
  Metricbeat v5 for cpu, mem, diskio, filesystem, network

VMware HV and SAN logs
  In-house custom Fluentd plugin for collecting the logs
  Output to InfluxDB and analyze in Grafana
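
For the nf_conntrack_count item above, Fluentd's exec input plugin runs an external command at an interval and parses what it prints. The script below is a hedged sketch of what such a command could look like, not the script used at Rakuten. It assumes the counter is read from the standard Linux path `/proc/sys/net/netfilter/nf_conntrack_count`; the JSON field names are hypothetical.

```python
#!/usr/bin/env python
import json
import sys
import time

# Standard Linux location of the connection-tracking counter.
CONNTRACK_COUNT_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"


def read_conntrack_count(path=CONNTRACK_COUNT_PATH):
    """Read the current number of tracked connections as an integer."""
    with open(path) as f:
        return int(f.read().strip())


def build_record(count, now=None):
    """Build the record to emit; the field names are illustrative assumptions."""
    return {
        "time": int(now if now is not None else time.time()),
        "nf_conntrack_count": count,
    }


def main(path=CONNTRACK_COUNT_PATH):
    try:
        count = read_conntrack_count(path)
    except (OSError, ValueError) as exc:
        # Report the problem on stderr and emit nothing on stdout.
        sys.stderr.write("cannot read %s: %s\n" % (path, exc))
        return 1
    # One JSON object per run, on a single line.
    print(json.dumps(build_record(count)))
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Printing one JSON object per invocation keeps the script stateless, so the collector decides how often to run it.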

VMs Performance logs from Hypervisors

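#!/usr/bin/env python
import json
import libvirt

# Open a read-only libvirt connection (default URI).
conn = libvirt.openReadOnly()

# listDomainsID() returns the IDs of the running domains.
for dom_id in conn.listDomainsID():
    dom = conn.lookupByID(dom_id)
    # Print one JSON object per line so the output can be tailed.
    print(json.dumps({
        "uuid": dom.UUIDString(),
        "name": dom.name(),
        "id": dom.ID(),
        "vcpus": dom.vcpus()[0][3],
    }))

From KVM (OpenStack)
  Use the libvirt Python bindings to build custom scripts
  Generate JSON data and read it with the in_tail plugin

From ESXi (VMware)
  Get logs from vCenter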

Log streaming: Kafka

Kafka Specs
  Kafka v0.10.0
  Runs on OpenStack, entirely on SSDs

System Configuration
  100 to 500 partitions and 3 replicas per topic
  Important logs are backed up to GCS
  Transfer to another Kafka cluster (if necessary)

Diagram (components shown): Kafka, Google Cloud Storage, Kafka

Log storage for Performance logs:InfluxDB and Grafana

InfluxDB
  InfluxDB v1.1.0 runs on physical servers
  Data is posted to multiple InfluxDB instances through Kafka and Fluentd

Grafana
  72 dashboards for visualizing performance data
  Accesses multiple InfluxDB instances via a load balancer

Diagram (components shown): Kafka, Grafana
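
InfluxDB 1.x accepts writes in its text-based line protocol: a measurement name, optional comma-separated tags, one or more fields, and an optional timestamp. The sketch below shows how a JSON-style performance record could be turned into one such line. The talk does not show this step, which the Fluentd output side would normally handle; the measurement, tag and field names here are hypothetical, and non-finite floats are not handled.

```python
def _escape(value, chars):
    """Backslash-escape each character of `chars` that occurs in `value`."""
    text = str(value)
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Render a field value: booleans, integers (suffix i), floats, quoted strings."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return "%di" % value
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def to_line(measurement, tags, fields, timestamp_ns=None):
    """Build one line-protocol line; tags and fields are emitted sorted by key."""
    if not fields:
        raise ValueError("at least one field is required")
    line = _escape(measurement, ", ")
    for key in sorted(tags):
        line += ",%s=%s" % (_escape(key, ",= "), _escape(tags[key], ",= "))
    line += " " + ",".join(
        "%s=%s" % (_escape(key, ",= "), _format_field(fields[key]))
        for key in sorted(fields)
    )
    if timestamp_ns is not None:
        line += " %d" % timestamp_ns
    return line


if __name__ == "__main__":
    # Hypothetical record shaped like the per-VM JSON produced on the hypervisors.
    print(to_line(
        "vm_cpu",
        {"uuid": "abc", "host": "hv 01"},
        {"vcpus": 4, "usage": 12.5},
        1480550400000000000,
    ))
```

Run as a script, this prints `vm_cpu,host=hv\ 01,uuid=abc usage=12.5,vcpus=4i 1480550400000000000`. Identifiers such as the VM UUID are placed in tags because tags are indexed, while measured values go in fields.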

Fluentd: a useful log collector
  Handles various log formats and makes logs easy to parse
  Minimal resource usage

Redundant system
  InfluxDB mirroring is implemented with Kafka and Fluentd
  Data loss is minimized by transporting logs to Kafka, with GCS as an additional backup
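
The slides do not show how the mirroring is wired up. One common arrangement, assumed here, is that each InfluxDB instance has its own consumer that reads the same Kafka topic at its own offset, so an instance that was unavailable can catch up later from the retained log. The sketch below models that idea in plain Python with an in-memory list standing in for a Kafka partition; it does not use the Kafka or Fluentd APIs, and all class names are hypothetical.

```python
class TopicLog:
    """In-memory stand-in for one Kafka partition: an append-only list of records."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read_from(self, offset):
        """Return every record at or after `offset`."""
        return self._records[offset:]


class ListSink:
    """Hypothetical sink standing in for one InfluxDB instance; can be switched off."""

    def __init__(self):
        self.records = []
        self.available = True

    def __call__(self, record):
        if not self.available:
            raise ConnectionError("sink is down")
        self.records.append(record)


class MirrorConsumer:
    """Reads the shared log at its own offset and writes each record to one sink."""

    def __init__(self, log, sink):
        self.log = log
        self.sink = sink   # callable that stores one record and may raise
        self.offset = 0    # index of the next record to deliver

    def poll(self):
        """Deliver pending records in order; stop at the first failure."""
        delivered = 0
        for record in self.log.read_from(self.offset):
            try:
                self.sink(record)
            except Exception:
                break      # keep the offset so this record is retried later
            self.offset += 1
            delivered += 1
        return delivered


if __name__ == "__main__":
    log = TopicLog()
    primary, secondary = ListSink(), ListSink()
    consumers = [MirrorConsumer(log, primary), MirrorConsumer(log, secondary)]

    secondary.available = False          # simulate an outage of one instance
    log.append("cpu,host=hv01 usage=12.5")
    print([c.poll() for c in consumers])

    secondary.available = True           # the instance comes back and catches up
    print([c.poll() for c in consumers])
    print(primary.records == secondary.records)
```

Because each consumer tracks its own offset, a failure on one side never blocks or duplicates writes on the other, which is the property the redundancy slide relies on.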

Summary

Two logging engines
  Splunk for event logs, InfluxDB for performance logs

Covers all of our requirements
  Easier troubleshooting, visualization, analysis and improvement

Our logging infra makes our private cloud a better cloud
