NTT DOCOMO case study, OpenStack Summit 2015 Tokyo talk: "After One Year of OpenStack Cloud...

Post on 06-Jan-2017


Copyright©2015 NTT DOCOMO, INC. All rights reserved.

After One Year of OpenStack Cloud Operation (NTT DOCOMO)

Ken Igarashi, NTT DOCOMO Inc.

Asako Ishigaki, NTT Software

Akihiro Motoki, NEC

About Us

Ken Igarashi
○ Leading the OpenStack project at NTT DOCOMO
○ One of the first members to propose OpenStack Bare Metal Provisioning (currently called "Ironic") - bit.ly/1stuN2E

Asako Ishigaki
○ Engineer, NTT Software
○ Developing OpenStack log collection and analytics tools

Akihiro Motoki
○ Senior Research Engineer, NEC
○ Core developer of Neutron and Horizon

Our Project

organization


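Project timeline (phases and the figures shown in parentheses on the slide):
• Scalable Test using 100 nodes (10)
• System Design (8)
• Recovery Tests (12)
• Racking and Cabling (14)
• 24/7 support (14)
• User Support (+x)
Time axis: 2014-6, 2014-8, 2014-11, 2015-2, 2015-5, 2015-8, 2015-11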


o Team Rules (Culture)
Focus on using OpenStack instead of developing OpenStack
• Think about how to use it. Don't think "OpenStack can't do XXXX".
Reducing Opex / Promoting Automation
• Operation tools: "Anything that a human needs to do more than twice must be automated."
• Reduce the number of operators through HA and self-healing.


o Tools
Ansible, Python, Shell Script
CI/CD
• pep-8
• Ansible-lint
• Install
Workflow stages (labels on the slide): Spec Writing, Test, Review, Production
5200+ deployments (2015)
2000+ patches (2015)
Deployment Procedure


Operation

HA, Automation


o OpenStack Configuration (http://bit.ly/1DbJPUO)
• Double redundancy for hardware
• Triple redundancy for software
Diagram (labels): VMs on Nova; OpenStack APIs; MySQL (Galera) with DB1, DB2, DB3, DB4 and an Arbitrator; RabbitMQ; Neutron Agents; LB; Zabbix; MaaS; PXE, DNS, DHCP


o Deployment
CMDB Registration

o Choose playbooks for Ansible
Dynamic Inventory

Ansible
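The slide names a CMDB and an Ansible dynamic inventory but does not show how they are connected. As a minimal sketch only, assuming the CMDB can be read as a list of host rows, the following builds the JSON structure that Ansible expects from a dynamic inventory script. The `CMDB_HOSTS` rows, host names, roles and addresses are hypothetical and are not taken from the talk.

```python
import json
from typing import Any, Dict, List

# Hypothetical CMDB export; the real CMDB schema is not shown in the slides.
CMDB_HOSTS: List[Dict[str, str]] = [
    {"name": "compute01", "role": "compute", "ip": "192.0.2.11"},
    {"name": "compute02", "role": "compute", "ip": "192.0.2.12"},
    {"name": "network01", "role": "network", "ip": "192.0.2.21"},
]


def build_inventory(hosts: List[Dict[str, str]]) -> Dict[str, Any]:
    """Group hosts by role in Ansible's dynamic inventory JSON layout."""
    inventory: Dict[str, Any] = {"_meta": {"hostvars": {}}}
    for host in hosts:
        # One inventory group per CMDB role.
        group = inventory.setdefault(host["role"], {"hosts": []})
        group["hosts"].append(host["name"])
        # Per-host variables go under _meta.hostvars.
        inventory["_meta"]["hostvars"][host["name"]] = {"ansible_host": host["ip"]}
    return inventory


if __name__ == "__main__":
    # Ansible calls an inventory script with --list and reads JSON from stdout.
    print(json.dumps(build_inventory(CMDB_HOSTS), indent=2))
```

With an inventory grouped by role like this, a playbook can be chosen per group, for example one for `compute` hosts and another for `network` hosts.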


o Deployments
• Common: network, account, logging, Zabbix agent, drivers/firmware x 37
• OpenStack: Nova, Swift, Neutron, ……. x 62
• HA Configuration
Diagram (labels): compile, initial setup, update; kernel, driver, firmware, filesystem, development environment; Install HDD Driver


o Operation x 31
• Common: process restart, log correction
• OpenStack Operation: usage, VM migration/backup, user add/delete/quota change
• OpenStack Monitoring: health check tools

Per-host instance check
• Launch instances on given node(s)
• Boot succeeds, instance log
• Metadata retrieval, login prompt, SSH access
• Optionally, test volume attach and its read/write access


o 2015/10/27 4:40pm - 5:20pm Heian (New Takanawa)

What are operators doing behind the Cloud?


Monitoring System

monitoring


o Monitoring System
Coverage (labels): weekday daytime; 24h / 365d
Monitored components: VMs, Swift, Cinder, Nova, RabbitMQ, Neutron Agents, Data Bases
Monitoring stack: Fluentd, Elasticsearch, Kibana, Zabbix


Category    Targets                                                           Monitoring Items   Self Healing
General     Memory, CPU, Network, HDD                                         1,970              25
OpenStack   VMs, Swift, Cinder, Nova, RabbitMQ, Neutron Agents, Data Bases    3,957              59


o RabbitMQ
Configuration
• 3 node cluster
• cluster_partition_handling, autoheal
Monitoring
• Split Brain check: rabbitmqctl eval '[N||{partitions,N}<-rabbit_mnesia:status()].'
• Port Check (5672, 25672)
• Process Check: beam.smp, rabbitmq-server
• At least one node running (1/3)
  – {Openstack-RabbitMQ:grpsum["HostG-RabbitMQ","net.tcp.service[tcp,,25672]",last,0].count(#3,0,"eq")}=3
  – {OpenStack-RabbitMQ:grpsum["HostG-RabbitMQ","proc.num[beam.smp]",last,0].count(#3,0,"eq")}=3
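The two Zabbix expressions can be read as "fire when the group-wide sum has been 0 for the last three checks", that is, when no RabbitMQ node in the host group has the port open or the process running. That reading is an interpretation of the trigger syntax, not something the slide spells out. A small sketch of the same logic, with hypothetical function names:

```python
from typing import Sequence


def group_sum(node_values: Sequence[int]) -> int:
    """Mimic grpsum: add up the latest value reported by each node."""
    return sum(node_values)


def no_node_running(history: Sequence[Sequence[int]], window: int = 3) -> bool:
    """Mimic count(#3,0,"eq")=3: the last `window` group sums all equal 0.

    `history` holds one entry per check, oldest first; each entry lists the
    per-node values (1 = port open or process present, 0 = not).
    """
    if len(history) < window:
        return False  # not enough samples yet to evaluate the trigger
    recent = history[-window:]
    zero_checks = sum(1 for sample in recent if group_sum(sample) == 0)
    return zero_checks == window
```

For example, `no_node_running([[1, 1, 1], [0, 0, 0], [0, 0, 0], [0, 0, 0]])` evaluates to true, while a history in which any one node answered during the last three checks does not fire.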


o MySQL
Configuration
• 4 Nodes + 1 Arbitrator
Monitoring
• Cluster Check
  – wsrep_local_recv_queue
  – wsrep_local_send_queue
  – wsrep_flow_control_paused
  – wsrep_local_commits
Diagram (labels): Arbitrator, LB, R/W
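The slide lists which Galera status variables are watched but not the thresholds. The sketch below shows one way such a cluster check could be evaluated from the output of `SHOW GLOBAL STATUS`; the threshold values and function names are assumptions chosen for illustration, not the values used in production.

```python
from typing import Dict, List

# Hypothetical thresholds; the talk does not state the actual limits.
THRESHOLDS: Dict[str, float] = {
    "wsrep_local_recv_queue": 10.0,
    "wsrep_local_send_queue": 10.0,
    "wsrep_flow_control_paused": 0.1,
}


def cluster_warnings(status: Dict[str, str]) -> List[str]:
    """Return one message per status variable that exceeds its threshold.

    `status` maps variable names to the string values MySQL reports.
    """
    warnings: List[str] = []
    for name, limit in THRESHOLDS.items():
        value = float(status.get(name, "0"))
        if value > limit:
            warnings.append(f"{name}={value:g} exceeds {limit:g}")
    return warnings


def commits_stalled(previous_commits: int, current_commits: int) -> bool:
    """wsrep_local_commits is a counter; no growth between polls suggests a stall."""
    return current_commits == previous_commits
```

A growing receive queue or a high flow-control pause fraction on one node is the kind of signal that would precede the freeze described on the following slides.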


o MySQL Cluster
Diagram (labels): Client, MySQL, Master, Slave, Galera, send_queue, recv_queue, Commit, Replication, Disk, OK
Wait until OK is received from replication


o MySQL Cluster Freeze
Diagram (labels): Client, MySQL, Master, Slave, Galera, send_queue, recv_queue, Commit, Replication, Disk, OK 👿
Wait until OK is received from replication
• Disk Failure: removed from cluster 😀
• Disk Speed Throttling: 😢


Prohibited Actions while MySQL Cluster Freeze

○ Prohibit some self-healing actions
• Do not restart some OpenStack processes
  – neutron-plugin-openvswitch-agent
• Do not reboot network nodes
  – lose network reachability (can't recreate network namespace)

All the VMs lose connections

Solved at Liberty?


o Throttling happens during DB backup
Backup Method
• LOCK TABLES FOR BACKUP (online)
• Limit Backup Node
  1. Take from cluster (Donor/Desynced)
  2. DB lock and do backup (FLUSH TABLES WITH READ LOCK)
  3. Return to cluster (wsrep_desync=OFF)
     – wsrep_local_recv_queue
     – wsrep_local_commits
Diagram (labels): LB, R/W
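The three "Limit Backup Node" steps can be sketched as follows. This is an illustration under assumptions, not the team's actual tool: `execute`, `read_recv_queue` and `run_backup` are hypothetical callables supplied by the caller, all statements are assumed to run in one MySQL session (the read lock is held per session), and waiting for wsrep_local_recv_queue to drain before rejoining is one reading of why that variable is listed under step 3.

```python
import time
from typing import Callable


def backup_on_desynced_node(
    execute: Callable[[str], None],
    read_recv_queue: Callable[[], int],
    run_backup: Callable[[], None],
    max_polls: int = 60,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Desync one Galera node, back it up under a read lock, then rejoin.

    Returns True if the node was returned to the cluster, False if the
    receive queue did not drain within `max_polls` polls.
    """
    # 1. Take the node out of flow control (Donor/Desynced).
    execute("SET GLOBAL wsrep_desync = ON")
    # 2. Lock the database and run the backup.
    execute("FLUSH TABLES WITH READ LOCK")
    try:
        run_backup()
    finally:
        execute("UNLOCK TABLES")
    # 3. Rejoin only after the node has caught up with the cluster.
    for _ in range(max_polls):
        if read_recv_queue() == 0:
            execute("SET GLOBAL wsrep_desync = OFF")
            return True
        sleep(interval_s)
    return False
```

Confining the backup to one desynced node keeps its disk throttling from stalling commits on the nodes that serve read/write traffic through the LB.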


Log Analytics

Kibana


Purpose of Log Analytics

(1) Detect critical system failures: we have to recover immediately

(2) Detect malicious access: we need to notify users

(3) Detect non-critical errors: better to fix them as soon as possible

(4) Find errors/warnings that have no service impact: we want to filter them out next time


Treasure Hunt in The Ocean of Logs

○ e.g. logs of a day
• Total: 100 GB, 80M lines
• Sum of critical, error and warning logs: 200K lines
• Meaningful logs are far fewer:
  (1) 0 critical failures
  (2) 0 malicious accesses
  (3) 6 non-critical failures
  (4) 6 ignorable failures

Chart: Breakdown of Logs (legend: Critical, Error, Warning, Info, Debug, Other; values as extracted: 0%, 0%, 1%, 30%, 39%, 30%)

Chart: log sources (legend: HW, OS, OpenStack backend, OpenStack, Operation tools; values as extracted: 0%, 24%, 23%, 49%, 3%)


Log Analytics Based on White/Black List

○ We analyze logs to enhance our black list and white list.
○ Logs found in our black list are sent to Zabbix.

Diagram (labels): Logs, black list (add, expand), white list (add, expand), Zabbix, Kibana, trash, analyze, reduce


How to Adopt Black/White List Using Fluentd

Diagram (labels): Network Node, Control Node, Compute Node, LB, UTM, rsyslog, fluentd, Log Server (Fluentd, zabbix_sender, Elasticsearch), Zabbix (alerts), Kibana (graphs)

• Add "ignorable" flag according to white list
• Put metadata to create graphs from the logs
• Notify Zabbix according to black list


How to Adopt Black/White List Using Fluentd

Log Server: Fluentd, zabbix_sender, Elasticsearch

Incoming logs:
1. syslog: 10:01 crit: hardware failure
2. rsyslog: 10:03 warn: IDS: from x.x.x.x
3. api.log: 10:04 ERROR: invalid request format

Records after processing:

Field       Log 1              Log 2               Log 3
path        syslog             rsyslog             api.log
timestamp   10:01              10:03               10:04
severity    crit               warn                ERROR
item        -                  ids                 ignore
source_ip   -                  x.x.x.x             -
message     hardware failure   IDS: from x.x.x.x   invalid request format

Outputs: Zabbix (hardware failure); Kibana (IDS graph, crit graph)
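The record fields on this slide (path, timestamp, severity, item, source_ip, message) suggest a tagging step between log collection and Elasticsearch. The sketch below reproduces the three example records in plain Python rather than as a Fluentd configuration; the line format, the one-entry white list and the function name are assumptions made for illustration.

```python
import re
from typing import Dict, Optional

# Assumed line layout "HH:MM severity: message", taken from the slide's examples.
LINE_PATTERN = re.compile(r"^(?P<timestamp>\d{2}:\d{2}) (?P<severity>\w+): (?P<message>.*)$")
IDS_PATTERN = re.compile(r"^IDS: from (?P<ip>\S+)$")
# Hypothetical white list with a single entry matching the slide's third log.
WHITE_LIST = [re.compile(r"^invalid request format$")]


def to_record(path: str, line: str) -> Optional[Dict[str, str]]:
    """Parse one log line and attach the `item` and `source_ip` metadata."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None  # line does not follow the assumed layout
    record = {"path": path, **match.groupdict(), "item": "-", "source_ip": "-"}
    ids = IDS_PATTERN.match(record["message"])
    if ids:
        # Metadata used to draw the IDS graph.
        record["item"] = "ids"
        record["source_ip"] = ids.group("ip")
    elif any(pattern.search(record["message"]) for pattern in WHITE_LIST):
        # White-listed messages are only flagged, not dropped.
        record["item"] = "ignore"
    return record
```

Flagging rather than discarding white-listed records keeps them searchable in Kibana while letting dashboards filter them out.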


Example of Our White List (with Juno)

1.
• Count response codes and understand the trend. That's enough.

^keystonemiddleware\.auth_token \[\-\] Unable to find authentication token in headers$

2.
• This ERROR means the user's operation was denied due to quota.
• It has no impact on our system. Should it be an INFO log?

^nova\.api\.openstack \[[^\]]*\] Caught error: VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds allowed Gigabytes quota\..*$

3.
• This WARNING is caused by the presence of SHUTOFF instances.
• It is a commonplace condition and needs to be ignored.

^nova.scheduler.host_manager \[[^\]]+\] Host has more disk space than database expected .*$
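To see how white-list entries like these behave, the three patterns from the slide can be compiled and applied as they are. The sample messages in the usage note are made up for illustration (request IDs and sizes are hypothetical), and `is_ignorable` is not a name from the talk.

```python
import re
from typing import List, Pattern

# The three white-list patterns, copied verbatim from the slide.
WHITE_LIST: List[Pattern[str]] = [
    re.compile(r"^keystonemiddleware\.auth_token \[\-\] Unable to find authentication token in headers$"),
    re.compile(
        r"^nova\.api\.openstack \[[^\]]*\] Caught error: VolumeSizeExceedsAvailableQuota: "
        r"Requested volume or snapshot exceeds allowed Gigabytes quota\..*$"
    ),
    re.compile(r"^nova.scheduler.host_manager \[[^\]]+\] Host has more disk space than database expected .*$"),
]


def is_ignorable(message: str) -> bool:
    """True when the message matches any white-list pattern."""
    return any(pattern.search(message) for pattern in WHITE_LIST)
```

For instance, `is_ignorable("keystonemiddleware.auth_token [-] Unable to find authentication token in headers")` is true, while an unrelated kernel message is not matched. Note that the third pattern leaves its dots unescaped, so it matches slightly more than the literal logger name.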


Effect of Our White List

○ We succeeded in reducing the logs to be analyzed. In other words, many meaningless logs have high log levels.

Without White List: 160K

With White List: 37 (a 99.98% reduction)

1 year ago: we couldn't analyze all logs in a day

Today: we can analyze all logs in 2-3 hours a day!


Example of Our Black List

1. Warning alert
• This message indicates a disk problem on a Compute node.

^kernel: \[[^\]]*\] XXXXX.*hardware failure\.$

2. Information alert
• Corosync needs to clean up its resources.

^pengine: warning: unpack_rsc_op:Processing failed op monitor for .*$

3. Information alert
• Full backup of MySQL failed once.

^mysql_fullbackup\[\d+\]:\sFailed\sto\sMySQL\sfullbackup.*$
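A black-list entry pairs a pattern with an alert level, and matches are forwarded to Zabbix. The sketch below uses the three patterns as printed (including the slide's `XXXXX` placeholder, which is left untouched) and the alert levels in the order the slide shows them. The function names and the server, host and item key passed to `zabbix_sender` are hypothetical; only the `-z`, `-s`, `-k` and `-o` options are standard `zabbix_sender` flags.

```python
import re
from typing import List, Optional, Pattern, Tuple

# (pattern, alert level) pairs; patterns copied verbatim from the slide.
BLACK_LIST: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"^kernel: \[[^\]]*\] XXXXX.*hardware failure\.$"), "warning"),
    (re.compile(r"^pengine: warning: unpack_rsc_op:Processing failed op monitor for .*$"), "information"),
    (re.compile(r"^mysql_fullbackup\[\d+\]:\sFailed\sto\sMySQL\sfullbackup.*$"), "information"),
]


def classify(message: str) -> Optional[str]:
    """Return the alert level of the first matching black-list entry, if any."""
    for pattern, level in BLACK_LIST:
        if pattern.search(message):
            return level
    return None


def zabbix_sender_args(server: str, host: str, key: str, value: str) -> List[str]:
    """Build a zabbix_sender command line; run it with subprocess if desired."""
    return ["zabbix_sender", "-z", server, "-s", host, "-k", key, "-o", value]
```

A caller could combine the two: when `classify(message)` returns a level, send the message as the value of a trapper item whose key encodes that level.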


Demonstration with Kibana

○ 3 dashboards
Labels on the slide: OpenStack, All Logs, Error Logs, Critical Logs, Warning Logs, IDS


Trademarks
○ Kibana is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
○ Elasticsearch is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
○ logstash is a trademark of Elasticsearch BV.


o Presentation - Operation
• 2015/10/27 4:40pm - 5:20pm, Heian (New Takanawa): "What are operators doing behind the Cloud?"

o Exhibition
• NEC Booth (H4): 28 (Wed.) 10:45-13:00, 16:30-18:30; 29 (Thu.) 9:00-14:00
• NTT Group Booth (S14): 28 (Wed.) 13:15-16:15, "Touch and Feel! NTT DOCOMO's Cloud Operation"

contact-cloudpf-ml@nttdocomo.com

Thank you for your attention.
