![Page 1: Raft在百度云存储实践 - open.qiniudn.comopen.qiniudn.com/ecug-2016/raft-in-baidu-storage-practice.pdf · Table Engine File Engine KV Engine Replication Recovery Control Permission](https://reader033.vdocuments.pub/reader033/viewer/2022060706/607223803a51d411872a0165/html5/thumbnails/1.jpg)
Raft in Practice in Baidu Cloud Storage
Wang Yao (王耀)
2017/01/15
About Me
• Wang Yao (王耀)
• Senior Architect, Baidu Cloud
• Veteran wheel-reinventer
• Seasoned distributed storage practitioner
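Distributed Storage in the ABC Era
• Diversity
  • Content: web pages, ads, logs, UGC
  • Type: text, images, video
  • Form: structured, semi-structured, unstructured
• Capacity
  • EB-scale storage demand
  • Hundreds of PB of new data every day
  • Long-term data backup
• Performance
  • High throughput
  • Low latency
  • Horizontally scalable performance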
Baidu Cloud Storage
• Categories
  • Message queue
  • File system
  • Block storage
  • Object storage
  • Table storage
CCDB Storage Architecture
• Hardware: Memory, SSD, Disk
• Block: Replica Block System, Raid-like Block System
• Engine: Table Engine, File Engine, KV Engine
• Distributed: Replication, Recovery, Control
• Platform: Permission, Isolation, Priority
• Interface: Table, File, Object
AFS: the New Storage Architecture
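Problems Facing Distributed Storage
• How to shard
• How to replicate
• How to repair
  • Node joins
  • Node leaves
• How to balance load
• How to avoid nodes with slow IO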
Consistent Replication Protocols
• Basic Paxos
• Multi Paxos
• Viewstamped Replication
• QJM
• ZAB
Introduction to Raft
• Leader Election
• Log Replication
• Membership Change
• Log Compaction
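Log Replication rests on one rule: an entry counts as committed once a majority of the group has stored it. The following is a minimal Python sketch of how a leader might derive the commit index from per-node match indexes. The function name `commit_index` and the list layout are illustrative assumptions, not libraft's API, and the extra Raft condition that the entry must belong to the leader's current term is left out for brevity.

```python
def commit_index(match_indexes):
    """Return the highest log index stored on a majority of nodes.

    match_indexes: last replicated log index per node, leader included
    (hypothetical layout, chosen for this sketch).
    """
    if not match_indexes:
        raise ValueError("need at least one node")
    ordered = sorted(match_indexes, reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th largest value is held by at least `majority` nodes.
    return ordered[majority - 1]
```

With three nodes at indexes 10, 7 and 5, two nodes hold everything up to index 7, so 7 is the commit index even though the slowest node lags behind.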
Raft: Node State Transitions
Raft: Log State Transitions
libraft: Replication and Repair
Raft in Distributed Storage
• Core Building
  • Lock
  • Block
  • Queue
  • Table
  • File
  • NewSQL
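libraft: Our Own Wheel
• State of the industry
– Few C++ implementations
– Mostly ZooKeeper-like services
– Incomplete feature sets
– Poor performance
– Insufficient testing
• Goals
– High performance
– General-purpose library
– Custom storage
– Complete feature set
– prevote
– leader transfer
– Trustworthy testing
– jepsen test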
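libraft: WAL
• Challenges
  • IO isolation for the WAL
  • The WAL blocks the Raft state machine
  • WAL double writes hurt throughput
• Caching
  • Cache the most recent Entries in memory
• Asynchrony
  • Asynchronous WAL writes
• Batching
  • The Replicator sends Entries in batches
  • LogStorage writes Entries in batches
Diagram labels: raft, WAL, replicator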
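The caching and batching ideas above can be sketched with a toy log store. This is a minimal Python illustration under stated assumptions: the class `BatchingLog`, its fields, and the in-memory list standing in for the disk are all hypothetical, and a real WAL would perform the flush on a separate IO thread.

```python
from collections import deque


class BatchingLog:
    """Toy log store: callers enqueue entries, flush() persists them as one batch."""

    def __init__(self, cache_size=4):
        self.pending = []   # entries waiting for the next batch write
        self.disk = []      # stands in for the on-disk WAL
        self.flushes = 0    # number of simulated write+fsync calls
        self.cache = deque(maxlen=cache_size)  # most recent entries, kept in memory

    def append(self, entry):
        # In-memory only: no simulated IO happens on this path.
        self.pending.append(entry)
        self.cache.append(entry)

    def flush(self):
        # One simulated IO covers every entry queued since the last flush.
        if not self.pending:
            return 0
        batch, self.pending = self.pending, []
        self.disk.extend(batch)
        self.flushes += 1
        return len(batch)
```

Five appends followed by one `flush()` cost a single simulated IO instead of five, while the bounded `cache` keeps the latest entries available for replication without rereading the log.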
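libraft: prevote
• Symmetric network partition
  • Incrementing the term forces the leader to step down
• prevote stops nodes with incomplete data from starting an election
  • Nodes that do not belong to the replication group
  • Nodes that belong to the replication group but are cut off by a network partition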
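A prevote round can be sketched as a dry run of the election: the would-be candidate asks its peers whether they would vote for it, and nobody's term changes until a majority says yes. The Python below is a minimal sketch, not libraft's implementation; the `Peer` fields, `leader_alive` in particular, and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    term: int
    last_log_term: int
    last_log_index: int
    leader_alive: bool  # heard from a leader within the election timeout


def grants_prevote(peer, next_term, last_log_term, last_log_index):
    """Would `peer` vote for the candidate? Asking changes no one's term."""
    if peer.leader_alive:
        return False  # a healthy leader exists, so no election is needed
    if next_term < peer.term:
        return False  # the candidate is behind this peer's term
    # The candidate's log must be at least as up to date as the peer's.
    return (last_log_term, last_log_index) >= (peer.last_log_term, peer.last_log_index)


def should_start_election(peers, next_term, last_log_term, last_log_index):
    """Start a real election only if a majority (self included) would vote yes."""
    yes = 1 + sum(
        grants_prevote(p, next_term, last_log_term, last_log_index) for p in peers
    )
    return yes > (len(peers) + 1) // 2
```

A node that was partitioned away and then rejoins sees peers that still hear from the leader, so its prevote fails and it never bumps the term that would have forced the leader to step down.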
libraft: leader transfer
• Implementation
  • TimeoutNow
• Use cases
  • rebalance
  • remove leader
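One common way to build leader transfer on TimeoutNow is: stop accepting new proposals, wait until the target has the whole log, then tell it to start an election immediately. The Python below sketches that flow under assumptions; the `Leader` class, its method names, and the recorded `sent` list are hypothetical and do not describe libraft's actual interface.

```python
class Leader:
    """Toy leader that hands leadership to a target once it has caught up."""

    def __init__(self, last_index, match_index):
        self.last_index = last_index          # index of the leader's last log entry
        self.match_index = dict(match_index)  # peer -> highest index known replicated
        self.target = None                    # transfer target, if a transfer is running
        self.sent = []                        # messages "sent", recorded for inspection

    def propose(self):
        # New proposals are refused during a transfer so the target can catch up.
        if self.target is not None:
            return False
        self.last_index += 1
        return True

    def start_transfer(self, target):
        self.target = target
        self._maybe_send_timeout_now()

    def on_replicated(self, peer, index):
        self.match_index[peer] = max(self.match_index.get(peer, 0), index)
        if peer == self.target:
            self._maybe_send_timeout_now()

    def _maybe_send_timeout_now(self):
        caught_up = self.match_index.get(self.target, 0) >= self.last_index
        message = ("TimeoutNow", self.target)
        if caught_up and message not in self.sent:
            # The target skips its election timeout and campaigns right away.
            self.sent.append(message)
```

Because the target already holds every entry when it campaigns, it can win the election quickly, which keeps the unavailability window short during a rebalance or when the current leader is being removed.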
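libraft: tips
• Clear the state machine at the start of on_snapshot_load
• on_apply must produce the same result on leader and followers
• on_leader_stop must cancel all leader-related tasks
• Attach the term to each proposal to keep non-idempotent operations safe
• Add a version mechanism to PeerId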
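The first four tips can be illustrated with a toy state machine. The callback names `on_apply`, `on_snapshot_load` and `on_leader_stop` come from the slide, but the Python signatures, the `CounterStateMachine` class and the `accept_proposal` helper are assumptions made for this sketch rather than libraft's real interface.

```python
class CounterStateMachine:
    """Toy replicated counter map used to illustrate the callbacks."""

    def __init__(self):
        self.values = {}
        self.leader_tasks = []  # background work only the leader should run

    def on_apply(self, entry):
        # Deterministic: depends only on the entry, never on clocks or
        # randomness, so every replica reaches the same result.
        key, delta = entry
        self.values[key] = self.values.get(key, 0) + delta

    def on_snapshot_load(self, snapshot):
        # Clear first, so leftovers from the old state cannot mix with the snapshot.
        self.values.clear()
        self.values.update(snapshot)

    def on_leader_stop(self):
        # Drop leader-only tasks as soon as leadership is lost.
        self.leader_tasks.clear()


def accept_proposal(current_term, expected_term):
    """Reject a proposal that was prepared under a different term."""
    return current_term == expected_term
```

The term check matters for non-idempotent operations such as the increment above: a proposal built from state read in term 3 should not be applied after leadership changed in term 4, because the state it was based on may no longer hold.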
libraft: benchmark
[Chart: throughput, raft vs. fio; x-axis ticks 512, 1024, 4096, 8192, 16384, 32768; y-axis 0 to 600]
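Introduction to CDS
• Cloud Disk Service
  • Provides scalable block-level storage volumes for virtual machines.
• Features
  • High reliability
  • High stability
  • High performance
Diagram labels: vdisk, elasticity, snapshots, multiple replicas, high availability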
CDS Data Model
• A Volume is split into Blocks
• Blocks are aggregated into BlockGroups
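To make the two-step mapping concrete, here is a minimal Python sketch that locates a byte offset of a volume. The block size, the number of blocks per group, and the choice of grouping consecutive blocks are all assumptions for illustration; the slides do not state how CDS actually sizes or groups blocks.

```python
def locate(offset, block_size=4 * 1024 * 1024, blocks_per_group=8):
    """Map a volume byte offset to (group_id, block_id, offset_in_block).

    block_size and blocks_per_group are hypothetical values.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    block_id, offset_in_block = divmod(offset, block_size)
    # Consecutive blocks share a group in this sketch.
    group_id = block_id // blocks_per_group
    return group_id, block_id, offset_in_block
```

Grouping blocks means replication state is tracked per BlockGroup rather than per Block, which keeps the number of replicated units manageable for large volumes.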
CDS Logical Data Distribution
• Two-level distribution
  • Pool
  • ReplicaGroup
CDS Physical Data Distribution
• Five levels of isolation
  • Region
  • Zone
  • Rack
  • Node
  • Disk
[Figure: data access within an AZ; Rack1 to Rack4 hold Node 1 to Node 4 and Node 11 to Node 14, with numbered replicas spread across the nodes]
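Isolation by failure domain usually means that replicas of the same data never share a domain. The Python below sketches this at the rack level only, picking the least-loaded node in each rack and then the least-loaded racks. The function name, the input layout and the load metric are assumptions; the slides do not describe CDS's actual placement algorithm.

```python
def place_replicas(nodes, replicas=3):
    """Pick `replicas` nodes on pairwise distinct racks.

    nodes: dict mapping node name -> (rack, load); a hypothetical layout.
    """
    best_per_rack = {}
    for node, (rack, load) in sorted(nodes.items()):
        # Keep only the least-loaded node seen so far in each rack.
        if rack not in best_per_rack or load < best_per_rack[rack][0]:
            best_per_rack[rack] = (load, node)
    if len(best_per_rack) < replicas:
        raise ValueError("not enough racks for the requested replica count")
    chosen = sorted(best_per_rack.values())[:replicas]
    return [node for _load, node in chosen]
```

The same idea extends upward to zones and regions and downward to disks by applying the distinctness constraint at each level.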
CDS Architecture
CDS Snapshots
CDS Rollback
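CDS Long-Tail Request Optimization
• Quorum writes cut the tail latency of write requests
• Backup requests cut the tail latency of read requests
• Periodic reports drive leader/follower balancing and data migration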
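Why the first two techniques trim the tail can be shown with simple latency arithmetic. The Python below is a sketch with hypothetical function names and made-up latency numbers: a quorum write completes when the majority-th fastest replica acknowledges, and a backup request races a second replica against a slow primary.

```python
def quorum_write_latency(latencies):
    """Latency seen by the client when a write returns after a majority acks."""
    if not latencies:
        raise ValueError("need at least one replica")
    majority = len(latencies) // 2 + 1
    return sorted(latencies)[majority - 1]


def backup_read_latency(primary, backup, backup_delay):
    """Read latency when a backup request is sent after `backup_delay`."""
    if primary <= backup_delay:
        return primary  # the primary answered before a backup was needed
    # Otherwise the first of the two responses wins.
    return min(primary, backup_delay + backup)
```

With three replicas taking 2, 3 and 50 time units, the quorum write finishes at 3 rather than 50, so one slow node no longer determines the write latency.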
To Be Open-Sourced Soon
• bthread
• bvar
• baidu-rpc
• libraft
Q&A