Ceph & RocksDB 변일수(Cloud Storage)



Page 1:

Ceph & RocksDB

변일수 (Cloud Storage Team)

Page 2:

Ceph Basics

Page 3:

Placement Group

[Diagram: myobject in mypool is hashed into one of the pool's placement groups: PG#1, PG#2, PG#3.]

hash(myobject) = 4; 4 % 3 (# of PGs) = 1 ← Target PG
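A minimal sketch of this object-to-PG mapping. It is illustrative only: Ceph actually uses its own rjenkins-based hash and PG masks rather than std::hash and a plain modulo.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

int main() {
    const std::string object_name = "myobject";
    const uint32_t num_pgs = 3;  // mypool has PG#1..PG#3

    // Stand-in hash; Ceph uses rjenkins hashing and stable PG masks instead.
    uint32_t h = static_cast<uint32_t>(std::hash<std::string>{}(object_name));
    uint32_t pg = h % num_pgs;   // e.g. hash = 4 -> 4 % 3 = 1 -> target PG

    std::cout << object_name << " maps to PG index " << pg << "\n";
    return 0;
}
```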

Page 4:

CRUSH

[Diagram: CRUSH maps mypool's PGs (PG#1, PG#2, PG#3) to OSDs (OSD#1, OSD#3, OSD#12).]

Page 5:

Recovery

[Diagram: recovery of mypool's PGs (PG#1, PG#2, PG#3) across OSD#1, OSD#3, OSD#12.]

Page 6:

OSD

[Diagram: an OSD handles Peering, Heartbeat, Replication, ???, … and persists data through an ObjectStore backend (FileStore or BlueStore).]

https://www.scan.co.uk/products/4tb-toshiba-mg04aca400e-enterprise-hard-drive-35-hdd-sata-iii-6gb-s-7200rpm-128mb-cache-oem

Page 7:

ObjectStore

https://ceph.com/community/new-luminous-bluestore/

Page 8:

OSD Transaction

Page 9:

Consistency is enforced here!

CRUSH

Page 10:

[Diagram: BlueStore write path: a Request enters a Shard's op_wq, the metadata is committed as a RocksDB sync transaction (write, flush to the SSD), then passes through kv_committing and the finisher queue, and the Ack is sent back via out_q (pipe).]

• To maintain ordering within each PG, ordering within each shard should be guaranteed.

BlueStore Transaction
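At the RocksDB API level, the "RocksDB sync transaction" step above boils down to committing all of a transaction's metadata mutations in one synced WriteBatch. A minimal sketch, assuming placeholder key names, values, and database path (BlueStore's real key schema and integration are more involved):

```cpp
#include <iostream>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/bluestore_sketch", &db);
    if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }

    // All metadata mutations of one OSD transaction go into a single batch,
    // so they become durable and visible atomically.
    rocksdb::WriteBatch batch;
    batch.Put("onode.myobject", "serialized onode (placeholder)");
    batch.Put("extents.myobject", "serialized extent map (placeholder)");

    // sync = true makes RocksDB flush the WAL to disk before acknowledging
    // the write: the sync-transaction step in the write path above.
    rocksdb::WriteOptions wopts;
    wopts.sync = true;
    s = db->Write(wopts, &batch);
    if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }

    delete db;
    return 0;
}
```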

Page 11:

• Metadata is stored in RocksDB.
• After the metadata has been stored atomically, the data becomes available to users.

[Diagram: RocksDB write path: a Request is appended to the log file (transaction log) and applied to the memtable; the memtable is later flushed to an SST file.]

[Diagram: group commit across three threads: Thread #1 joins the batch group as leader (JoinBatchGroup) and runs PreprocessWrite, WriteToWAL, MarkLogsSynced, LaunchParallelFollower, and ExitAsBatchGroupLeader; Threads #2 and #3 join via JoinBatchGroup/AwaitState and write to the memtable concurrently as followers (CompleteParallelWorker, ExitAsBatchGroupFollower), so a single WAL write covers the whole group.]

RocksDB Group Commit
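A minimal sketch of the situation group commit optimizes: several threads issuing synced writes at the same time. One of them becomes the batch-group leader and writes and syncs the WAL once for the whole group; the leader/follower mechanics are internal to RocksDB and not visible in the API. The database path and keys below are placeholders, not Ceph code.

```cpp
#include <iostream>
#include <string>
#include <thread>
#include <vector>

#include <rocksdb/db.h>

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/group_commit_sketch", &db);
    if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }

    rocksdb::WriteOptions wopts;
    wopts.sync = true;  // every write waits for the WAL to reach stable storage

    // Four concurrent writers: RocksDB batches their WAL records so that one
    // log append + fsync can cover several requests at once.
    std::vector<std::thread> writers;
    for (int t = 0; t < 4; ++t) {
        writers.emplace_back([&db, &wopts, t] {
            for (int i = 0; i < 1000; ++i) {
                std::string key = "key-" + std::to_string(t) + "-" + std::to_string(i);
                db->Put(wopts, key, "value");
            }
        });
    }
    for (auto& w : writers) w.join();

    delete db;
    return 0;
}
```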

Page 12:

[Chart: Thread Scalability.]

[Chart: Shard Scalability. IOPS from 0 to 60,000 for 1 shard vs 10 shards, with the WAL enabled and with disableWAL.]

[Diagram: multiple PUT operations share a single WAL in RocksDB.]
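The two configurations compared in the chart differ in a single write option. A minimal sketch (database path and keys are placeholders):

```cpp
#include <rocksdb/db.h>

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    if (!rocksdb::DB::Open(options, "/tmp/wal_sketch", &db).ok()) return 1;

    rocksdb::WriteOptions with_wal;     // default: every write is logged to the WAL
    rocksdb::WriteOptions without_wal;
    without_wal.disableWAL = true;      // memtable only; unflushed data is lost on a crash

    db->Put(with_wal, "k1", "v1");
    db->Put(without_wal, "k2", "v2");

    delete db;
    return 0;
}
```

Skipping the WAL removes the log write and fsync from the critical path, at the cost of the durability that Ceph's consistency model relies on.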

Page 13:

RadosGW

Page 14:

• RadosGW is an application of RADOS

RadosGW

[Diagram: RadosGW and CephFS are built on top of RADOS (OSD, Mon, Mgr).]

Page 15:

• All atomic operations depend on RocksDB

RadosGW Transaction

[Diagram: Put Object = Prepare Index → Write Data → Complete Index. The index object and the data object are stored in RADOS; the index entries are key-value pairs kept in RocksDB.]
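A minimal sketch of the three-phase put shown above. The helpers prepare_index, write_data, and complete_index are hypothetical placeholders, not RadosGW's actual API; in the real system the index updates are operations on the bucket index object, which the OSD persists through RocksDB.

```cpp
#include <iostream>
#include <string>

// Hypothetical stand-ins for the RADOS operations RadosGW issues.
void prepare_index(const std::string& bucket, const std::string& key) {
    std::cout << "prepare index entry for " << bucket << "/" << key << " (pending)\n";
}
void write_data(const std::string& bucket, const std::string& key, const std::string& data) {
    std::cout << "write data object for " << bucket << "/" << key
              << " (" << data.size() << " bytes)\n";
}
void complete_index(const std::string& bucket, const std::string& key) {
    std::cout << "complete index entry for " << bucket << "/" << key << "\n";
}

// Put Object = Prepare Index -> Write Data -> Complete Index.
// Only after the final index update does the object appear in listings.
void put_object(const std::string& bucket, const std::string& key, const std::string& data) {
    prepare_index(bucket, key);
    write_data(bucket, key, data);
    complete_index(bucket, key);
}

int main() {
    put_object("mybucket", "myobject", "hello");
    return 0;
}
```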

Page 16:

[Diagram: BlueStore write path, same as on Page 10: Request → Shard op_wq → RocksDB sync transaction (write, flush to SSD) → kv_committing → finisher queue → out_q (pipe) → Ack.]

bluestore_shard_finishers = true

• To maintain ordering within each PG, ordering within each shard should be guaranteed.

BlueStore Transaction

Page 17:

Performance Issue

Page 18:

Tail Latency

Page 19:

Performance Metrics

Page 20:

• "SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores" (ATC'19)

RocksDB Compaction Overhead
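The compaction interference described in the SILK paper can at least be observed from the application side using RocksDB's built-in statistics and stall counters. A minimal sketch (database path and workload are placeholders; this only measures the overhead, it does not remove it):

```cpp
#include <cstdint>
#include <iostream>
#include <string>

#include <rocksdb/db.h>
#include <rocksdb/statistics.h>

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    options.statistics = rocksdb::CreateDBStatistics();  // enable counters/histograms
    if (!rocksdb::DB::Open(options, "/tmp/compaction_sketch", &db).ok()) return 1;

    // ... run the write workload here ...

    // Time writers spent stalled because flush/compaction fell behind.
    uint64_t stall_micros = options.statistics->getTickerCount(rocksdb::STALL_MICROS);
    std::cout << "write stall time: " << stall_micros << " us\n";

    // Human-readable per-level compaction and stall summary.
    std::string stats;
    db->GetProperty("rocksdb.stats", &stats);
    std::cout << stats << std::endl;

    delete db;
    return 0;
}
```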

Page 21:

• Ceph highly depends on RocksDB
  • The strong consistency of Ceph is implemented using RocksDB transactions

• The performance of Ceph also depends on RocksDB
  • Especially for small I/O

• But RocksDB has some performance issues
  • Flushing the WAL
  • Compaction

[email protected]

Conclusions

Page 22:

THANK YOU