mycassandra: a cloud storage supporting both read heavy and write heavy workloads (systor 2012)

+ MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads. Shunsuke Nakamura (Tokyo Institute of Technology, NHN Japan), Kazuyuki Shudo (Tokyo Institute of Technology). Session 6 - Storage, SYSTOR 2012 (Haifa, Israel, Jun 4-6)

Upload: shunsuke-nakamura

Post on 15-Jan-2015


DESCRIPTION

These slides are the presentation given at SYSTOR 2012 in Haifa, Israel. http://www.research.ibm.com/haifa/conferences/systor2012/index.shtml

TRANSCRIPT

Page 1: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads

Shunsuke Nakamura (Tokyo Institute of Technology, NHN Japan), Kazuyuki Shudo (Tokyo Institute of Technology)

Session 6 - Storage, SYSTOR 2012 (Haifa, Israel, Jun 4-6)

Page 2: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Cloud Storage

  NoSQL, Key-Value Storage (KVS), Document-Oriented DB, GraphDB

  Examples: memcached, Google Bigtable, Amazon Dynamo, Amazon SimpleDB, Apache Cassandra, Voldemort, Ringo, Vpork, MongoDB, CouchDB, Tokyo Tyrant, Flare, ROMA, kumofs, Kai, Redis, LevelDB, Hadoop HBase, Hypertable, Yahoo! PNUTS, Scalaris, Dynomite, ThruDB, Neo4j, IBM ObjectGrid, Giraph, Oracle Coherence, and others (> 100 products)

  Characteristics: “limited functions, massive volume, high performance”

    Data access only by primary key

    No luxury features such as joins or global transactions

    Scalable to much larger data volumes and numbers of nodes

Distributed data stores processing large amounts of data

Page 3: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Design policies of cloud storages

  data model: key/value, multi-dimensional map, document, or graph

  performance: write vs. read

  latency vs. persistence

    latency – memory and disk utilization

    persistence – synchronous vs. asynchronous (snapshot)

  replication – synchronous vs. asynchronous

  consistency between replicas – strong vs. weak

  data partitioning – row vs. column

  distribution – master/slave vs. decentralized

There are many trade-offs.

Page 4: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+MyCassandra focuses on the performance trade-off

  data model: key/value vs. multi-dimensional map vs. document vs. graph

  performance: write vs. read (the focus of MyCassandra)

  latency vs. persistence

    latency – memory and disk utilization

    persistence – synchronous vs. asynchronous (snapshot)

  replication – synchronous vs. asynchronous

  consistency between replicas – strong vs. weak

  data partitioning – row vs. column

  distribution – master/slave vs. decentralized

Page 5: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Performance trade-off: write-optimized vs. read-optimized

  A cloud storage with persistence is designed to optimize either the write or the read workload.

  The storage engine determines which workload a cloud storage handles efficiently.

                 Bigtable, Cassandra, HBase                  MySQL, Yahoo! Sherpa
Indexing         Log-Structured Merge Tree [P. O’Neil ’96]   B-Trees [R. Bayer ’70]
Write to disk    append                                      random reads, writes
Read to disk     random reads + merge                        random read
Performance      write-optimized                             read-optimized
Storage engine   Bigtable clone                              MySQL

Page 6: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Performance trade-off: write-optimized vs. read-optimized

[Figure: write latency for a write-heavy workload, write-optimized vs. read-optimized storage. Source: Yahoo! Cloud Serving Benchmark, SOCC ’10]

Page 7: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Performance trade-off: write-optimized vs. read-optimized

[Figure: read latency for a read-heavy workload, write-optimized vs. read-optimized storage. Source: Yahoo! Cloud Serving Benchmark, SOCC ’10]

Page 8: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Research overview

  Contribution: a technique to build a cloud storage that performs well with both read and write workloads

  Steps:

    1. MyCassandra: a storage-engine-enabled Apache Cassandra

    2. MyCassandra Cluster: a heterogeneous cluster with different storage engines

[Figure: 1. MyCassandra selects either a read-optimized or a write-optimized storage engine; 2. MyCassandra Cluster is both read- and write-optimized]

Page 9: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Apache Cassandra

  Open-sourced by Facebook in 2008

  A top-level project in the Apache Software Foundation

  Features:

    Scalability up to hundreds of servers across multiple racks/datacenters

    High availability without SPOF by adopting a decentralized architecture

    Write-optimized

[Figure: clustering across multiple racks/DCs (dc1, dc2, dc3), with a replication strategy based on region]

Page 10: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Apache Cassandra

A decentralized cloud storage without SPOF

  Consistent Hashing (a decentralized algorithm): assigns identifiers to both nodes and data on a circular ID space.

  Roles of each node:

    • Proxy, serving clients

    • Primary/secondary data nodes

[Figure: circular ID space with nodes A, F, N, V, Z (A-Z: hash values). For a key/values pair with hash(key) = Q and number of replicas := 3, the data is stored on a primary node, secondary 1, and secondary 2.]
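The placement rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's code: the `Ring` class, the `ring_id` helper, the use of MD5, and the rule that the secondaries are simply the next nodes clockwise are all assumptions made for the example.

```python
import bisect
import hashlib


def ring_id(name: str) -> int:
    """Map a node name or a key onto the circular ID space (MD5 is an assumed choice)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hashing ring."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Sort nodes by their position on the ring.
        self.points = sorted((ring_id(n), n) for n in nodes)
        self.ids = [p for p, _ in self.points]

    def responsible_nodes(self, key):
        """Return [primary, secondary 1, secondary 2, ...] for a key."""
        # The primary is the first node at or after hash(key), wrapping around.
        start = bisect.bisect_left(self.ids, ring_id(key)) % len(self.points)
        count = min(self.replicas, len(self.points))
        return [self.points[(start + i) % len(self.points)][1] for i in range(count)]


ring = Ring(["A", "F", "N", "V", "Z"], replicas=3)
owners = ring.responsible_nodes("some-key")
print(owners)  # three distinct nodes: the primary first, then the secondaries
```

Any node can run this computation locally, which is what lets every node act as a proxy without a central directory.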

Page 11: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Apache Cassandra

Write-optimized storage engine, a Bigtable clone

  O(1) fast write operation

    Write an update to disk sequentially

      - Fast because of no random I/O to disk

      - Always writable because of no write lock

  Write path:

    1. Append an update to CommitLog for persistence

    2. Update Memtable, a map in memory, for quick reading

    3. Acknowledge the client

    4. Asynchronously flush Memtable to an SSTable

    5. Delete flushed data from CommitLog and Memtable

[Figure: write path. Memory: Memtable, where updates <k1, v1>, <k1, v2> become <k1, obj (v1+v2)>. Disk: CommitLog (sync) and SSTable 1, SSTable 2, SSTable 3 holding <k1,obj1>, <k1,obj2>, <k1,obj3> (async flush). Only sequential writes.]
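The five steps above can be sketched as a toy engine. This is an illustration under stated assumptions, not Cassandra's implementation: `WriteOptimizedStore`, its method names, the column-based row layout, and the `flush_threshold` parameter are hypothetical, and the "disk" structures are plain Python lists.

```python
class WriteOptimizedStore:
    """Toy Bigtable-style engine: commit log + memtable + immutable SSTables."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []   # stands in for the sequential on-disk log
        self.memtable = {}     # in-memory map: key -> {column: value}
        self.sstables = []     # each flush appends one immutable, sorted table
        self.flush_threshold = flush_threshold  # assumed tuning knob

    def write(self, key, column, value):
        self.commit_log.append((key, column, value))        # 1. append for persistence
        self.memtable.setdefault(key, {})[column] = value   # 2. update the memtable
        needs_flush = len(self.memtable) >= self.flush_threshold
        return "ACK", needs_flush                            # 3. acknowledge the client

    def flush(self):
        # 4. A real engine runs this asynchronously; here it is called by hand.
        self.sstables.append(dict(sorted(self.memtable.items())))
        # 5. Flushed data is dropped from the commit log and the memtable.
        self.commit_log.clear()
        self.memtable.clear()


store = WriteOptimizedStore()
store.write("k1", "c1", "v1")
store.write("k1", "c2", "v2")  # merged in the memtable as <k1, {c1: v1, c2: v2}>
store.flush()
print(store.sstables)  # [{'k1': {'c1': 'v1', 'c2': 'v2'}}]
```

Neither `write` nor `flush` ever modifies an existing SSTable, which is why the engine only needs sequential writes.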

Page 12: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Apache Cassandra

Write-optimized storage engine, a Bigtable clone

  Slow read operation

    Read data from Memtable and multiple SSTables, and merge them

      - Slow because of multiple random I/Os on disk

[Figure: read path. <k1,obj> is produced by merging Memtable (memory) with <k1,obj1>, <k1,obj2>, <k1,obj3> from SSTable 1, SSTable 2, SSTable 3 (disk), which costs multiple random I/Os.]
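The merge step can be illustrated with a short sketch. It assumes, for the example only, that each source maps a key to columns carrying a `(timestamp, value)` pair and that the newest timestamp wins; the function name `read` and the sample data are hypothetical.

```python
def read(key, memtable, sstables):
    """Merge all versions of a row; the newest timestamp wins per column."""
    merged = {}
    # The memtable is in memory; every SSTable lookup may cost a random disk I/O.
    for source in [memtable, *sstables]:
        for column, (timestamp, value) in source.get(key, {}).items():
            if column not in merged or timestamp > merged[column][0]:
                merged[column] = (timestamp, value)
    return {column: value for column, (_, value) in merged.items()}


sstable1 = {"k1": {"name": (1, "old"), "city": (1, "Tokyo")}}
sstable2 = {"k1": {"name": (2, "mid")}}
sstable3 = {"k1": {"age": (3, 30)}}
memtable = {"k1": {"name": (4, "new")}}

row = read("k1", memtable, [sstable1, sstable2, sstable3])
print(row)  # {'name': 'new', 'city': 'Tokyo', 'age': 30}
```

The read has to consult every table that may hold part of the row, so its cost grows with the number of SSTables, while the write cost does not.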

Page 13: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Performance of original Cassandra

  YCSB results show:

    Average: write is 9 x as fast as read (write avg. 0.69 ms, read avg. 6.16 ms).

    99.9%ile: write is 43.5 x as fast as read (write: 2.0 ms, read: 86.9 ms).

Write performance is much higher.

[Figure: latency distributions for write and read; x-axis: Latency (ms), y-axis: Number of operations]

Page 14: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+1. Storage Engine Support

[Figure: 1. MyCassandra selects either a read-optimized or a write-optimized storage engine]

Page 15: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+MyCassandra: A modular cloud storage

  Storage engine feature inspired by MySQL

    An independent and pluggable component

    Performs disk I/O

  A cloud storage can be either write-optimized or read-optimized by selecting a storage engine

  Keeps Cassandra’s original distribution architecture and data model

[Figure: Cassandra (decentralized: Consistent Hashing, Gossip Protocol; storage: Bigtable) combined with MySQL (storage engine selectable: InnoDB, MyISAM, Memory, …) gives MyCassandra (decentralized + storage engine selectable; supported storage engines: Bigtable, MySQL, Redis, …)]

Page 16: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+MyCassandra implementation

  Cassandra’s original distribution architecture is kept

  A Storage Engine Interface is introduced

  Each storage engine implements the Storage Engine Interface
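A pluggable engine layer of this kind could look roughly as follows. The slides do not show the actual interface, so every name here (`StorageEngine`, `put`, `get`, `open_node`, and the two toy engines) is hypothetical and stands in for whatever MyCassandra really defines.

```python
from abc import ABC, abstractmethod


class StorageEngine(ABC):
    """Hypothetical interface; method names are assumptions, not MyCassandra's API."""

    @abstractmethod
    def put(self, key, value): ...

    @abstractmethod
    def get(self, key): ...


class InMemoryEngine(StorageEngine):
    """Stand-in for an in-memory engine such as Redis."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class AppendLogEngine(StorageEngine):
    """Stand-in for a write-optimized engine: writes append, reads scan backwards."""

    def __init__(self):
        self._log = []

    def put(self, key, value):
        self._log.append((key, value))

    def get(self, key):
        for k, v in reversed(self._log):
            if k == key:
                return v
        return None


ENGINES = {"memory": InMemoryEngine, "log": AppendLogEngine}


def open_node(engine_name):
    """Pick the storage engine by name, as a node's configuration might."""
    return ENGINES[engine_name]()


node = open_node("log")
node.put("k1", "v1")
print(node.get("k1"))  # v1
```

The distribution layer only ever calls `put` and `get`, so swapping the engine changes the performance profile of a node without touching routing or replication.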

Page 17: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Performance of each storage engine

  Storage engines:

    Bigtable: write-optimized (original Cassandra 0.7.5)

    MySQL: read-optimized (MySQL 6.0 with InnoDB, JDBC API, stored procedure)

    Redis: in-memory KVS (Redis 2.2.8)

  Setup: 6 nodes, Crucial’s SSD, 6 GB of 8 GB memory allocated

  Data set: 1 KB x 36 million records

[Figure: performance of each storage engine by workload; annotations: x 11.79, x 9.87]

Page 18: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+2. Heterogeneous cluster of different storage engines

[Figure: 2. MyCassandra Cluster is both read- and write-optimized]

Page 19: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Basic idea

  • W: write-optimized • R: read-optimized • RW: in-memory

  Replicate data on nodes with different storage engines

  Route a query to the nodes that process it efficiently

    Synchronously route to nodes that process it quickly

    Asynchronously route to nodes that process it slowly

    → Exploit each node’s advantage

  Furthermore, maintain consistency between replicas as well as the original Cassandra does

    Quorum Protocol: (write agreements) + (read agreements) > (num of replicas)

    = Guarantee retrieval of the latest data

    Consequence: at least one node must process both read and write queries synchronously and quickly

    → In-memory nodes play this role.

[Figure: write and read queries routed synchronously (sync) or asynchronously (async) to W, R, and RW nodes]
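The quorum condition and its consequence can be checked by brute force. The sketch below is only an illustration; the function name `quorums_overlap` is made up for this example, and the replica labels follow the W/R/RW legend of the slide.

```python
from itertools import combinations


def quorums_overlap(nodes, write_agreements, read_agreements):
    """Check that every possible write set shares a node with every possible read set."""
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, write_agreements)
        for rs in combinations(nodes, read_agreements)
    )


replicas = ["W", "R", "RW"]  # one replica per storage engine type
print(quorums_overlap(replicas, 2, 2))  # True:  2 + 2 > 3
print(quorums_overlap(replicas, 1, 2))  # False: 1 + 2 = 3

# With synchronous writes on the fast writers and synchronous reads on the
# fast readers, the overlap is exactly the in-memory node.
sync_writes = {"W", "RW"}
sync_reads = {"R", "RW"}
print(sync_writes & sync_reads)  # {'RW'}
```

The last line shows why the in-memory node is needed: it is the only replica that can sit in both the fast write set and the fast read set.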

Page 20: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Cluster design

  • W: write-optimized • R: read-optimized • RW: in-memory

  Combine nodes with different storage engines

    write-optimized (W), read-optimized (R), in-memory (RW)

  Disseminate the storage engine type of each node

    The type is attached to gossip messages

  Place replicas on nodes with different storage engines

    The proxy (any node that receives the request) selects the storing nodes:

      1. The primary node, determined by the queried key

      2. N - 1 secondary nodes with different storage engines

  Multiple nodes share a single server for load balancing

[Figure: cluster configuration (N=3). W, R, and RW nodes exchange engine types via gossip; a proxy (any node) selects the responsible nodes: primary, secondary1, secondary2.]
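The selection rule can be sketched as below. This is a guess at one reasonable realization, not the authors' code: the clockwise walk for secondaries, the MD5-based `ring_id`, the function name `select_storing_nodes`, and the node names are all assumptions.

```python
import hashlib


def ring_id(name):
    """Position of a node or key on the ring (MD5 is an assumed choice)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


def select_storing_nodes(key, engine_of, n=3):
    """engine_of maps node name -> engine type learned via gossip ('W', 'R', 'RW')."""
    ring = sorted(engine_of, key=ring_id)
    key_id = ring_id(key)
    # 1. Primary: first node at or after hash(key), wrapping around the ring.
    start = next((i for i, node in enumerate(ring) if ring_id(node) >= key_id), 0)
    chosen = [ring[start]]
    used = {engine_of[ring[start]]}
    # 2. Secondaries: walk clockwise, skipping engine types that are already used.
    for step in range(1, len(ring)):
        node = ring[(start + step) % len(ring)]
        if len(chosen) < n and engine_of[node] not in used:
            chosen.append(node)
            used.add(engine_of[node])
    return chosen


engine_of = {"n1": "W", "n2": "R", "n3": "RW", "n4": "W", "n5": "R", "n6": "RW"}
nodes = select_storing_nodes("user:42", engine_of)
print([(node, engine_of[node]) for node in nodes])  # one W, one R, one RW node
```

Because the engine types travel with gossip, each proxy can evaluate this locally and still arrive at the same replica set as every other proxy.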

Page 21: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Process for a write access

  • W: write-optimized • R: read-optimized • RW: in-memory

  • Quorum parameters: N = 3 (replicas), write agreements = read agreements = 2

  • Num. of replicas W:RW:R = 1:1:1

  1) A proxy receives a write query from a client and routes it to the nodes storing the record.

  2) The proxy waits for ACKs. W and RW nodes usually reply quickly.

  3-a) If the writes succeed and the proxy receives two ACKs, it returns a success message.

  3-b) If a data node fails to write, the proxy waits for ACKs including those from R nodes, then returns a success message.

  4) After returning, the proxy asynchronously waits for ACKs from the remaining nodes.

  Write latency: max (W, RW)

[Figure: write for a single record. Client → Proxy → nodes storing the record (W, RW, R). The proxy waits for two ACKs for the write and returns; the write to the remaining node is asynchronous.]
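The latency rule "max (W, RW)" follows from waiting for the second ACK out of three. A small sketch, using made-up reply times rather than measured values, with hypothetical helper names `write_latency` and `acked_first`:

```python
def write_latency(reply_ms, agreements=2):
    """Latency seen by the client: time until the agreements-th ACK arrives."""
    return sorted(reply_ms.values())[agreements - 1]


def acked_first(reply_ms, agreements=2):
    """Nodes whose ACKs complete the quorum; the rest are awaited asynchronously."""
    return sorted(reply_ms, key=reply_ms.get)[:agreements]


# Hypothetical per-node write times in milliseconds (not measured values).
reply_ms = {"W": 0.7, "RW": 0.5, "R": 6.0}
print(acked_first(reply_ms))    # ['RW', 'W']
print(write_latency(reply_ms))  # 0.7, i.e. max(W, RW)

# Step 3-b: if the RW node fails, the proxy has to wait for the R node as well.
reply_ms_after_failure = {"W": 0.7, "R": 6.0}
print(write_latency(reply_ms_after_failure))  # 6.0
```

In the normal case the slow R node never appears on the client's critical path; it only matters when one of the fast nodes fails.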

Page 22: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Process for a read access

  • W: write-optimized • R: read-optimized • RW: in-memory

  • Quorum parameters: N = 3 (replicas), write agreements = read agreements = 2

  • Num. of replicas W:RW:R = 1:1:1

  1) A proxy receives a read query and routes it to the storing nodes.

  2) The proxy waits for replies. R and RW nodes reply quickly.

  3-a) If the returned values are consistent, the proxy returns the value.

  3-b) If the values are mismatched, the proxy waits for consistent values, including those from W nodes.

  4) After returning, the proxy waits for replies from the remaining nodes. If the proxy notices inconsistent values, it asynchronously updates them to the consistent one. This is done by Cassandra’s ReadRepair feature.

  Read latency: max (R, RW)

[Figure: read for a single record. Client → Proxy → nodes storing the record (W, RW, R). The proxy checks consistency and returns the result; the consistency check against the remaining node is asynchronous.]
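The read steps can be simulated in one pass over the replies. This sketch assumes each reply carries a version timestamp and that two matching replies count as "consistent"; the function `quorum_read` is hypothetical, and it processes the whole reply list at once instead of truly returning early.

```python
def quorum_read(replies, agreements=2):
    """replies: (node, timestamp, value) tuples in arrival order.

    Returns (value sent to the client, nodes that need read repair).
    """
    seen = []
    result = None
    for node, timestamp, value in replies:
        seen.append((node, timestamp, value))
        # Steps 3-a / 3-b: the answer is fixed once enough replies carry the same version.
        matching = [r for r in seen if r[1:] == (timestamp, value)]
        if result is None and len(matching) >= agreements:
            result = value
    # Step 4: once all replies are in, replicas older than the newest version are repaired.
    newest_timestamp = max(timestamp for _, timestamp, _ in seen)
    stale = [node for node, timestamp, _ in seen if timestamp < newest_timestamp]
    return result, stale


# R and RW answer first and agree: nothing to repair.
print(quorum_read([("R", 2, "new"), ("RW", 2, "new"), ("W", 2, "new")]))  # ('new', [])

# R is stale: the answer is fixed only after W replies, then R is repaired.
print(quorum_read([("R", 1, "old"), ("RW", 2, "new"), ("W", 2, "new")]))  # ('new', ['R'])
```

The second call mirrors the situation this design produces most often: the R node received its write asynchronously and lags behind the W and RW nodes.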

Page 23: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Performance Evaluation

  Targets

    MyCassandra Cluster: 3 different nodes/server x 6 servers

    Cassandra: 1 node/server x 6 servers

  Quorum parameters: N = 3 (replicas), write agreements = read agreements = 2

  Storage engines: Bigtable (W), MySQL / InnoDB (R), Redis (RW)

  Yahoo! Cloud Serving Benchmark (YCSB) [SOCC ’10]

    1. Load data (1 KB records, 10 x 100-byte columns) from a YCSB client

    2. Warm up

    3. Run the benchmark and measure response times from a client

Demonstrate that a heterogeneous cluster performs well with both read- and write-heavy workloads

Page 24: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+YCSB workloads

Workload      Application example   Operation ratio          Record selection
Write-Only    Log                   Read: 0%, Write: 100%    Zipfian
Write-Heavy   Session store         Read: 50%, Write: 50%    Zipfian
Read-Heavy    Photo tagging         Read: 95%, Write: 5%     Zipfian
Read-Only     Cache                 Read: 100%, Write: 0%    Zipfian

Rows are ordered from write heavy (top) to read heavy (bottom).

Zipfian distribution: the access frequency of each datum is determined by its popularity, not by freshness.
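A Zipfian request stream can be approximated with the standard library alone. This is a simplified sketch, not YCSB's generator: the function name `zipfian_keys`, the exponent of 1.0, and the fixed seed are assumptions chosen for the example.

```python
import random
from collections import Counter


def zipfian_keys(num_records, num_requests, exponent=1.0, seed=0):
    """Sample record indices where rank r is chosen with weight 1 / r**exponent."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_records + 1)]
    return rng.choices(range(num_records), weights=weights, k=num_requests)


requests = zipfian_keys(num_records=1000, num_requests=10000)
counts = Counter(requests)
print(counts.most_common(3))  # a few popular records receive most of the requests
```

The weights depend only on a record's popularity rank and never on when it was written, which matches the note above that popularity, not freshness, drives access frequency.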

Page 25: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Write/Read latency (Response time)

[Figure: avg. write latency (ms) and avg. read latency (ms), Cassandra vs. MyCassandra Cluster, for the Write-Only (write:100%, read:0%), Write-Heavy (write:50%, read:50%), Read-Heavy (write:5%, read:95%), and Read-Only (write:0%, read:100%) workloads]

  Write latency annotations: + 42.5%, + 59.5%, + 69.5%; + 0.57 ms (max)

  Read latency annotations: - 88.8%, - 90.4%, - 83.3%; - 26.5 ms (max)

  Read latency is up to 90.4% lower in the Read-Only workload

  Performs well with MySQL + Redis

Page 26: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Throughput

[Figure: QPS for 40 clients (query/sec), Cassandra vs. MyCassandra Cluster, for the Write-Only [100:0], Write-Heavy [50:50], Read-Heavy [5:95], and Read-Only [0:100] workloads ([write:read]); annotations: x 4.07, x 11.00, x 2.16, x 0.87]

  • 11.0 times as high as Cassandra in the Read-Only workload

  • Write performance is comparable with Cassandra.

Page 27: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Conclusion

  A cloud storage supporting both write-heavy and read-heavy workloads by combining nodes with different storage engines.

  MyCassandra Cluster achieved better throughput than the original Cassandra on read-heavy workloads.

    With a read-heavy workload:

      Read latency: up to 90.4% lower

      Throughput: up to 11.0 times as high

Page 28: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Related Work

  Indexing algorithms whose goals include achieving both write and read performance

    FD-Tree: Tree Indexing on Flash Disks, VLDB ’10

    bLSM: A General Purpose Log Structured Merge Tree, SIGMOD ’12

    Fractal-Tree: implemented in TokuDB (a MySQL storage engine)

  Modular data stores:

    MySQL

    Anvil, SOSP ’09

    Cloudy, VLDB ’10

    Dynamo, SOSP ’07

    Fractured Mirrors:

  MyCassandra, SYSTOR ’12: read vs. write

Page 29: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Discussion 1. The slightly higher write latency

  The cause is load balancing.

    Cassandra: writes go to any of the N nodes, so sync operations are equally distributed.

    MyCassandra Cluster: writes go to the specified W and RW nodes, so sync operations are fixed.

  However, this cost is well worth the improvement in read performance.

[Figure: Cassandra vs. MyCassandra Cluster. In Cassandra, sync write and read operations are spread equally over the replicas; in MyCassandra Cluster, they are fixed to particular W, R, and RW nodes.]

Page 30: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Discussion 2. In-memory node

Q. Memory overflow

A. The in-memory node acts as an LRU-like cache. Swapped-out data is recovered from the other, persistent nodes by read repair.

Q. Fault tolerance

A. 1) Write to an alternative node; when the failed node recovers, it resolves inconsistency using values from the alternative node.

2) Asynchronous snapshot (a Redis feature)

Q. All nodes in memory

A. In this case the cluster’s capacity is limited by the memory capacity.

Page 31: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)

+Open-source release

Page 32: MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)
