NoSQL: Cassandra vs. HBase

YCSB: Yahoo! Cloud Serving Benchmark
Scalable Distributed Systems
Antonio L. Severien
João Rosa

Upload: antonio-severien

Post on 10-May-2015



Overview

• Distributed Databases
• Cassandra
• HBase
• YCSB General View
• YCSB Details
• Amazon EC2
• YCSB Results
• YCSB Future
• Conclusions
• References


Distributed Databases

Traditional RDBMS
• ACID transactions
• Query language (SQL)
• Data tied to the modeling (hard to analyze)
• Scalable to a limit

Distributed Databases
• Not ACID
• Not relational
• Column oriented (key-value)
• CAP (Consistency, Availability, Partition tolerance)
• Big Data (massively scalable)


Distributed Databases
• Sherpa/PNUTS
• BigTable
  • HBase, Hypertable, HTable
• Megastore
• Azure
• Cassandra
• Amazon Web Services
  • S3, SimpleDB, EBS
• CouchDB
• Voldemort
• Dynomite
• Tokyo
• Redis
• MongoDB


Distributed Databases

• NoSQL databases have different designs and architectures

Cassandra
• Thrift
• Gossip
• Token ring

HBase
• HDFS
• ZooKeeper
• Hadoop (MapReduce)

BigTable
• GFS
• Chubby (lock service)
• MapReduce


Cassandra

• Highlights
  • High availability
  • Incremental scalability
  • Eventually consistent
  • Tradeoffs between consistency and latency
  • Minimal administration
  • No SPOF (Single Point of Failure)


Cassandra

• CAP-aware
  • Cassandra favors Availability and Partition tolerance (AP) and is eventually consistent
  • Providing strong consistency in Cassandra increases latency
• Partitioning
  • Token oriented
• Explicit replication
  • Replication factor ≤ total nodes
• High-level clients
  • Python, Java, C#, .NET, Scala, Ruby, PHP, Erlang, Haskell, etc.
  • Thrift driver-level interface
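The token-oriented partitioning and explicit replication listed above can be illustrated with a minimal sketch. This is not Cassandra's implementation: the `TokenRing` class, the MD5-based `token_for` function and the node names are assumptions made for illustration. Each node owns one token on a ring; a key is hashed to a token, assigned to the first node whose token is at or after it (wrapping around), and copied to the next nodes on the ring.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Hash a key to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class TokenRing:
    """Toy token ring: each node owns one token; keys go to the next token clockwise."""

    def __init__(self, nodes, replication_factor):
        # Mirrors the slide's constraint: replication factor <= total nodes.
        if replication_factor > len(nodes):
            raise ValueError("replication factor must be <= total nodes")
        self.replication_factor = replication_factor
        # Sort (token, node) pairs so the ring can be searched with bisect.
        self.ring = sorted((token_for(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, key):
        """Return the nodes holding `key`: the owner plus the next nodes on the ring."""
        start = bisect.bisect_left(self.tokens, token_for(key)) % len(self.ring)
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replication_factor)
        ]


ring = TokenRing(["node-a", "node-b", "node-c"], replication_factor=2)
print(ring.replicas("user1001"))
```

With three nodes and a replication factor of 2, every key lands on two distinct nodes, so losing one node still leaves a copy reachable.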


Cassandra

• Data Model
  • Cluster: the machines (nodes) in a logical Cassandra instance; can contain multiple keyspaces
  • Keyspace: a namespace for ColumnFamilies
  • ColumnFamilies: contain multiple columns, each with a name, value and timestamp, referenced by row keys; analogous to a table in an RDBMS
  • SuperColumns: columns with subcolumns
  • Rows
  • Columns

Example rows:

keyA | Column1 | Column2 | Column3
keyB | Column5 | Column6 | Column10

Column structure:

Byte[] Name
Byte[] Value
I64 Timestamp
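The data model above can be sketched as nested structures. This is a minimal illustration, not Cassandra's storage code: the `Column` and `ColumnFamily` classes and the "Users" column family are hypothetical, and resolving conflicting writes by keeping the newest timestamp is an assumption stated in the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # i64 on the slide


class ColumnFamily:
    """Rows keyed by row key; each row maps column names to Column objects."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def insert(self, row_key, column):
        row = self.rows.setdefault(row_key, {})
        current = row.get(column.name)
        # Assumption: on conflict, the column with the newest timestamp wins.
        if current is None or column.timestamp >= current.timestamp:
            row[column.name] = column

    def get(self, row_key, column_name):
        return self.rows.get(row_key, {}).get(column_name)


# A keyspace is modelled here as a plain dict of column families.
keyspace = {"Users": ColumnFamily("Users")}
users = keyspace["Users"]
users.insert("keyA", Column(b"Column1", b"alice", timestamp=1))
users.insert("keyA", Column(b"Column1", b"alicia", timestamp=2))
users.insert("keyB", Column(b"Column5", b"bob", timestamp=1))
print(users.get("keyA", b"Column1").value)  # prints b'alicia'
```

As in the example rows, keyA and keyB do not need to share the same set of columns.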

Cassandra

[Figures: Partitioning; Replication]

HBase

“HBase is more a datastore than a database”

• It lacks many of the features of an RDBMS

• Distributed and scalable big data store.

• Regions model

• Strong consistency

HBase

Built on top of the Hadoop Distributed File System (HDFS)


HBase

• The NameNode is responsible for maintaining the filesystem metadata.

• The DataNodes are responsible for storing HDFS blocks.

Note: In our case study, we were only interested in the HDFS layer.

[Figures: HBase architecture diagrams; surviving labels: Namenode, Datanodes]

HBase

• Data is stored in HBase tables.

• Tables are made of rows and columns.

• All columns belong to a particular column family.

Important note: All members of a column family are stored together.

• Queries that stay within a single column family therefore perform better.
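The point about column families being stored together can be pictured with a small sketch. It is a toy model, not HBase code: the `ToyTable` class, the "family:qualifier" column naming used here, and the sample rows are assumptions for illustration. Each column family gets its own store, so reading one family never touches the data of another.

```python
from collections import defaultdict


class ToyTable:
    """Toy table in which each column family is kept in its own store."""

    def __init__(self, families):
        # One store per column family: family -> row key -> qualifier -> value.
        self.stores = {f: defaultdict(dict) for f in families}

    def put(self, row, column, value):
        family, qualifier = column.split(":", 1)
        if family not in self.stores:
            raise KeyError(f"unknown column family: {family}")
        self.stores[family][row][qualifier] = value

    def get_family(self, row, family):
        """Read every column of one family; only that family's store is consulted."""
        return dict(self.stores[family].get(row, {}))


table = ToyTable(["info", "stats"])
table.put("row1", "info:name", "alice")
table.put("row1", "info:email", "alice@example.org")
table.put("row1", "stats:visits", 42)
print(table.get_family("row1", "info"))
```

Grouping columns that are read together into the same family keeps such a read confined to a single store.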


YCSB General View

• Which is the best NoSQL DB?
• How to compare?

• Yahoo! Cloud Serving Benchmark (YCSB)
  • Benchmarking tool
  • Evaluates the performance of key-value and cloud DBs on a common set of workloads
  • Client: an extensible workload generator

• Yahoo! Research
  • Brian F. Cooper
  • Joint work with Adam Silberstein, Erwin Tam, Raghu Ramakrishnan and Russell Sears

YCSB Details

• How does it work?

[Figure: YCSB client architecture. The YCSB Client contains the Workload Executor, Client Threads, Statistics and the DB Interface Layer, which talks to the Cloud Serving Store.]

Workload file
• Read/write mix
• Record size
• Popularity distribution
• …

Command line
• DB to use
• Workload to use
• Target throughput
• Number of threads
• …
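The components named on this slide can be mirrored in a small sketch. YCSB itself is a Java tool; the Python below is only an analogy, and every name in it (`DB`, `InMemoryDB`, `Statistics`, `client_thread`, `run_workload`) is hypothetical. It shows a workload executor starting client threads that call a pluggable DB interface and record latencies.

```python
import random
import threading
import time
from abc import ABC, abstractmethod


class DB(ABC):
    """Stand-in for the DB interface layer: one subclass per data store."""

    @abstractmethod
    def read(self, key): ...

    @abstractmethod
    def update(self, key, value): ...


class InMemoryDB(DB):
    """Hypothetical binding that keeps records in a dict, for demonstration."""

    def __init__(self):
        self.data = {}
        self.lock = threading.Lock()

    def read(self, key):
        with self.lock:
            return self.data.get(key)

    def update(self, key, value):
        with self.lock:
            self.data[key] = value


class Statistics:
    """Collects per-operation latencies from all client threads."""

    def __init__(self):
        self.latencies = {"READ": [], "UPDATE": []}
        self.lock = threading.Lock()

    def record(self, op, seconds):
        with self.lock:
            self.latencies[op].append(seconds)


def client_thread(db, stats, operations, read_proportion, record_count, seed):
    """Issue a fixed number of operations and time each one."""
    rng = random.Random(seed)
    for _ in range(operations):
        key = f"user{rng.randrange(record_count)}"
        op = "READ" if rng.random() < read_proportion else "UPDATE"
        start = time.perf_counter()
        if op == "READ":
            db.read(key)
        else:
            db.update(key, "x" * 100)
        stats.record(op, time.perf_counter() - start)


def run_workload(db, threads=4, operations_per_thread=250,
                 read_proportion=0.5, record_count=1000):
    """Workload executor: start the client threads and wait for them to finish."""
    stats = Statistics()
    workers = [
        threading.Thread(target=client_thread,
                         args=(db, stats, operations_per_thread,
                               read_proportion, record_count, i))
        for i in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return stats


stats = run_workload(InMemoryDB())
total = sum(len(v) for v in stats.latencies.values())
print(total)  # prints 1000
```

Swapping `InMemoryDB` for another `DB` subclass is the only change needed to point the same workload at a different store, which is the idea behind the interface layer.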


YCSB Details

Benchmark Tiers

• Performance
  • Measure the latency/throughput curve
  • Increase throughput until saturation
• Scalability
  • Scale up: increase hardware, data size and throughput proportionally
  • Elastic speedup: add servers while running a workload
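The performance tier ("increase throughput until saturation") can be sketched as a loop over target throughputs. This is an illustration under stated assumptions, not YCSB code: `find_saturation` and its 95% tolerance are hypothetical, and `fake_measure` simulates a store with a fixed capacity instead of running a real benchmark, so its numbers are not measurements.

```python
def find_saturation(measure, start=1000, step=1000, max_target=20000, tolerance=0.95):
    """Raise the target throughput until the store no longer keeps up.

    `measure(target)` must return the achieved throughput in ops/sec.
    Returns the (target, achieved) points and the first saturated target.
    """
    curve = []
    for target in range(start, max_target + 1, step):
        achieved = measure(target)
        curve.append((target, achieved))
        # Saturation: the store delivers noticeably less than was requested.
        if achieved < tolerance * target:
            return curve, target
    return curve, None


def fake_measure(target, capacity=5500):
    """Hypothetical stand-in for a benchmark run: a store capped at `capacity`."""
    return min(target, capacity)


curve, saturated_at = find_saturation(fake_measure)
print(saturated_at)  # prints 6000
```

In a real experiment, `measure` would launch a run with the given target throughput and return the reported overall throughput.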


YCSB Details

Load phase

- Load the database
$ ycsb load cassandra-10 -p hosts=127.0.0.1 -P workloadX

Transactions phase

- Executes the workload
$ ycsb run cassandra-10 -p hosts=127.0.0.1 -P workloadX

Random Load Distribution


YCSB Details

# Yahoo! Cloud System Benchmark
# Workload A: Update heavy workload
# Application example: Session store recording recent actions
#
# Read/update ratio: 50/50
# Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
# Request distribution: zipfian

recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload

readallfields=true

readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0

requestdistribution=zipfian
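To show how these properties shape a run, here is a minimal sketch that parses the key=value pairs and generates operations with the configured mix and a Zipf-like popularity distribution. It does not reproduce YCSB's generators: `parse_properties`, `zipfian_weights` and `generate_operations` are hypothetical helpers, the exponent `s=0.99` is an assumed default, and key selection for inserts and scans is simplified.

```python
import random

WORKLOAD_TEXT = """
recordcount=1000
operationcount=1000
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
"""


def parse_properties(text):
    """Parse key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def zipfian_weights(n, s=0.99):
    """Unnormalised Zipf weights: rank r (1-based) gets weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]


def generate_operations(props, seed=0):
    """Yield (operation, record index) pairs following the workload mix."""
    rng = random.Random(seed)
    ops = ["READ", "UPDATE", "SCAN", "INSERT"]
    mix = [float(props[f"{op.lower()}proportion"]) for op in ops]
    record_count = int(props["recordcount"])
    weights = zipfian_weights(record_count)
    for _ in range(int(props["operationcount"])):
        # Pick the operation type by proportion, then a record by popularity.
        op = rng.choices(ops, weights=mix)[0]
        record = rng.choices(range(record_count), weights=weights)[0]
        yield op, record


props = parse_properties(WORKLOAD_TEXT)
operations = list(generate_operations(props))
print(len(operations), operations[:3])
```

With a zipfian request distribution a few records receive most of the requests, which is what makes the session-store example realistic.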


YCSB Details

• Execution parameters

$ ./bin/ycsb run cassandra-10 -P workloads/workloada -s -threads 10 -target 100 > transactions.dat

[OVERALL],RunTime(ms), 10110
[OVERALL],Throughput(ops/sec), 98.91196834817013
[UPDATE], Operations, 491
[UPDATE], AverageLatency(ms), 0.054989816700611
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 1
[UPDATE], 95thPercentileLatency(ms), 1
[UPDATE], 99thPercentileLatency(ms), 1
[UPDATE], Return=0, 491
[UPDATE], 0, 464
[UPDATE], 1, 27
[UPDATE], 2, 0
[UPDATE], 3, 0
[UPDATE], 4, 0
...


YCSB Details

$ ./bin/ycsb run basic -P workloads/workloada -P large.dat -s -threads 10 -target 100 -p measurementtype=timeseries -p timeseries.granularity=2000 > transactions.dat

[OVERALL],RunTime(ms), 10077
[OVERALL],Throughput(ops/sec), 9923.58836955443
[UPDATE], Operations, 50396
[UPDATE], AverageLatency(ms), 0.04339630129375347
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 338
[UPDATE], Return=0, 50396
[UPDATE], 0, 0.10264765784114054
[UPDATE], 2000, 0.026989343690867442
[UPDATE], 4000, 0.0352882703777336
[UPDATE], 6000, 0.004238958990536277
[UPDATE], 8000, 0.052813085033008175
[UPDATE], 10000, 0.0
[READ], Operations, 49604
[READ], AverageLatency(ms), 0.038242883638416256
[READ], MinLatency(ms), 0
[READ], MaxLatency(ms), 230
[READ], Return=0, 49604
[READ], 0, 0.08997245741099663
[READ], 2000, 0.02207505518763797
[READ], 4000, 0.03188493260913297
[READ], 6000, 0.004869141813755326
[READ], 8000, 0.04355329949238579
[READ], 10000, 0.005405405405405406
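Output in this "[SECTION], metric, value" layout is easy to post-process. The sketch below is one possible way to do it: `parse_ycsb_output` is a hypothetical helper, and `SAMPLE_OUTPUT` copies a few lines from the run shown above. It also checks that the reported throughput equals total operations divided by run time.

```python
SAMPLE_OUTPUT = """\
[OVERALL],RunTime(ms), 10077
[OVERALL],Throughput(ops/sec), 9923.58836955443
[UPDATE], Operations, 50396
[UPDATE], AverageLatency(ms), 0.04339630129375347
[READ], Operations, 49604
[READ], AverageLatency(ms), 0.038242883638416256
"""


def parse_ycsb_output(text):
    """Return {section: {metric: value}} from '[SECTION], metric, value' lines."""
    results = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not parts[0].startswith("["):
            continue  # skip anything that is not in the three-field format
        section = parts[0].strip("[]")
        try:
            value = float(parts[2])
        except ValueError:
            continue  # skip lines whose last field is not numeric
        results.setdefault(section, {})[parts[1]] = value
    return results


results = parse_ycsb_output(SAMPLE_OUTPUT)
total_ops = results["UPDATE"]["Operations"] + results["READ"]["Operations"]
runtime_s = results["OVERALL"]["RunTime(ms)"] / 1000.0
# Throughput should equal total operations divided by run time.
print(total_ops / runtime_s, results["OVERALL"]["Throughput(ops/sec)"])
```

Here 50,396 updates plus 49,604 reads give 100,000 operations in 10.077 seconds, which matches the reported figure of about 9,923.6 ops/sec.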


YCSB Details

Status Output


Amazon EC2 Configuration

Large Instance
• 7.5 GB memory
• 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
• 850 GB instance storage
• 64-bit platform
• I/O Performance: High
• API name: m1.large

Experiment Set-up
• Cassandra cluster: 3 nodes + 1 node (elasticity)
• HBase cluster: 3 nodes

Amazon EC2 Usage

Load phase: 60,000,000 records of 1 KB

[Figures: EC2 usage during the load phase, shown for Cassandra and for HBase]

Amazon EC2 Usage

Transaction phase (Cassandra):
- 10,000 records
- 1,000,000 operations
- 250 threads

YCSB Cassandra Results
Update Heavy Workload (50/50)

[Figure: two panels, Update and Read. x-axis: Throughput (ops/sec), 0 to 6,000; y-axis: Average Latency (ms), 0 to 60]

YCSB HBase Results

[Figure: two panels, "Update HBase 0.90.5" and "Read HBase 0.90.5". x-axis: Throughput (ops/sec), 471 to 845; y-axis: Average Latency (ms), 0.00 to 0.30 for updates and 0.00 to 1200.00 for reads]

YCSB Cassandra Results

[Figure: "Elasticity Cassandra 1.0". x-axis: Time (milliseconds), 0 to 400,000; y-axis: Latency (ms), 0 to 80,000]



YCSB Future

Provide statistics for:
- Availability
- Replication

Additional Distributed Databases

Currently supported:
- Cassandra
- Mapkeeper
- MongoDB
- Redis
- Voldemort
- VMware vFabric GemFire
- HBase


Conclusions

• YCSB provides a common ground for benchmarking cloud DB services

• Good for learning and experimenting with different distributed databases

• Open source, extensible for new databases

• Laboratory with Amazon EC2 provided good insight into setting up cloud services

• Challenges
  • Installation problems
  • Hard-to-follow documentation
  • Working in a distributed environment requires a lot of configuration


References

• YCSB (Yahoo! Cloud Serving Benchmark): https://github.com/brianfrankcooper/YCSB/wiki
• Yahoo! Research: http://research.yahoo.com/Web_Information_Management/YCSB
• BigTable: http://en.wikipedia.org/wiki/BigTable
• Cassandra: http://wiki.apache.org/cassandra/
• HBase: http://hbase.apache.org/


Questions