NoSQL: Cassandra vs. HBase

Post on 10-May-2015

YCSB: Yahoo! Cloud Serving Benchmark

Scalable Distributed Systems

Antonio L. Severien antonio.severien@gmail.com

João Rosa Joao.rui.rosa@gmail.com

Overview

• Distributed Databases
• Cassandra
• HBase
• YCSB General View
• YCSB Details
• Amazon EC2
• YCSB Results
• YCSB Future
• Conclusions
• References

Distributed Databases

Traditional RDBMS
• ACID transactions
• Query language (SQL)
• Data tied to the modeling (hard to analyze)
• Scalable to a limit

Distributed Databases
• Not ACID
• Not relational
• Column oriented (key-value)
• CAP (Consistency, Availability, Partition tolerance)
• Big Data (massively scalable)

Distributed Databases
• Sherpa/PNUTS
• BigTable
  • HBase, Hypertable, HTable
• Megastore
• Azure
• Cassandra
• Amazon Web Services
  • S3, SimpleDB, EBS
• CouchDB
• Voldemort
• Dynomite
• Tokyo
• Redis
• MongoDB

Distributed Databases

• NoSQL databases have different designs and architectures

Cassandra
• Thrift
• Gossip
• Token ring

HBase
• HDFS
• ZooKeeper
• Hadoop (MapReduce)

BigTable
• GFS
• Chubby (Lock Service)
• MapReduce

Cassandra
• Highlights
  • High availability
  • Incremental scalability
  • Eventually consistent
  • Tradeoffs between consistency and latency
  • Minimal administration
  • No SPOF (Single Point of Failure)
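The consistency/latency tradeoff can be illustrated with the usual quorum rule: with N replicas, a read that waits for R replicas is guaranteed to see a write acknowledged by W replicas only when R + W > N. The sketch below is a minimal illustration under that assumption; the function name and the level table are hypothetical and not part of any Cassandra client API.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum intersects every write quorum (R + W > N)."""
    return r + w > n


def consistency_table(n: int) -> dict:
    """Map (read level, write level) pairs to whether reads see the latest write.

    The level names mirror common usage (ONE, QUORUM, ALL); the numbers are
    how many replicas must answer before the request returns.
    """
    levels = {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}
    return {
        (read_name, write_name): quorums_overlap(n, r, w)
        for read_name, r in levels.items()
        for write_name, w in levels.items()
    }


if __name__ == "__main__":
    # Illustrative replication factor of 3.
    for (read_level, write_level), strong in consistency_table(3).items():
        print(f"read={read_level:<6} write={write_level:<6} strong={strong}")
```

Waiting for more replicas makes the overlap hold but also makes each request as slow as the slowest replica it waits for, which is where the added latency comes from.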

Cassandra
• CAP-aware
  • Cassandra favors Availability and Partition tolerance (AP) and is eventually consistent
  • Providing strong Consistency in Cassandra increases latency
• Partitioning
  • Token oriented
• Explicit Replication
  • Replication factor ≤ Total nodes
• High level clients
  • Python, Java, C#, .NET, Scala, Ruby, PHP, Erlang, Haskell, etc.
  • Thrift driver-level interface
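Token-oriented partitioning with explicit replication can be sketched as a hash ring: each node owns a token, a key goes to the first node whose token is at or after the key's hash, and the next nodes clockwise hold the extra replicas. This is a simplified sketch, not Cassandra's implementation; the class name, the node names and the use of MD5 as the hash are assumptions made for illustration.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 used as a stand-in hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class TokenRing:
    """Hypothetical ring: one token per node, replicas placed clockwise."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            # Mirrors the slide's constraint: replication factor <= total nodes.
            raise ValueError("replication factor must not exceed node count")
        self.replication_factor = replication_factor
        self.ring = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, key):
        """Return the nodes that store `key`, primary first."""
        # First node whose token is >= the key's token, wrapping around.
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.ring)
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replication_factor)
        ]


if __name__ == "__main__":
    ring = TokenRing(["node1", "node2", "node3"], replication_factor=2)
    print(ring.replicas("user42"))
```

Adding a node only changes ownership of the token range next to it, which is one way to read the "incremental scalability" claim.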

Cassandra
• Data Model
  • Cluster:
    • Machines (nodes) in a logical Cassandra instance
    • Can contain multiple keyspaces
  • Keyspace:
    • Namespace for ColumnFamilies
  • ColumnFamilies:
    • Contain multiple columns, each with a name, value and timestamp, referenced by row keys
    • Analogous to a table in an RDBMS
  • SuperColumns:
    • Columns with subcolumns

Rows    Columns
keyA    Column1  Column2  Column3
keyB    Column5  Column6  Column10

Column
  Byte[] Name
  Byte[] Value
  I64 Timestamp
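The column structure above (name, value, timestamp) can be modeled in a few lines. The sketch below also shows why the timestamp is there: when two versions of the same column meet, the newer one wins. The type and function names are hypothetical, and the tie-breaking rule is simplified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # 64-bit integer (I64) in the slide's definition


def reconcile(a: Column, b: Column) -> Column:
    """Keep the version with the newer timestamp (last write wins).

    Ties are resolved in favor of `a` here, which is a simplification.
    """
    return a if a.timestamp >= b.timestamp else b


# A column family maps row keys to named columns; rows need not share columns.
column_family = {
    b"keyA": {
        b"Column1": Column(b"Column1", b"a1", timestamp=1),
        b"Column2": Column(b"Column2", b"a2", timestamp=1),
    },
    b"keyB": {
        b"Column5": Column(b"Column5", b"b5", timestamp=1),
    },
}

if __name__ == "__main__":
    old = column_family[b"keyA"][b"Column1"]
    new = Column(b"Column1", b"a1-updated", timestamp=2)
    column_family[b"keyA"][b"Column1"] = reconcile(old, new)
    print(column_family[b"keyA"][b"Column1"].value)
```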

Cassandra

[Figure: partitioning and replication]

HBase

“HBase is more a datastore than a database”

• It lacks many of the features of RDBMS

• Distributed and scalable big data store.

• Regions model

• Strong consistency

HBase

Built on top of Hadoop Distributed Filesystem (HDFS)

HBase

• The NameNode is responsible for maintaining the filesystem metadata.

• The DataNodes are responsible for storing HDFS blocks.

Note: In our case study, we were only interested in the HDFS layer.

HBase

[Figure: NameNode and DataNodes]

HBase

• Data is stored into HBase tables.

• Tables are made of rows and columns.

• All columns belong to a particular column family.

Important note: All column family members are stored together.

• Queries that stay within a single column family perform better, because its members are stored together
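A toy model helps show why column families matter for performance: if each family has its own store, a read that names one family never touches the others. This is only a sketch of the idea, not HBase's storage engine; the class and method names are hypothetical, and the `family:qualifier` column naming follows HBase convention.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical table that keeps one store per column family."""

    def __init__(self, families):
        # Families are fixed up front; qualifiers inside them are free-form.
        self.stores = {family: defaultdict(dict) for family in families}

    def put(self, row, column, value):
        family, qualifier = column.split(":", 1)
        if family not in self.stores:
            raise KeyError(f"unknown column family: {family}")
        self.stores[family][row][qualifier] = value

    def get_family(self, row, family):
        """Read every column of one family for a row, touching one store only."""
        return dict(self.stores[family].get(row, {}))


if __name__ == "__main__":
    table = ToyTable(["info", "stats"])
    table.put("row1", "info:name", "alice")
    table.put("row1", "info:email", "alice@example.com")
    table.put("row1", "stats:visits", "7")
    print(table.get_family("row1", "info"))
```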

YCSB General View

• Which is the best NoSQL DB?
• How to compare?

• Yahoo! Cloud Serving Benchmark (YCSB)
  • Benchmarking tool
  • Evaluates the performance of key-value and cloud DBs on a common set of workloads
  • Client: an extensible workload generator

• Yahoo! Research
  • Brian F. Cooper - cooperb@yahoo-inc.com
  • Joint work with Adam Silberstein, Erwin Tam, Raghu Ramakrishnan and Russell Sears

YCSB Details
• How does it work?

[Figure: the YCSB Client, made up of the Workload Executor, Client Threads, Statistics and the DB Interface Layer, talking to the Cloud Serving Store]

Workload file
• Read/write mix
• Record size
• Popularity distribution
• …

Command line
• DB to use
• Workload to use
• Target throughput
• Number of threads
• …

YCSB Details

Benchmark Tiers
• Performance
  • Measure latency/throughput curve
  • Increase throughput until saturation
• Scalability
  • Scale up: increase hardware, data size and throughput proportionally
  • Elastic speedup: add servers while running a workload

YCSB Details

Load phase
- Load the database
$ ycsb load cassandra-10 -p hosts=127.0.0.1 -P workloadX

Transactions phase
- Execute the workload
$ ycsb run cassandra-10 -p hosts=127.0.0.1 -P workloadX

Random Load Distribution

YCSB Details

# Yahoo! Cloud System Benchmark
# Workload A: Update heavy workload
#   Application example: Session store recording recent actions
#
#   Read/update ratio: 50/50
#   Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
#   Request distribution: zipfian

recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload

readallfields=true

readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0

requestdistribution=zipfian
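To make the workload file concrete, the sketch below reads properties like the ones above and generates operations from them: the operation type is drawn according to the four proportions, and the record is drawn from a Zipf-like distribution so that a few records are much more popular than the rest. It is an illustration of the idea, not YCSB's generator; the function names, the `user<N>` key format and the exponent of 0.99 are assumptions.

```python
import random

WORKLOAD = """\
# Workload A: update heavy
recordcount=1000
operationcount=1000
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
"""


def parse_properties(text):
    """Parse key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def make_generator(props, seed=0, zipf_exponent=0.99):
    """Return a function producing (operation, key) pairs for the workload."""
    rng = random.Random(seed)
    operations = ["read", "update", "scan", "insert"]
    op_weights = [float(props.get(op + "proportion", 0)) for op in operations]
    record_count = int(props["recordcount"])
    # Zipf-like popularity: the record at rank k gets weight 1 / k^s.
    key_weights = [1.0 / (rank ** zipf_exponent)
                   for rank in range(1, record_count + 1)]

    def next_operation():
        op = rng.choices(operations, weights=op_weights)[0]
        index = rng.choices(range(record_count), weights=key_weights)[0]
        return op, f"user{index}"

    return next_operation


if __name__ == "__main__":
    props = parse_properties(WORKLOAD)
    next_operation = make_generator(props)
    for _ in range(5):
        print(next_operation())
```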

YCSB Details
• Execution parameters

$ ./bin/ycsb run cassandra-10 -P workloads/workloada -s -threads 10 -target 100 > transactions.dat

[OVERALL],RunTime(ms), 10110
[OVERALL],Throughput(ops/sec), 98.91196834817013
[UPDATE], Operations, 491
[UPDATE], AverageLatency(ms), 0.054989816700611
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 1
[UPDATE], 95thPercentileLatency(ms), 1
[UPDATE], 99thPercentileLatency(ms), 1
[UPDATE], Return=0, 491
[UPDATE], 0, 464
[UPDATE], 1, 27
[UPDATE], 2, 0
[UPDATE], 3, 0
[UPDATE], 4, 0
...
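The summary lines are plain comma-separated triples, so they are easy to post-process. The sketch below parses a few of the lines shown above and checks the reported throughput against the run time: 98.91 ops/sec over 10.110 s is 1000 operations, which matches `operationcount=1000` in the workload file. The function name and the dictionary layout are choices made for this example, not part of YCSB.

```python
SAMPLE = """\
[OVERALL],RunTime(ms), 10110
[OVERALL],Throughput(ops/sec), 98.91196834817013
[UPDATE], Operations, 491
[UPDATE], AverageLatency(ms), 0.054989816700611
[UPDATE], Return=0, 491
"""


def parse_summary(text):
    """Turn '[SECTION], Metric, value' lines into {(section, metric): float}."""
    metrics = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3 or not parts[0].startswith("["):
            continue  # skip status lines, blanks and anything else
        try:
            value = float(parts[2])
        except ValueError:
            continue
        metrics[(parts[0].strip("[]"), parts[1])] = value
    return metrics


if __name__ == "__main__":
    metrics = parse_summary(SAMPLE)
    runtime_s = metrics[("OVERALL", "RunTime(ms)")] / 1000.0
    total_ops = metrics[("OVERALL", "Throughput(ops/sec)")] * runtime_s
    print(f"operations completed: {total_ops:.0f}")
    print(f"updates: {metrics[('UPDATE', 'Operations')]:.0f}")
```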

YCSB Details

$ ./bin/ycsb run basic -P workloads/workloada -P large.dat -s -threads 10 -target 100 -p measurementtype=timeseries -p timeseries.granularity=2000 > transactions.dat

[OVERALL],RunTime(ms), 10077
[OVERALL],Throughput(ops/sec), 9923.58836955443
[UPDATE], Operations, 50396
[UPDATE], AverageLatency(ms), 0.04339630129375347
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 338
[UPDATE], Return=0, 50396
[UPDATE], 0, 0.10264765784114054
[UPDATE], 2000, 0.026989343690867442
[UPDATE], 4000, 0.0352882703777336
[UPDATE], 6000, 0.004238958990536277
[UPDATE], 8000, 0.052813085033008175
[UPDATE], 10000, 0.0
[READ], Operations, 49604
[READ], AverageLatency(ms), 0.038242883638416256
[READ], MinLatency(ms), 0
[READ], MaxLatency(ms), 230
[READ], Return=0, 49604
[READ], 0, 0.08997245741099663
[READ], 2000, 0.02207505518763797
[READ], 4000, 0.03188493260913297
[READ], 6000, 0.004869141813755326
[READ], 8000, 0.04355329949238579
[READ], 10000, 0.005405405405405406

YCSB Details

Status Output

Amazon EC2 Configuration

Large Instance
• 7.5 GB memory
• 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
• 850 GB instance storage
• 64-bit platform
• I/O Performance: High
• API name: m1.large

Experiment Set-up

Cassandra Cluster
• 3 nodes + 1 node (Elasticity)

HBase Cluster
• 3 nodes

Amazon EC2 Usage

Load phase: 60,000,000 records of 1 KB

[Figures: load phase charts for Cassandra and for HBase, shown separately and side by side]

Amazon EC2 Usage

Transaction phase:
- 10,000 records
- 1,000,000 operations
- 250 threads

Cassandra

YCSB Cassandra Results
Update Heavy Workload (50/50)

[Figure: two panels, Update and Read. Each plots average latency (ms), axis 0 to 60, against throughput (ops/sec), axis 0 to 6,000]

YCSB HBase Results

[Figure: Update, HBase 0.90.5. Average latency (ms), axis 0.00 to 0.30, against throughput (ops/sec), points from 471 to 845]

[Figure: Read, HBase 0.90.5. Average latency (ms), axis 0.00 to 1200.00, against throughput (ops/sec), points from 471 to 845]

YCSB Cassandra Results

[Figure: Elasticity, Cassandra 1.0. Latency (ms), axis 0 to 80,000, against time (milliseconds), axis 0 to 400,000]

YCSB Future

Provide statistics for:
- Availability
- Replication

Additional Distributed Databases

Currently supported:
- Cassandra
- MapKeeper
- MongoDB
- Redis
- Voldemort
- VMware vFabric GemFire
- HBase

Conclusions

• YCSB provides a common ground for benchmarking cloud DB services

• Good for learning and experimenting with different distributed databases

• Open source, extensible for new databases

• Laboratory with Amazon EC2 provided good insight into setting up cloud services

• Challenges
  • Installation problems
  • Hard-to-follow documentation
  • Working in a distributed environment requires a lot of configuration

References

• YCSB (Yahoo! Cloud Serving Benchmark)
  • https://github.com/brianfrankcooper/YCSB/wiki
• Yahoo! Research
  • http://research.yahoo.com/Web_Information_Management/YCSB
• BigTable
  • http://en.wikipedia.org/wiki/BigTable
• Cassandra
  • http://wiki.apache.org/cassandra/
• HBase
  • http://hbase.apache.org/

Questions
