real-time cassandra

Real-time Cassandra, Richard Low, @richardalow

Upload: acunu

Post on 24-Jan-2015


DESCRIPTION

Talk given at Denormalised London, 2012-09-20. Discussion of what a real-time system needs to do and why Cassandra is a good fit.

TRANSCRIPT

Page 1: Real-time Cassandra

Real-time Cassandra

Richard Low

@richardalow

Page 2: Real-time Cassandra

Outline

• What is real-time?

• How do databases implement real-time queries?

• Why is Cassandra ideal for real-time applications?

• Writing real-time applications with Cassandra

Page 3: Real-time Cassandra

What is real-time?

Page 4: Real-time Cassandra

“Of or relating to a system in which input data is processed within milliseconds” dictionary.com

“Occurring immediately” webopedia

“Often real-time response times are understood to be in the order of milliseconds and sometimes microseconds” wikipedia

“...the most important requirement of a real-time system is predictability and not performance” wikipedia

“...a time frame that is very brief, appearing to be immediate.” wisegeek.com

Page 6: Real-time Cassandra

Real-time queries

• ‘Give me X’

• ‘How many Y?’

• ‘What is the top K?’

• ‘How many distinct Z from P?’

Page 7: Real-time Cassandra

Real-time definition

• Definition: a query is processed in real time if the time to get the answer is at most a constant times the sum of the transfer time and the round-trip time

$t_{\text{response}} \le C \, (t_{\text{transfer}} + t_{\text{ping}})$

Page 8: Real-time Cassandra

Real-time definition

• The more you ask for, the longer it takes

• For small queries, the response time is dominated by the round-trip time

• No query can take less time than the time to receive it

Page 9: Real-time Cassandra

Real-time definition

• Users on faster networks expect a faster response

• What we mean by real-time is getting faster

Page 10: Real-time Cassandra

Implications

• What does this mean for the database?

• Use Google Analytics example

• Simple query: ‘How many page views have there been from France in the last 24 hours?’

Page 11: Real-time Cassandra

Requirement

• Response is one number

• With overhead, say ~1KB

• Ping time 1ms

• 10Mbit connection => 1KB in ~1ms

• 2ms total
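
As a rough check of the arithmetic above, the real-time bound can be evaluated directly. This is a minimal Python sketch: the function names and the choice of the constant `c` are illustrative assumptions, not something given in the talk.

```python
def transfer_time_s(payload_bytes: int, link_bits_per_s: float) -> float:
    """Time to push payload_bytes over a link of the given bandwidth."""
    return payload_bytes * 8 / link_bits_per_s


def is_real_time(t_response_s: float, payload_bytes: int,
                 link_bits_per_s: float, ping_s: float, c: float = 1.0) -> bool:
    """True if t_response <= c * (t_transfer + t_ping).

    c is the constant from the definition; its value is an assumption here.
    """
    budget_s = c * (transfer_time_s(payload_bytes, link_bits_per_s) + ping_s)
    return t_response_s <= budget_s


# Figures from the slide: ~1KB response, 10Mbit/s connection, 1ms ping.
t_transfer = transfer_time_s(1_000, 10_000_000)  # 0.0008 s, i.e. roughly 1ms
budget = t_transfer + 0.001                      # roughly 2ms in total
print(f"transfer {t_transfer * 1e3:.1f} ms, budget {budget * 1e3:.1f} ms")
```

With these numbers the budget is about 2ms, so any approach that takes seconds per query falls outside the definition for every reasonable choice of `c`.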

Page 12: Real-time Cassandra

Solution 1

• grep '\.fr' /var/log/apache2/*.log

• Suppose you have 1M hits an hour => ~7GB of logs a day

• A single disk would take ~70s to scan this (at ~100MB/s)

• Need a beefy server to do this

• Needs to grow as your audience grows
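
The scan cost of Solution 1 can be estimated with a few lines of arithmetic. In this sketch the bytes-per-log-line figure and the disk throughput are assumptions chosen to match the slide's ~7GB and ~70s; they are not measured values.

```python
BYTES_PER_LOG_LINE = 300   # assumption: ~300 bytes per hit gives ~7GB a day
DISK_BYTES_PER_S = 100e6   # assumption: ~100MB/s sequential read from one disk


def log_bytes_per_day(hits_per_hour: int) -> int:
    """Size of one day of access logs at the given hit rate."""
    return hits_per_hour * 24 * BYTES_PER_LOG_LINE


def scan_seconds(hits_per_hour: int) -> float:
    """Time to read one day of logs sequentially off a single disk."""
    return log_bytes_per_day(hits_per_hour) / DISK_BYTES_PER_S


for rate in (1_000_000, 2_000_000, 10_000_000):
    print(f"{rate:>10} hits/hour -> {scan_seconds(rate):.0f} s per query")
```

The scan time grows linearly with traffic, which is why the hardware has to grow with the audience.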

Page 13: Real-time Cassandra

Solution 2

• Maintain a counter for each country

• Increment the counter on each hit

• On query just read the counter

• Maybe it is on disk - 5ms seek

• No need to scale speed with traffic
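
Solution 2 can be sketched as a per-country counter that is updated on every hit and read on every query. This is an in-memory illustration only; the class and method names are hypothetical, and it ignores the 24-hour window from the example query.

```python
from collections import Counter


class CountryHitCounter:
    """Keeps one running count per country; all work happens at write time."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record_hit(self, country: str) -> None:
        # Called once per page view, on ingest.
        self._counts[country] += 1

    def page_views(self, country: str) -> int:
        # A query reads a single number, however many hits were recorded.
        return self._counts[country]


hits = CountryHitCounter()
for country in ["fr", "uk", "fr", "de", "fr"]:
    hits.record_hit(country)
print(hits.page_views("fr"))
```

The read cost is the same whether there have been five hits or five billion, which is the property the slide is after.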

Page 14: Real-time Cassandra

Implications

• Real-time queries can only read about as much data as they send to the requester

• Need to precompute answers

• Store data in a query-centric rather than data-centric view

Page 15: Real-time Cassandra

Age of data

• A real-time query will often need to query new data

• But not necessarily

• Could run a batch process to precompute answers

Page 16: Real-time Cassandra

Solutions

Page 17: Real-time Cassandra

Solutions

• How do you make sure you don’t read any more than you have to?

• Denormalisation

• Organisation of data

• Counters

Page 18: Real-time Cassandra

Denormalisation

• Hard drive performance constraints:

• Sequential IO at 100s MB/s

• Seek at 100 IO/s

• Avoid random IO

• Effective block size 1MB
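
The 1MB effective block size follows from the two hard drive figures: in the time one seek takes, the disk could have read about 1MB sequentially. The sketch below works this through, assuming 100MB/s for sequential IO (the low end of the slide's "100s MB/s"); the helper name is illustrative.

```python
SEQUENTIAL_BYTES_PER_S = 100e6  # assumption: 100MB/s sequential throughput
SEEKS_PER_S = 100               # from the slide: ~100 random IOs per second

# Bytes the disk could have streamed during the time of one seek.
effective_block_bytes = SEQUENTIAL_BYTES_PER_S / SEEKS_PER_S


def read_time_s(num_reads: int, bytes_per_read: int) -> float:
    """Cost of num_reads separate reads: one seek each, plus transfer time."""
    seek_s = num_reads / SEEKS_PER_S
    transfer_s = num_reads * bytes_per_read / SEQUENTIAL_BYTES_PER_S
    return seek_s + transfer_s


scattered = read_time_s(100, 1_000)    # 100 items of 1KB, each needing a seek
contiguous = read_time_s(1, 100_000)   # the same 100KB stored together
print(f"block {effective_block_bytes / 1e6:.0f}MB, "
      f"scattered {scattered:.3f}s, contiguous {contiguous:.3f}s")
```

Under these assumptions, reading 100 scattered 1KB items costs about a second, while reading the same data in one contiguous piece costs about a hundredth of that.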

Page 19: Real-time Cassandra

Denormalisation

• Store items accessed at similar times near to each other

• Involves copying

• Copying isn’t bad

• Storage costs <$100 per TB

Page 20: Real-time Cassandra

Organisation of data

• If you read 100 items off disk, ensure they are next to each other

• Saves reading extra data around them and index lookups

Page 21: Real-time Cassandra

Fast range queries

• Get me all keys in the range E to I

[Figure: keys A F H I M X stored in sorted order, with the range [E, I] marked]
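
When keys are kept in sorted order, a range query is two binary searches followed by one contiguous read. A minimal Python sketch using the keys from the slide (the function name is illustrative):

```python
from bisect import bisect_left, bisect_right


def range_query(sorted_keys: list, low: str, high: str) -> list:
    """Return all keys k with low <= k <= high from a sorted list."""
    start = bisect_left(sorted_keys, low)    # first position with key >= low
    end = bisect_right(sorted_keys, high)    # first position with key > high
    return sorted_keys[start:end]            # one contiguous slice


keys = ["A", "F", "H", "I", "M", "X"]
result = range_query(keys, "E", "I")
print(result)  # ['F', 'H', 'I']
```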

Page 22: Real-time Cassandra

Fast range queries

• What happens when you insert?

[Figure: inserting new keys G and Q into the sorted keys A F H I M X, comparing two layouts for the range query [E, I]]

Page 23: Real-time Cassandra

Counters

• For queries that simply count, increment the counter

• Implement inc, dec, get

• Store multiple counts e.g. week, day, hour
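
One way to picture the inc, dec, get interface with several time granularities is to update one counter per bucket on every write. This is an in-memory sketch with hypothetical names; the bucket label formats are assumptions made for the example.

```python
from datetime import datetime


def bucket_label(granularity: str, when: datetime) -> str:
    """Label of the hour, day or ISO week that contains `when`."""
    if granularity == "hour":
        return when.strftime("%Y-%m-%dT%H")
    if granularity == "day":
        return when.strftime("%Y-%m-%d")
    if granularity == "week":
        iso_year, iso_week, _ = when.isocalendar()
        return f"{iso_year}-W{iso_week:02d}"
    raise ValueError(f"unknown granularity: {granularity}")


class BucketedCounter:
    GRANULARITIES = ("hour", "day", "week")

    def __init__(self) -> None:
        self._counts: dict = {}

    def inc(self, name: str, when: datetime, amount: int = 1) -> None:
        # One write touches every granularity, so reads stay a single lookup.
        for granularity in self.GRANULARITIES:
            key = (name, granularity, bucket_label(granularity, when))
            self._counts[key] = self._counts.get(key, 0) + amount

    def dec(self, name: str, when: datetime, amount: int = 1) -> None:
        self.inc(name, when, -amount)

    def get(self, name: str, granularity: str, when: datetime) -> int:
        key = (name, granularity, bucket_label(granularity, when))
        return self._counts.get(key, 0)


views = BucketedCounter()
views.inc("fr", datetime(2012, 9, 20, 10, 15))
views.inc("fr", datetime(2012, 9, 20, 11, 5))
views.inc("fr", datetime(2012, 9, 21, 9, 0))
print(views.get("fr", "day", datetime(2012, 9, 20, 0, 0)))
```

A "last 24 hours" query then sums at most 24 hourly buckets instead of scanning raw hits.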

Page 25: Real-time Cassandra

Cassandra and real-time

• Write optimised

• Fast merging

• Distributed counters

Page 26: Real-time Cassandra

Write optimised

• All writes are sequential on disk

• Each write is written multiple times during compactions

Page 27: Real-time Cassandra

Fast merging

How do we get from this to this?

[Figure: the keys A F H I M X and the new keys Q and G, shown before and after merging]

Page 28: Real-time Cassandra

Fast merging

• Write out new ordered SSTable

• When big enough, merge with existing
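
The two steps above can be sketched as a flush of an in-memory table into a sorted run, followed by a single sequential merge of two sorted runs. This is a simplified illustration in Python, not Cassandra's compaction code; the function names and the "newer value wins" rule for duplicate keys are assumptions of the sketch.

```python
import heapq


def flush(memtable: dict) -> list:
    """Write out an in-memory table as a sorted run of (key, value) pairs."""
    return sorted(memtable.items())


def merge_runs(older: list, newer: list) -> list:
    """Merge two sorted runs in one pass; the newer value wins on duplicates."""
    # Tag entries so that, for equal keys, the newer run sorts first.
    tagged_newer = ((key, 0, value) for key, value in newer)
    tagged_older = ((key, 1, value) for key, value in older)
    merged = []
    for key, _, value in heapq.merge(tagged_newer, tagged_older):
        if merged and merged[-1][0] == key:
            continue  # already emitted the newer value for this key
        merged.append((key, value))
    return merged


existing = flush({key: "old" for key in "AFHIMX"})
incoming = flush({key: "new" for key in "ZQKGFB"})
merged = merge_runs(existing, incoming)
print("".join(key for key, _ in merged))  # ABFGHIKMQXZ
```

Both inputs are read front to back and the output is written front to back, so the merge needs only sequential IO.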

Page 30: Real-time Cassandra

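[Figure: merging step by step. New keys (Q, G, then Z, Q, K, G, F, B) accumulate next to the existing sorted keys A F H I M X and are merged into one sorted run A B F G H I K M Q X Z]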

Page 31: Real-time Cassandra

How fast?

Page 32: Real-time Cassandra

Distributed counters

• Distributed, fault tolerant replicated counters

• No need for distributed locks

• Super fast
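
To see why a replicated counter can avoid distributed locks, consider a counter split into one shard per replica, where each replica only ever updates its own shard. The Python sketch below illustrates that general idea; it is a simplification with hypothetical names and is not a description of Cassandra's actual counter implementation.

```python
class ShardedCounter:
    """One replica's view: replica id -> (version, subtotal) for each shard."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.shards: dict = {}

    def add(self, delta: int) -> None:
        # A replica writes only its own shard, so no cross-replica lock is needed.
        version, subtotal = self.shards.get(self.replica_id, (0, 0))
        self.shards[self.replica_id] = (version + 1, subtotal + delta)

    def merge(self, other: "ShardedCounter") -> None:
        # Keep the most recent version of every shard; merging is idempotent.
        for replica_id, (version, subtotal) in other.shards.items():
            if version > self.shards.get(replica_id, (0, 0))[0]:
                self.shards[replica_id] = (version, subtotal)

    def value(self) -> int:
        return sum(subtotal for _, subtotal in self.shards.values())


a, b = ShardedCounter("a"), ShardedCounter("b")
a.add(3)
b.add(5)
b.add(-1)
a.merge(b)
b.merge(a)
print(a.value(), b.value())
```

After the replicas exchange state in either order, both report the same total.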

Page 33: Real-time Cassandra

Other requirements

Page 34: Real-time Cassandra

What else do we need?

High value getting quick response

Real-time analytics

High cost if service is down

Need high availability

Page 35: Real-time Cassandra

What else do we need?

High value getting quick response

Real-time analytics

Need low latency

Need data geographically close

Page 36: Real-time Cassandra

Cassandra and HA

• No SPOF

• Choose point on consistency and availability curve

• Tuneable consistency

• Replication

• Multi data-centre support
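
The trade-off behind tuneable consistency can be stated as a small rule: with N replicas, a read that contacts R of them is guaranteed to overlap a write acknowledged by W of them when R + W > N. A minimal Python sketch of that rule (the function names are illustrative):

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_overlaps_write(n: int, w: int, r: int) -> bool:
    """True if every read set of size r intersects every write set of size w."""
    return r + w > n


n = 3
for w, r in [(1, 1), (quorum(n), quorum(n)), (n, 1)]:
    print(f"N={n} W={w} R={r} -> overlap guaranteed: {read_overlaps_write(n, w, r)}")
```

Lower values of W and R need fewer replicas to respond, which favours availability and latency; higher values favour consistency.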

Page 37: Real-time Cassandra

Cassandra and low latency

• Can configure caches

• Can parallelise reads

• Multi-DC support enables world-wide replication

• Can choose lower consistency to avoid round-trips to other DCs

Page 38: Real-time Cassandra

Writing real-time apps with Cassandra

Page 39: Real-time Cassandra

Real-time apps

• Need to write code using a client library

• Design data-model

• If queries change, code changes

Page 40: Real-time Cassandra

Acunu Analytics

• Provides simple RESTful interface to Cassandra counters

• Push processing into ingest phase

[Figure: event → AA → counter updates → Cassandra]

Page 41: Real-time Cassandra

Acunu Analytics

• Event template, e.g.,

    select : ["COUNT", "AVG(loadTime)"],
    type : {
        time : [TIME(HOUR; MIN; SEC), ?, 0],
        page : PATH(/),
        loadTime : [LONG, 0, 0]
    }

• Specifies “blow-up” strategy according to supported queries

• Need to know basics of query in advance, but not whole thing
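
One plausible reading of the "blow-up" strategy is that each incoming event updates a counter for every time bucket and every path prefix that a later query might ask about. The Python sketch below illustrates that reading for the template's COUNT and AVG(loadTime); it is not Acunu Analytics code, and the class, method names, and bucket formats are all hypothetical.

```python
from datetime import datetime

# Assumed label formats for the HOUR, MIN and SEC buckets of the template.
TIME_FORMATS = ("%Y-%m-%d %H", "%Y-%m-%d %H:%M", "%Y-%m-%d %H:%M:%S")


def path_prefixes(page: str) -> list:
    """'/a/b' -> ['/', '/a', '/a/b'], one entry per level of the hierarchy."""
    parts = [part for part in page.split("/") if part]
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


class IngestCounters:
    def __init__(self) -> None:
        self._cells: dict = {}  # (time bucket, path prefix) -> [count, load sum]

    def ingest(self, time: datetime, page: str, load_time: int) -> None:
        # All the work is done here, once per event.
        for fmt in TIME_FORMATS:
            for prefix in path_prefixes(page):
                cell = self._cells.setdefault((time.strftime(fmt), prefix), [0, 0])
                cell[0] += 1
                cell[1] += load_time

    def query(self, time_bucket: str, prefix: str):
        """Return (COUNT, AVG(loadTime)) from a single counter lookup."""
        count, load_sum = self._cells.get((time_bucket, prefix), (0, 0))
        return count, (load_sum / count if count else None)


store = IngestCounters()
store.ingest(datetime(2012, 9, 20, 10, 15, 0), "/blog/2012/post", 100)
store.ingest(datetime(2012, 9, 20, 10, 45, 30), "/blog/about", 200)
print(store.query("2012-09-20 10", "/blog"))
```

Each event here fans out into 3 time buckets times the depth of its path, which is the storage cost paid for constant-time reads.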

Page 42: Real-time Cassandra

Features

• Simple, real-time, incremental analytics

• work done on ingest

• sum, count, distinct, avg, stddev, min-max etc

• time + hierarchy bucketing

• efficient ‘group’ semantics

• works with Apache Cassandra

Page 43: Real-time Cassandra

Summary

• Formalised what real-time means

• Deduced how data must be stored

• Explored how Cassandra has these properties

• Discussed how Acunu Analytics helps when writing real-time apps