cassandra and materialized views

1

Cassandra 2.2 and 3.0 new features

DuyHai DOAN Apache Cassandra Technical Evangelist

#VoxxedBerlin @doanduyhai

Datastax

2

•  Founded in April 2010

•  We contribute a lot to Apache Cassandra™

•  400+ customers (25 of the Fortune 100), 450+ employees

•  Headquarter in San Francisco Bay area

•  EU headquarter in London, offices in France and Germany

•  Datastax Enterprise = OSS Cassandra + extra features

Materialized Views (MV)•  Why ? •  Detailed Impl •  Gotchas

Why Materialized Views ?•  Relieve the pain of manual denormalization

CREATE TABLE user( id int PRIMARY KEY, country text, …);CREATE TABLE user_by_country( country text, id int, …, PRIMARY KEY(country, id));

4

CREATE TABLE user_by_country ( country text, id int, firstname text, lastname text, PRIMARY KEY(country, id));

Materialzed View In ActionCREATE MATERIALIZED VIEW user_by_country AS SELECT country, id, firstname, lastnameFROM user WHERE country IS NOT NULL AND id IS NOT NULLPRIMARY KEY(country, id)

5

Materialzed View SyntaxCREATE MATERIALIZED VIEW [IF NOT EXISTS] keyspace_name.view_name

AS SELECT column1, column2, ...

FROM keyspace_name.table_name

WHERE column1 IS NOT NULL AND column2 IS NOT NULL ...

PRIMARY KEY(column1, column2, ...)

Must select all primary key columns of base table

•  IS NOT NULL condition for now •  more complex conditions in future

•  at least all primary key columns of base table (ordering can be different)

•  maximum 1 column NOT pk from base table

6

Materialized Views Demo 7

Materialized View Impl

C*

C*

C*

C*

C* C*

C* C*

UPDATE user SET country=‘FR’

WHERE id=1

① •  send mutation to all replicas•  waiting for ack(s) with CL

8


C*

C*

C*

C*

C* C*

C* C*


WHERE id=1

② Acquire local lock on base table partition

9


C*

C*

C*

C*

C* C*

C* C*


WHERE id=1

③ Local read to fetch current values

SELECT * FROM user WHERE id=1

10


C*

C*

C*

C*

C* C*

C* C*


WHERE id=1

④ Create BatchLog with

•  DELETE FROM user_by_country WHERE country = ‘old_value’

•  INSERT INTO user_by_country(country, id, …) VALUES(‘FR’, 1, ...)

11


C*

C*

C*

C*

C* C*

C* C*


WHERE id=1

⑤ Execute async BatchLog

to paired view replica with CL = ONE

12


C*

C*

C*

C*

C* C*

C* C*


WHERE id=1

⑥ Apply base table updade locallySET COUNTRY=‘FR’

13


C*

C*

C*

C*

C* C*

C* C*


WHERE id=1

⑦ Release local lock

14


C*

C*

C*

C*

C* C*

C* C*


WHERE id=1

⑧ Return ack to coordinator

15


C*

C*

C*

C*

C* C*

C* C*


WHERE id=1

⑨ If CL ack(s)

received, ack client

16

MV Failure Cases: concurrent updates

Read base row (country=‘UK’)•  DELETE FROM mv WHERE

country=‘UK’•  INSERT INTO mv …(country)

VALUES(‘US’)•  Send async BatchLog •  Apply update country=‘US’

1) UPDATE … SET country=‘US’ 2) UPDATE … SET country=‘FR’



VALUES(‘FR’)•  Send async BatchLog •  Apply update country=‘FR’

t0

t1

t2

Without local

lock

17









t0

t1

t2

Without local

lock

18

INSERT INTO mv …(country) VALUES(‘US’)INSERT INTO mv …(country) VALUES(‘FR’)






Read base row (country=‘US’)•  DELETE FROM mv WHERE

country=‘US’•  INSERT INTO mv …(country)


With local lo

ck

🔒

🔓 🔒

🔓 19

MV Failure Cases: failed updates to MV

C*

C*

C*

C*

C* C*

C* C*


WHERE id=1

⑤ Execute async BatchLog

to paired view replica with CL = ONE

✘

MV replica d

own

20

MV Failure Cases: failed updates to MV

C*

C*

C*

C*

C* C*

C* C*


WHERE id=1

BatchLog replay

MV replica u

p

21

Materialized View Performance•  Write performance

•  local lock•  local read-before-write for MV à update contention on partition (most of perf hits) •  local batchlog for MV•  ☞ you only pay this price once whatever number of MV •  for each base table update: mv_count x 2 (DELETE + INSERT) extra mutations

22

Materialized View Performance•  Write performance vs manual denormalization

•  MV better because no client-server network traffic for read-before-write •  MV better because less network traffic for multiple views (client-side BATCH)

•  Makes developer life easier à priceless

23

Materialized View Performance•  Read performance vs secondary index

•  MV better because single node read (secondary index can hit many nodes)•  MV better because single read path (secondary index = read index + read data)

24

Materialized Views Consistency•  Consistency level

•  CL honoured for base table, ONE for MV + local batchlog

•  Weaker consistency guarantees for MV than for base table.

•  Exemple, write at QUORUM •  guarantee that QUORUM replicas of base tables have received write•  guarantee that QUORUM of MV replicas will eventually receive DELETE + INSERT

25

Materialized Views Gotchas•  Beware of hot spots !!!

•  MV user_by_gender 😱

26

Q & A

! "

27

User Define Functions (UDF)•  Why ? •  Detailed Impl •  UDAs •  Gotchas

Rationale•  Push computation server-side

•  save network bandwidth (1000 nodes!)•  simplify client-side code•  provide standard & useful function (sum, avg …)•  accelerate analytics use-case (pre-aggregation for Spark)

29

How to create an UDF ?

CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALL ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnTypeLANGUAGE language AS $$ // source code here$$;

30


CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnTypeLANGUAGE language AS $$ // source code here$$;

An UDF is keyspace-wide

31


CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnTypeLANGUAGE languageAS $$ // source code here$$;

Param name to refer to in the code Type = CQL3 type

32


CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnTypeLANGUAGE language // jAS $$ // source code here$$;

Always called Null-check mandatory in code

33


CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT RETURN returnTypeLANGUAGE language // javAS $$ // source code here$$;

If any input is null, code block is skipped and return null

34


CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnType LANGUAGE languageAS $$ // source code here$$;

CQL types •  primitives (boolean, int, …) •  collections (list, set, map) •  tuples •  UDT

35


CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnTypeLANGUAGE languageAS $$ // source code here$$; JVM supported languages

•  Java, Scala •  Javascript (slow) •  Groovy, Jython, JRuby •  Clojure ( JSR 223 impl issue)

36


CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS][keyspace.]functionName (param1 type1, param2 type2, …)CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUTRETURN returnTypeLANGUAGE languageAS $$ // source code here$$;

37

UDF Demo

38

UDA•  Real use-case for UDF

•  Aggregation server-side à huge network bandwidth saving

•  Provide similar behavior for Group By, Sum, Avg etc …

39

How to create an UDA ?

CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS][keyspace.]aggregateName(type1, type2, …)SFUNC accumulatorFunction STYPE stateType [FINALFUNC finalFunction]INITCOND initCond;

Only type, no param name

State type

Initial state type

40



Accumulator function. Signature: accumulatorFunction(stateType, type1, type2, …)

RETURNS stateType

41



Optional final function. Signature: finalFunction(stateType)

42



UDA return type ? If finalFunction •  return type of finalFunction Else •  return stateType

43

UDA Demo

44

Gotchas

C* C*

C*

C*

UDA ①

② & ③

⑤ ② & ③

② & ③

45

Gotchas

C* C*

C*

C*

UDA ①

② & ③

⑤ ② & ③

② & ③

46

Why do not apply UDF/UDA on replica node ?

Gotchas

C* C*

C*

C*

UDA ①

② & ③

④ •  apply accumulatorFunction•  apply finalFunction

⑤ ② & ③

② & ③

1.  Because of eventual consistency

2.  UDF/UDA applied AFTER last-write-win logic

47

Gotchas

48

•  UDA in Cassandra is not distributed !

•  Execute UDA on a large number of rows (106 for ex.) •  single fat partition•  multiple partitions•  full table scan

•  à Increase client-side timeout•  default Java driver timeout = 12 secs•  JAVA-1033 JIRA for per-request timeout setting

Cassandra UDA or Apache Spark ?

49

Consistency Level

Single/Multiple Partition(s)

Recommended Approach

ONE Single partition UDA with token-aware driver because node local

ONE Multiple partitions Apache Spark because distributed reads

> ONE Single partition UDA because data-locality lost with Spark

> ONE Multiple partitions Apache Spark definitely

Cassandra UDA or Apache Spark ?

50

Consistency Level

Single/Multiple Partition(s)

Recommended Approach

ONE Single partition UDA with token-aware driver because node local

ONE Multiple partitions Apache Spark because distributed reads

> ONE Single partition UDA because data-locality lost with Spark

> ONE Multiple partitions Apache Spark definitely

Q & A

! "

51

New Storage Engine•  Data structure •  Disk space usage

Pre 3.0 data structure

Map<byte[ ], SortedMap<byte[ ], Cell>>

53

CREATE TABLE sensor_data( sensor_id uuid, date timestamp, sensor_type text, sensor_value double, PRIMARY KEY(sensor_id, date));

Pre 3.0 on disk layout

54

RowKey: de305d54-75b4-431b-adb2-eb6b9e546014 => (column=2015-04-27 10:00:00+0100:, value=, timestamp=1430128800) => (column=2015-04-27 10:00:00+0100:sensor_type, value=‘Temperature’, timestamp=1430128800) => (column=2015-04-27 10:00:00+0100:sensor_value, value=23.48, timestamp=1430128800) => (column=2015-04-27 10:01:00+0100:, value=, timestamp=1430128860) => (column=2015-04-27 10:01:00+0100:sensor_type, value=‘Temperature’, timestamp=1430128860) => (column=2015-04-27 10:01:00+0100:sensor_value, value=24.08, timestamp=1430128860)

Clustering values are repeated for each normal column

Full timestamp storage

Cassandra 3.0 data structure

Map<byte[ ], SortedMap<ClusteringColumn, Row>>

55

CREATE TABLE sensor_data( sensor_id uuid, date timestamp, sensor_type text, sensor_value double, PRIMARY KEY(sensor_id, date));

Cassandra 3.0 on disk layout

56

PartitionKey: de305d54-75b4-431b-adb2-eb6b9e546014 => clusteringColumn:2015-04-27 10:00:00+0100 => row_timestamp=1430128800 => (column_value=‘Temperature’, delta_encoded_timestamp=+0) => (column_value=23.48, delta_encoded_timestamp=+0) => clusteringColumn:2015-04-27 10:01:00+0100 => row_timestamp=1430128860 => (column_value=‘Temperature’, delta_encoded_timestamp=+0) => (column_value=24.08, delta_encoded_timestamp=+0)

Delta-encoded timestamp vs row timestamp

Gains

57

•  No clustering value repetition

•  Column labels are stored only once in meta data

•  Delta encoding of timestamp, 8 bytes saved each time

•  Less disk space used

Benchmarks

58

CREATE TABLE events ( id uuid, date timeuuid, prop1 int, prop2 text, prop3 float, PRIMARY KEY(id, date));

106 rows

Small string

Benchmarks

59

CREATE TABLE largetext( key int, prop1 int, prop2 text,PRIMARY KEY(id));

106 rows

Large string (1000)

Benchmarks

60

CREATE TABLE largeclustering( key int, clust text, prop1 int, prop2 set<float>,PRIMARY KEY(id, clust));

106 rows Medium string (100)

50 items

Benchmarks

61

CREATE TABLE events ( id uuid, date timeuuid, prop1 int, prop2 text, prop3 float, PRIMARY KEY(id, date)) WITH COMPACT STORAGE ;

Q & A

! "

62

@doanduyhai

[email protected]

https://academy.datastax.com/

Thank You

63

cassandra and materialized views

Software