building your own distributed system the easy way - cassandra summit eu 2014
TRANSCRIPT
Building Your Own Distributed System, The Easy Way
Kévin Lovato - @alprema
What this presentation will NOT talk about
• Gazillions of inserts per second
• Hundreds of nodes
• Migrations from old technology to C* that now go 100 times faster
What this presentation will talk about
• Servers that synchronize their state
• Out of order messages
• CQL Schema design
• Time measurement madness
Introduction
• Hedge fund specialized in algorithmic trading
• ~80 employees
• Our C* usage:
• Historical data (6+ TB)
• Time series (metrics)
• Home-made Service Bus (Zebus)
Service Bus 101
• Network abstraction layer
• Allows communication between services (SOA)
• Communication happens through business-level messages (events)
• Usually relies on a broker
Zebus 101
• Developed in .Net
• P2P
• Lightweight
• CQRS oriented
• 1+ year of production experience
• ~150M messages / day
Architecture overview
Terminology
• Peer: A program connected to the Bus
• Subscription: A message type a Peer is interested in
• Directory server: A Peer that knows all the Peers and their Subscriptions
[Diagram: Peer 1, Peer 2 and Peer 3 connected to Directory 1 and Directory 2]
Peer 1 is not connected and needs to register on the bus:
• Peer 1 sends a Register to a Directory and receives the Peers list + their Subscriptions
• The Directory broadcasts the new Peer information to the other Peers
• From then on, Peers use direct communication
Design requirements
• The Directory servers must be identical (no master)
• A peer can contact any of the Directory servers at any time
• Directory servers can be updated/restarted at any time
• Peers have to be able to add Subscriptions one at a time if needed
Option 1: Design a resilient distributed system
Option 2: Let Cassandra do the heavy lifting
Pick me! Pick me!
How?
I. Make the Directory Servers stateless
• Allows offloading state synchronization to Cassandra (QUORUM everywhere)
• Makes restart / crash recovery easy
• Only "business" code in the Directory Server
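The "QUORUM everywhere" point rests on Cassandra's standard overlap guarantee: with replication factor N, reads of R replicas and writes to W replicas always see each other when R + W > N. A one-line sanity check (Python, purely illustrative):

```python
replication_factor = 3
# QUORUM = floor(RF / 2) + 1 replicas, for both reads and writes
quorum = replication_factor // 2 + 1
# Overlap guarantee: any QUORUM read intersects any QUORUM write,
# so a stateless Directory always reads its own (or a newer) state back.
print(quorum, quorum + quorum > replication_factor)  # 2 True
```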
II. Handle out-of-order subscriptions
Timestamps: Naive implementation (server side)
Peer 1 is already registered on the Bus and needs to make multiple Subscription updates:
• Peer 1 sends Subscriptions update A, then Subscriptions update B
• A delay (network, slow machine, etc.) causes Directory 1 to process update A after Directory 2 has processed update B
• Directory 2 stamps update B at 00:00:01; Directory 1 stamps update A at 00:00:02
• Stored: Subscriptions update A, because server-side timestamps reflect processing order, not send order
Timestamps: Zebus implementation (client side)
Same scenario, but this time using client-side timestamps:
• Peer 1 sends Subscriptions update A (timestamp 00:00:01), then Subscriptions update B (timestamp 00:00:02)
• The delay voodoo happens again: Directory 1 processes update A after Directory 2 has processed update B
• Timestamp resolution is handled by C*
• Stored: Subscriptions update B, the most recent update from the Peer's point of view
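Cassandra's per-cell conflict resolution is last-write-wins on the timestamp. The rule can be simulated in a few lines (sketched in Python purely for illustration; Zebus itself is .Net):

```python
# Minimal last-write-wins register, as Cassandra applies it per cell:
# a write is kept only if its timestamp is higher than the stored one.
class LwwCell:
    def __init__(self):
        self.value = None
        self.timestamp = -1

    def write(self, value, timestamp):
        # Arrival order is irrelevant; only the timestamp decides.
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp

cell = LwwCell()
# Directory 2 processes update B first (client timestamp 2)...
cell.write("Subscriptions update B", 2)
# ...then the delayed Directory 1 applies update A (client timestamp 1).
cell.write("Subscriptions update A", 1)
print(cell.value)  # Subscriptions update B
```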
III. Handle subscriptions efficiently
A Peer is already registered on the bus and has subscribed to one event type:
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
It now needs to add a new subscription, so it sends all its current subscriptions + the new one:
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
Peer.1 | OtherEvent (new) | { misc. Info }
Now imagine that the Peer adds 10 000 subscriptions, one at a time: the whole list is resent 10 000 times:
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
Peer.1 | OtherEvent | { misc. Info }
…10 000 other events…
Peer.1 | NthEvent | { misc. Info }
Solution: Transfer subscriptions by message type
Each update now carries only the subscriptions for one message type:
Peer ID | MessageType | Sub. Info
Peer.1 | NewEvent (1st) | { misc. Info }
then:
Peer ID | MessageType | Sub. Info
Peer.1 | NewEvent (2nd) | { misc. Info }
And so on…
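A back-of-the-envelope comparison shows why resending the full list hurts: adding N subscriptions one at a time transfers O(N²) entries in total, versus O(N) with the per-message-type protocol (Python, illustrative only):

```python
N = 10_000

# Naive protocol: each update resends the full current list plus the new
# entry, so the k-th update carries k entries.
naive_entries = sum(k for k in range(1, N + 1))

# Per-message-type protocol: each update carries only the new entry.
per_type_entries = N

print(naive_entries)     # 50005000 entries in total (~O(N^2))
print(per_type_entries)  # 10000 entries in total (O(N))
```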
But then, how do we store that?
IV. Pick the proper row granularity
• We want to only do upserts (no read-before-write)
• We want Cassandra to use client timestamps to resolve out of order updates
• Subscriptions have to be updatable one by one
One subscription per row
Peer ID | MessageType | Subscription Info
Peer.18 | CoolEvent | { misc. Info }
… | … | …
• Primary Key (Peer Id, MessageType)
Peer 1 and Peer 2 need to register on the Bus:
• Peer 1 registers with 2 Subscriptions
• The Directory starts to write them to C*, one row per Subscription
• Peer 2 registers during the write
• Since the insertion was not over, Peer 2 gets an incomplete state:
Peer ID | MessageType | Sub. Info
Peer.1 | CoolEvent | { misc. Info }
(the second Subscription row is still missing)
All subscriptions in one row
Peer ID | All Subscriptions Blob
Peer.18 | { blob }
… | …
• Primary Key (Peer Id)
Peer 1 is already registered on the Bus and needs to add two Subscriptions:
• Peer 1 adds Subscription 1, then Subscription 2
• A delay (again!) slows down Directory 1, causing both Subscriptions to be handled simultaneously
• Directory 1 reads the current state (no subscriptions) to add Subscription 1; Directory 2 reads the same state to add Subscription 2
• They both store the updated state to C*, each containing only its own new subscription
• Stored: either Subscription 1 or Subscription 2, depending on which write was the slowest; the other update is lost
Solution: Compromise
• We split subscriptions into Static and Dynamic subscriptions
• Static subscriptions cannot be updated one-by-one
• The Dynamic subscriptions list cannot be handled as atomic
• Each type has its own Column Family
Static subscriptions schema
Peer ID | Endpoint | IsUp | […] | StaticSubscriptions
Peer.18 | tcp://1.2.3.4:123 | true | […] | { blob }
… | … | … | […] | …
• Primary Key (Peer Id)
Dynamic subscriptions schema
Peer ID | MessageType | Subscription info
Peer.18 | UserCreated | { misc. Info }
… | … | …
• Primary Key (Peer Id, MessageType)
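In CQL terms, the two column families might look like the following sketch (table and column names are illustrative, not the actual Zebus schema):

```sql
-- One row per peer: the static subscriptions travel as a single blob,
-- so registering a peer is one atomic upsert.
CREATE TABLE directory.peers (
    peer_id              text PRIMARY KEY,
    endpoint             text,
    is_up                boolean,
    static_subscriptions blob
);

-- One row per (peer, message type): dynamic subscriptions can be
-- upserted one by one, with no read-before-write.
CREATE TABLE directory.dynamic_subscriptions (
    peer_id      text,
    message_type text,
    sub_info     blob,
    PRIMARY KEY (peer_id, message_type)
);
```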
V. Miscellaneous bits of "fun"
DateTime.Now
• Calling DateTime.Now twice in a row can (and will) return the same value
• Its resolution is around 10 ms
• We had to create a unique timestamp provider (add 1 tick when called in the same "time bucket")
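A unique timestamp provider along those lines can be sketched as follows (in Python for illustration; the class name and clock parameter are hypothetical, and Zebus's actual version is .Net and counts in DateTime ticks):

```python
import time

class UniqueTimestampProvider:
    """Return strictly increasing timestamps even when the underlying
    clock has coarse resolution and reports the same value twice."""

    def __init__(self, clock=time.time_ns):
        self._clock = clock
        self._last = -1

    def next_timestamp(self):
        now = self._clock()
        # Clock did not move past the last value (same "time bucket"):
        # bump by 1 tick so timestamps stay unique and monotonic.
        self._last = max(now, self._last + 1)
        return self._last

# With a frozen clock, every call still yields a unique, increasing value.
provider = UniqueTimestampProvider(clock=lambda: 1_000)
print([provider.next_timestamp() for _ in range(3)])  # [1000, 1001, 1002]
```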
Cassandra timestamp
• .Net's DateTime.Ticks is more precise than Cassandra's timestamps (100 ns vs. 1 µs)
• Our custom time provider ensured uniqueness by adding 1 tick at a time, which was lost in translation
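The loss is easy to reproduce: a .Net tick is 100 ns and a Cassandra timestamp is 1 µs, so two timestamps one tick apart collapse to the same stored value (illustrative Python check):

```python
# A .Net DateTime tick is 100 ns; Cassandra timestamps are microseconds.
NS_PER_TICK = 100
NS_PER_MICROSECOND = 1_000

def ticks_to_cassandra_micros(ticks):
    # Integer division truncates the sub-microsecond part of the tick count.
    return ticks * NS_PER_TICK // NS_PER_MICROSECOND

t1 = 635_500_000_000_000_000  # some tick count
t2 = t1 + 1                   # made "unique" by adding 1 tick (100 ns)...
# ...but both map to the same microsecond once stored in Cassandra:
print(ticks_to_cassandra_micros(t1) == ticks_to_cassandra_micros(t2))  # True
```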
"UselessKey"
• The Directory CF is really small and needs to be retrieved entirely and frequently
• We used a "bool UselessKey" partition key to force sequential storage and squeeze out the last bits of speed we needed
• Primary Key (UselessKey, Peer Id, MessageType)
• You should benchmark (after a flush) with your real data
UselessKey | Peer ID | MessageType | Subscription info
false | Peer.18 | UserCreated | { misc. Info }
… | … | … | …
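In CQL, the trick amounts to making a constant the partition key, so the whole table lives in a single, sequentially stored partition (illustrative sketch of the dynamic subscriptions table revisited; this deliberately gives up horizontal partitioning, which is acceptable here only because the CF is tiny):

```sql
CREATE TABLE directory.dynamic_subscriptions (
    useless_key  boolean,  -- always false: a constant partition key
    peer_id      text,
    message_type text,
    sub_info     blob,
    PRIMARY KEY (useless_key, peer_id, message_type)
);

-- Retrieving the whole directory becomes one single-partition read:
SELECT * FROM directory.dynamic_subscriptions WHERE useless_key = false;
```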
Summary
• When you have multiple servers sharing state, Cassandra can save you some headaches
• Schema design is critical: think it through and make sure you understand what is atomic and what is not
• Client-provided timestamps can be very useful, but be sure to generate unique timestamps
• If you are not using Java, be well aware of the data type differences between your language and Java
Want to see the code? www.github.com/Abc-Arbitrage
Want to see more code? [email protected]
Questions?