SnappyData Overview – Nike Tech Talk, 11/19/15
TRANSCRIPT
SnappyData: Getting Spark ready for real-time, operational analytics
www.snappydata.io
Jags Ramnarayan, jramnarayan@snappydata.io, Co-founder, SnappyData
Nov 2015
SnappyData - an EMC/Pivotal spin out
● New Spark-based open source project started by Pivotal GemFire founders + engineers
● Decades of in-memory data management experience
● Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database
Lambda Architecture (LA) for Analytics – SnappyData Focus
Perspective on LA for real time
[Diagram: Streams → Transform (data-in-motion analytics) → In-Memory DB (interactive queries, updates) and a deep-scale, high-volume MPP DB → Analytics Application, Alerts]
Use case: Telemetry
Revenue Generation
- Real-time location-based mobile advertising (B2B2C)
- Location-based services (B2C, B2B, B2B2C)
Revenue Protection
- Customer experience management to reduce churn
- Customer sentiment analysis
Network Efficiency
- Network bandwidth optimisation
- Network signalling maximisation
• Network optimization
  – E.g. re-route a call to another cell tower if congestion is detected
• Location-based ads
  – Match the incoming event to the subscriber profile; if ‘opt-in’, show a location-sensitive ad
• Challenge: too much streaming data
  – Many subscribers; lots of 2G/3G/4G voice/data
  – Network events: location events, CDRs, network issues
Challenge - Keeping up with streams
• Millions of events/sec
• HA – continuously ingest
• Cannot throttle the stream
• Diverse formats
Challenge - Transform is expensive
• Filter, normalize, transform
• Need reference data to normalize – point lookups into a reference DB (enterprise Oracle, …)
Challenge - Stream joins, correlations
Analyze over a time window
● Simple rules – if (CallDroppedCount > threshold) then alert
● Or complex (OLAP-like) queries
● TopK, trending, join with reference data, correlate with history
How do you keep up with OLAP-style analytics with millions of events in the window and billions of records in the reference data?
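The "simple rule" case above can be sketched in a few lines. This is an illustrative toy (the names CallDropMonitor, WINDOW_SECS, and THRESHOLD are made up for the example, not part of any SnappyData API): keep a sliding time window of dropped-call events per cell tower and alert when the in-window count exceeds a threshold.

```python
from collections import deque

WINDOW_SECS = 60   # sliding window length (illustrative)
THRESHOLD = 3      # alert when more than this many drops in the window

class CallDropMonitor:
    """Counts dropped-call events per cell tower over a sliding time window."""
    def __init__(self):
        self.events = {}  # cell_id -> deque of event timestamps

    def record(self, cell_id, ts):
        q = self.events.setdefault(cell_id, deque())
        q.append(ts)
        # Evict events that have fallen out of the window.
        while q and q[0] <= ts - WINDOW_SECS:
            q.popleft()
        # Simple rule: alert when the in-window drop count exceeds the threshold.
        return len(q) > THRESHOLD

monitor = CallDropMonitor()
alerts = [monitor.record("tower-7", t) for t in (0, 10, 20, 30, 40)]
# The 4th and 5th drops within one minute trigger alerts.
```

The hard part the slide asks about is doing this at millions of events/sec with OLAP-style joins against billions of reference rows, which a single-process sketch like this side-steps entirely.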
Challenge - State management
Manage generated state
● Mutating state: millions of counters
● “Once and only once” semantics
● Consistency across a distributed system
● State HA
Challenge - Interactive Query speed
Interactive queries
- OLAP-style queries
- High concurrency
- Low response time
Today: queue -> process -> NoSQL
- The messaging cluster adds extra hops and management overhead
- No distributed, HA data store
- Streaming joins, or joins with external state, are slow and in many cases not scalable
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics
- Batch design: high throughput
- Real-time design center: low latency, HA, concurrency
Vision: Drastically reduce the cost and complexity in modern big data.
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics
Batch design, high throughput; real-time operational analytics – TBs in memory
[Diagram: RDB with rows, txn, columnar storage, index; stream processing; AQP; APIs – ODBC, JDBC, REST; Spark – Scala, Java, Python, R; HDFS; MPP DB]
First commercial project on Approximate Query Processing (AQP)
Why columnar storage?
Why Spark?
● Blends streaming, interactive, and batch analytics
● Appeals to Java, R, Python, Scala folks
● Succinct programs
● Rich set of transformations and libraries
● RDD and fault tolerance without replication
● Stream processing with high throughput
Spark Myths
● It is a distributed in-memory database
  ○ It’s a computational framework with immutable caching
● It is highly available
  ○ Fault tolerance is not the same as HA
● It is well suited for real-time, operational environments
  ○ It does not handle concurrency well
Common Spark Streaming Architecture
[Diagram: Kafka queue feeding Spark executors; each executor holds RDD partitions @t0, @t1, @t2 over time; long-term state in Cassandra]
Client submits the stream app. The queue is buffered in the executors, and the driver submits a batch job every second; each batch results in a new RDD pushed onto the stream (a batch from the buffer). Short-term state is immutable; long-term state lives in an external DB.
Challenge: Spark driver not HA
[Diagram: client submits the stream app to the driver, which coordinates the executors]
If the driver fails, the executors automatically exit – all cached state has to be re-hydrated.
Challenge: Sharing state
[Diagram: Client1 and Client2 each get their own driver and executors]
• Spark is designed for total isolation across client apps
• Sharing state across clients requires an external DB/Tachyon
Challenge: External state management
[Diagram: Kafka queue → Spark executor holding RDD partitions @t0, @t1, @t2 over time; external state in Cassandra; client submits the stream app]
Key-based access might keep up, but joins and analytic operators are a problem; serialization and copying costs are too high, especially in JVMs.
newDStream = wordDstream.updateStateByKey[Int](func) – Spark's capability to update state as batches arrive requires a full iteration over the state RDD.
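A toy model of the updateStateByKey semantics just mentioned (plain Python for illustration; these function names are not Spark's API): each micro-batch, the update function is invoked for every key ever seen – a full pass over the state – even when the batch touches only a few keys. That full pass is the cost the slide is pointing at.

```python
def update_state_by_key(state, batch, func, calls):
    """Apply `func` per key each batch, mimicking Spark's full state scan."""
    new_values = {}
    for word in batch:
        new_values.setdefault(word, []).append(1)
    # Union of all state keys and batch keys -> full iteration over state.
    for key in set(state) | set(new_values):
        calls.append(key)  # track how many keys the function visits
        state[key] = func(new_values.get(key, []), state.get(key))
    return state

def running_count(new_vals, prev):
    return (prev or 0) + sum(new_vals)

state, calls = {}, []
update_state_by_key(state, ["a", "b", "a"], running_count, calls)
update_state_by_key(state, ["a"], running_count, calls)
# The second batch touched only "a", yet the update still visited "a" and "b".
```

With millions of keys in state and a small batch, visiting every key per second dominates the cost, which is why keyed point updates in a mutable store scale better for this workload.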
Challenge: “Once and only once” = hard
[Diagram: X = 10 in Cassandra; an executor applies X = X + 10 and the store reaches X = 20, but the OK is lost; the recovered partition replays X = X + 10, leaving X = 30 instead of 20]
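The diagram's failure mode exists because X = X + 10 is not idempotent: a replay after a lost acknowledgement double-applies the delta. One common fix, sketched below (this is an illustration, not SnappyData's actual mechanism), is to tag each update with an operation id and have the store discard ids it has already applied.

```python
class IdempotentCounter:
    """A counter whose increments can be safely replayed."""
    def __init__(self, value=0):
        self.value = value
        self.applied = set()  # operation ids already seen

    def add(self, op_id, delta):
        if op_id in self.applied:   # duplicate delivery: ignore
            return self.value
        self.applied.add(op_id)
        self.value += delta
        return self.value

c = IdempotentCounter(10)
c.add("op-1", 10)   # first delivery: 10 -> 20
c.add("op-1", 10)   # replay after recovery: still 20, not 30
```

In a real system the applied-id set must itself be durable and bounded (e.g. per-producer sequence numbers), which is part of why "once and only once" is hard.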
Challenge: Always on
[Diagram: Kafka queue → two Spark executors, each holding RDD partitions @t0, @t1, @t2 over time; client submits the stream app]
HA requirement: if something fails, there is always a redundant copy that is fully in sync, and failover is instantaneous.
Fault tolerance in Spark: recover state from the original source or a checkpoint by tracking lineage. This can take too long.
Challenge: Concurrent queries too slow
SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)

Berkeley AMPLab Big Data Benchmark – AWS m2.4xlarge; total of 342 GB
SnappyData: P2P cluster w/ consensus
[Diagram: Data Server JVM 1, Data Server JVM 2, Data Server JVM 3 in a peer-to-peer cluster]
● Cluster elects a coordinator
● Consistent views across members
● Virtual synchrony across members
● WHY? Strong consistency during replication; failure detection is accurate and fast
Colocated row/column Tables in Spark
[Diagram: each Spark executor runs tasks against a colocated row table and column table through the Spark Block Manager, alongside stream processing]
● Spark executors are long lived and shared across multiple apps
● Gem memory manager and Spark Block Manager are integrated
Table can be partitioned or replicated
[Diagram: three nodes; the replicated table has a consistent replica on each node, while the partitioned table's buckets (A-H, I-P, Q-W) are spread across the nodes, each bucket range also stored as a replica on another node]
Data is partitioned with one or more replicas.
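The bucket placement above can be sketched as follows. This is an illustrative model, not SnappyData's actual placement algorithm (SERVERS, NUM_BUCKETS, and REDUNDANCY are made-up names): keys hash to a fixed number of buckets, and each bucket is assigned a primary server plus REDUNDANCY replica servers.

```python
import hashlib

SERVERS = ["server-1", "server-2", "server-3"]
NUM_BUCKETS = 12   # fixed bucket count, independent of cluster size
REDUNDANCY = 1     # one extra copy of each bucket

def bucket_for(key):
    """Hash a row key to a bucket (md5 used only for a stable example hash)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def owners(bucket):
    """Primary server plus REDUNDANCY replicas, chosen round-robin."""
    primary = bucket % len(SERVERS)
    return [SERVERS[(primary + i) % len(SERVERS)] for i in range(1 + REDUNDANCY)]

b = bucket_for("subscriber-42")
# owners(b) -> the primary and replica holding that subscriber's data
```

Keeping the bucket count fixed while servers come and go means rebalancing moves whole buckets rather than rehashing every key.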
Linearly scale with shared partitions
[Diagram: Kafka queue partitions (Subscriber A-M, Subscriber N-Z) feed Spark executors that hold the matching subscriber partitions colocated with their reference data]
Linearly scale with partition pruning: the input queue, stream, in-memory DB, and output queue all share the same partitioning strategy.
Point access, updates, fast writes
● Row tables with PKs are distributed HashMaps
  ○ with secondary indexes
● Support for transactional semantics
  ○ read_committed, repeatable_read
● Support for scalable, high write rates
  ○ streaming data goes through stages
  ○ queued streams → intermediate storage (delta row buffer) → immutable compressed columns
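The staged write path above can be sketched in miniature. This is a simplified illustration, not the actual SnappyData implementation (ColumnTable and FLUSH_THRESHOLD are invented for the example): fast writes land in a mutable row-oriented delta buffer, and full buffers are pivoted into immutable column-oriented chunks that scan well.

```python
FLUSH_THRESHOLD = 3  # rows buffered before a columnar flush (tiny, for demo)

class ColumnTable:
    def __init__(self, columns):
        self.columns = columns
        self.delta = []    # mutable row buffer: absorbs recent writes
        self.chunks = []   # immutable column chunks: {col_name: [values]}

    def insert(self, row):
        self.delta.append(row)
        if len(self.delta) >= FLUSH_THRESHOLD:
            # Pivot buffered rows into one immutable columnar chunk.
            chunk = {c: [r[i] for r in self.delta]
                     for i, c in enumerate(self.columns)}
            self.chunks.append(chunk)
            self.delta = []

    def scan(self, col):
        """Column scan: one contiguous list per chunk, plus unflushed rows."""
        i = self.columns.index(col)
        for chunk in self.chunks:
            yield from chunk[col]
        for row in self.delta:
            yield row[i]

t = ColumnTable(["id", "price"])
for row in [(1, 10.0), (2, 12.5), (3, 9.0), (4, 11.0)]:
    t.insert(row)
prices = list(t.scan("price"))
```

Writes stay cheap (an append to the delta buffer) while analytic scans read mostly compressed, immutable columns; in a real engine the chunks would also be compressed and encoded.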
Full Spark Compatibility
● Any table is also visible as a DataFrame
● Any RDD[T]/DataFrame can be stored in SnappyData tables
● Tables appear like any JDBC-sourced table
  ○ but live in executor memory by default
● Additional API for updates, inserts, deletes

// Save a DataFrame using the Spark context
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)

// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")
Extends Spark

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name ( <column definition> )
USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name',                // default none
  PARTITION_BY 'PRIMARY KEY | column name',  // replicated table by default
  REDUNDANCY '1',                            // manage HA
  PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",
                                             // empty string maps to the default disk store
  OFFHEAP "true | false",
  EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
  ...
) [AS select_statement];
Key feature: Synopses Data
● Maintain stratified samples
  ○ intelligent sampling to keep error bounds low
● Probabilistic data structures
  ○ TopK for time series (using CMS; time aggregation, item aggregation)
  ○ histograms, HyperLogLog, Bloom filters, wavelets

CREATE SAMPLE TABLE sample-table-name USING columnar OPTIONS (
  BASETABLE 'table_name',  // source column table or stream table
  [ SAMPLINGMETHOD "stratified | uniform" ]
  STRATA name (
    QCS ("comma-separated-column-names")
    [ FRACTION "frac" ]
  ),+  // one or more QCS
)
Stratified Sampling Spark Demo
Driver HA, JobServer for interactive jobs
● REST-based JobServer for sharing a single context across clients
  ○ clients use REST to execute streaming jobs, queries, DML
  ○ secondary JobServer for HA
  ○ primary election using Gem clustering
● Native SnappyData cluster manager for long-running executors
  ○ makes resources (executors) long running
  ○ reuse the same executors across apps and jobs
● Low-latency scheduling that skips the Spark driver altogether
Unified OLAP/OLTP streaming w/ Spark
● Far fewer resources: a TB problem becomes a GB problem
  ○ CPU contention drops
● Far less complex
  ○ single cluster for stream ingestion, continuous queries, interactive queries, and machine learning
● Much faster
  ○ compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
SnappyData is Open Source
● Beta will be on GitHub before December. We are looking for contributors!
● Learn more & register for the beta: www.snappydata.io
● Connect:
  ○ twitter: www.twitter.com/snappydata
  ○ facebook: www.facebook.com/snappydata
  ○ linkedin: www.linkedin.com/snappydata
  ○ slack: http://snappydata-slackin.herokuapp.com
  ○ IRC: irc.freenode.net #snappydata
Extras
OLAP/OLTP with Synopses
[Diagram: user applications processing events and issuing interactive queries; CQ subscriptions; OLAP query engine; micro-batch processing module (plugins); sliding window emits batches into a Summary DB and an Exact DB (row + column oriented)]
Summary DB:
▪ Time series with decay
▪ TopK, frequency summary structures
▪ Counters
▪ Histograms
▪ Stratified samples
▪ Raw data windows
Not a panacea, but comes close
● Synopses require prior workload knowledge
● Not all queries qualify – complex queries will result in high error rates
● Our strategy – be adjunct to MPP databases
  ○ first compute the error estimate; if the error is above tolerance, delegate to the exact store
  ○ adjunct store in certain scenarios
Speed/Accuracy tradeoff
[Chart: error vs. execution time (sample size); interactive queries answer in ~2 sec on a sample, vs. ~30 mins to execute on the entire dataset]
Stratified Sampling
● Random sampling has intuitive semantics
● However, data is typically skewed and our queries are multi-dimensional
  ○ avg sales order price for each product class for each geography
  ○ some products may have little to no sales
  ○ stratification ensures that each “group” (product class) is represented
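The idea can be shown in a few lines. This is a minimal batch sketch, not SnappyData's sampler (stratified_sample and min_per_group are invented for the example): group rows by the stratification column, then sample each group independently, so even rare groups keep at least one representative.

```python
import random

def stratified_sample(rows, group_key, fraction, min_per_group=1):
    """Sample `fraction` of each group, keeping at least min_per_group rows."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    sample = []
    for members in groups.values():
        k = max(min_per_group, int(len(members) * fraction))
        sample.extend(random.sample(members, min(k, len(members))))
    return sample

rows = (
    [{"product": "shoes", "amount": i} for i in range(1000)]
    + [{"product": "laces", "amount": i} for i in range(5)]  # rare group
)
sample = stratified_sample(rows, "product", fraction=0.01)
# A plain 1% uniform sample would likely miss "laces" entirely;
# the stratified sample always includes it.
```

A per-group average computed on this sample therefore has an answer for every product class, which is exactly the multi-dimensional-query problem the slide describes.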
Stratified Sampling Challenges
● Solutions exist for batch data (BlinkDB)
● Needs to work for infinite streams of data
  ○ Answer: use stratification in combination with other techniques like Bernoulli/reservoir sampling
  ○ Exponentially decay over time
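The stream-side piece mentioned above is classic reservoir sampling, which keeps a fixed-size uniform sample of an unbounded stream; running one reservoir per stratum combines it with stratification. A sketch under that framing, not SnappyData's code:

```python
import random

class ReservoirSampler:
    """Maintains a uniform random sample of fixed size over a stream."""
    def __init__(self, capacity, rng=random):
        self.capacity = capacity
        self.seen = 0
        self.reservoir = []
        self.rng = rng

    def offer(self, item):
        self.seen += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(item)
        else:
            # Item i (1-based) replaces a reservoir slot with prob capacity/i,
            # which keeps every seen item equally likely to be in the sample.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.reservoir[j] = item

s = ReservoirSampler(capacity=100)
for event in range(10_000):
    s.offer(event)
# s.reservoir is now a uniform sample of 100 of the 10,000 events.
```

Exponential decay, the slide's second point, would bias replacement toward recent items so the sample tracks the current window rather than all history.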
Dealing with errors and latency
● Well-known error techniques for “closed form aggregations”
● Exploring other techniques – analytical bootstrap
● User can specify an error bound with a confidence interval:

SELECT avg(sessionTime) FROM Table
WHERE city = 'San Francisco'
ERROR 0.1 CONFIDENCE 95.0%

● The engine determines if it can satisfy the error bound first
● If not, it delegates execution to an “exact” store (GPDB, etc.)
● Query execution can also be latency bounded
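The "satisfy the bound or delegate" decision can be sketched for the closed-form AVG case. This is an illustration, not the engine's actual estimator: under a normal approximation, z = 1.96 gives a 95% confidence interval for the sample mean, and we delegate when the relative half-width exceeds the requested error bound.

```python
import math
import statistics

def estimate_with_bound(sample, error_bound, z=1.96):
    """Return (estimate, relative_error) or (None, relative_error) to delegate."""
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(len(sample))
    half_width = z * stderr                    # CI half-width at ~95% confidence
    relative_error = half_width / abs(mean)
    if relative_error <= error_bound:
        return mean, relative_error            # sample answer is good enough
    return None, relative_error                # delegate to the exact store

tight = [100 + (i % 5) for i in range(400)]    # low variance: sample suffices
answer, err = estimate_with_bound(tight, error_bound=0.1)

skewed = [1] * 9 + [10_000]                    # heavy skew: bound not met
answer2, err2 = estimate_with_bound(skewed, error_bound=0.1)
```

The skewed case is also why the earlier slide stresses stratification: without it, a rare heavy stratum inflates the variance and forces delegation.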
Sketching techniques
● Sampling is not effective for outlier detection
  ○ MAX/MIN etc.
● Other probabilistic structures like CMS, heavy hitters, etc.
● We implemented Hokusai
  ○ capture frequencies of items in time series
● Design permits TopK queries over arbitrary time intervals (e.g. the Top-100 popular URLs):

SELECT pageURL, count(*) frequency FROM Table
WHERE ... GROUP BY ...
ORDER BY frequency DESC
LIMIT 100
Demo
[Diagram: Zeppelin Server → Zeppelin Spark interpreter (driver) → Spark executor JVMs, each with a row cache and a compressed columnar store]
A new approach to Real Time Analytics
Streaming analytics + probabilistic data + distributed in-memory SQL
Deep integration of Spark + Gem: a unified cluster, AlwaysOn, cloud ready, for real-time analytics
Vision – Drastically reduce the cost and complexity in modern big data … using a fraction of the resources: 10X better response time, 10X lower resource cost, 10X less complexity
[Diagram: integrate with a deep-scale, high-volume MPP DB]