query processing, resource management, and approximation in a data stream management system jennifer...

68
Query Processing, Resource Query Processing, Resource Management, and Management, and Approximation in a Data Approximation in a Data Stream Management System Stream Management System Jennifer Widom Stanford University stanfordstreamdatamanager

Upload: rebecca-chandler

Post on 17-Dec-2015

217 views

Category:

Documents


1 download

TRANSCRIPT

Query Processing, Resource Query Processing, Resource Management, and Approximation in a Management, and Approximation in a

Data Stream Management SystemData Stream Management System

Jennifer Widom

Stanford University

stanfordstreamdatamanager

stanfordstreamdatamanager 2

Formula for a Formula for a Database Research ProjectDatabase Research Project

• Pick a simple but fundamental assumption underlying traditional database systems– Drop it

• Reconsider all aspects of data management and query processing– Many Ph.D. theses

– Prototype from scratch

stanfordstreamdatamanager 3

Following the FormulaFollowing the Formula

• We followed this formula once before– The LORE project

– Dropped assumption: Data has a fixed schema declared in advance

– Semistructured dataSemistructured data

• The STREAM Project– Dropped assumption:

First load data, then index it, then run queries

– Continuous data streams (+ continuous queries)Continuous data streams (+ continuous queries)

stanfordstreamdatamanager 4

Data StreamsData Streams• Continuous, unbounded, rapid, time-varying

streams of data elements

• Occur in a variety of modern applications– Network monitoring and traffic engineering– Sensor networks– Telecom call records– Financial applications– Web logs and click-streams– Manufacturing processes

• DSMSDSMS = Data Stream Management System

stanfordstreamdatamanager 5

DBMS versus DSMSDBMS versus DSMS

stanfordstreamdatamanager 6

DBMS versus DSMSDBMS versus DSMS• Persistent relations • Transient streams (and

persistent relations)

stanfordstreamdatamanager 7

DBMS versus DSMSDBMS versus DSMS• Persistent relations

• One-time queries

• Transient streams (and persistent relations)

• Continuous queries

stanfordstreamdatamanager 8

DBMS versus DSMSDBMS versus DSMS• Persistent relations

• One-time queries

• Random access

• Transient streams (and persistent relations)

• Continuous queries

• Sequential access

stanfordstreamdatamanager 9

DBMS versus DSMSDBMS versus DSMS• Persistent relations

• One-time queries

• Random access

• Access plan determined by query processor and physical DB design

• Transient streams (and persistent relations)

• Continuous queries

• Sequential access

• Unpredictable data characteristics and arrival patterns

stanfordstreamdatamanager 10

The STREAM SystemThe STREAM System• Data streams and stored relations

• Declarative language for registering continuous queries

• Flexible query plans

• Designed to cope with high data rates and query workloads– Graceful approximation when needed

– Careful resource allocation and usage

• Relational, centralized (for now)

stanfordstreamdatamanager 11

Contributions to DateContributions to Date

• Semantics for continuous queries

• Query plans

• Exploiting stream constraints

• Operator scheduling

• Approximation techniques

• Resource allocation to maximize precision

• Initial running prototype

stanfordstreamdatamanager 12

DSMS

Scratch Store

The (Simplified) Big PictureThe (Simplified) Big Picture

Input streams

RegisterQuery

StreamedResult

StoredResult

Archive

StoredRelations

stanfordstreamdatamanager 13

(Simplified) Network Monitoring(Simplified) Network Monitoring

RegisterMonitoring

Queries

DSMS

Scratch Store

Network measurements,Packet traces

IntrusionWarnings

OnlinePerformance

Metrics

Archive

LookupTables

stanfordstreamdatamanager 14

Using Conventional DBMSUsing Conventional DBMS

• Data streams as relation insertsrelation inserts, continuous queries as triggers triggers or materialized views materialized views

• Problems with this approach– Inserts are typically batched, high overhead

– Expressiveness: simple conditions (triggers), no built-in notion of sequence (views)

– No notion of approximation, resource allocation

– Current systems don’t scale to large # of triggers

– Views don’t provide streamed results

• But we (and others) plan to compareBut we (and others) plan to compare

stanfordstreamdatamanager 15

Declarative Language for Declarative Language for Continuous QueriesContinuous Queries

• A distinction between STREAM and the Aurora project– Aurora users directly manipulate one large

execution plan

– STREAM compiles declarative queries into individual plans, system may merge plans

– STREAM also supports direct entry of plans

• Syntax based on SQL, additional constructs for sliding windows and sampling

stanfordstreamdatamanager 16

Example Query 1Example Query 1Two streams, contrived for ease of examples:

Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

stanfordstreamdatamanager 17

Example Query 1Example Query 1Two streams, contrived for ease of examples:

Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

stanfordstreamdatamanager 18

Example Query 1Example Query 1Two streams, contrived for ease of examples:

Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

stanfordstreamdatamanager 19

Example Query 1Example Query 1Two streams, contrived for ease of examples:

Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)From Orders O Orders O, Fulfillments F [Range 1 Day]

Where O.orderID = F.orderIDWhere O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

stanfordstreamdatamanager 20

Example Query 1Example Query 1Two streams, contrived for ease of examples:

Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue”And F.clerk = “Sue” And O.customer = “Joe”And O.customer = “Joe”

stanfordstreamdatamanager 21

Example Query 1Example Query 1Two streams, contrived for ease of examples:

Orders (orderID, customer, cost) Fulfillments (orderID, clerk)

Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”

Select Sum(O.cost)Select Sum(O.cost)From Orders O, Fulfillments F [Range 1 Day]Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”

stanfordstreamdatamanager 22

Example Query 2Example Query 2

Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost

Select F.clerk, Max(O.cost)From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerk

stanfordstreamdatamanager 23

Example Query 2Example Query 2

Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost

Select F.clerk, Max(O.cost)From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleFulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerk

stanfordstreamdatamanager 24

Example Query 2Example Query 2

Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost

Select F.clerk, Max(O.cost)From Orders O, From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDWhere O.orderID = F.orderIDGroup By F.clerk

stanfordstreamdatamanager 25

Example Query 2Example Query 2

Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost

Select F.clerk, Max(O.cost)Select F.clerk, Max(O.cost)From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% SampleWhere O.orderID = F.orderIDGroup By F.clerkGroup By F.clerk

stanfordstreamdatamanager 26

Semantics of Database LanguagesSemantics of Database Languages

• An often neglected topic

• Traditional relational databases are in reasonable shape– Relational algebra SQL

• But triggers were a mess

• The semantics of an innocent-looking continuous query over data streams may not be obvious

stanfordstreamdatamanager 27

A Nonobvious Continuous QueryA Nonobvious Continuous Query

• Stream of stock quotes: Stocks(ticker,price)

• Monitor last 10 minutes of quotes:

Select From Stocks [Range 10 minutes]

• Is result a relation, a stream, or something else?

• If a relation, what exactly does it contain?

• If a stream, how does query differ from:Select From Stocks [Range 1 minute]or Select From Stocks []

stanfordstreamdatamanager 28

Our Semantics and Language Our Semantics and Language for Continuous Queries for Continuous Queries

• Abstract: Abstract: interpretation for CQs based on certain “black boxes”

• Concrete: Concrete: SQL-based instantiation for our system; includes syntactic shortcuts, defaults, equivalences

• Goals– CQs over multiple streams and relations

– Exploit relational semantics to the extent possible

– Easy queries should be easy to write, simple queries should do what you expect

stanfordstreamdatamanager 29

Relations and StreamsRelations and Streams

• Assume global, discrete, ordered time domain (more on this later)

• Relation– Maps time T to set-of-tuples R

• Stream– Set of (tuple,timestamp) elements

stanfordstreamdatamanager 30

ConversionsConversions

Streams Relations

Window specification

Special operators:Istream, Dstream, Rstream

Any relational query language

stanfordstreamdatamanager 31

Conversion DefinitionsConversion Definitions

• Stream-to-relation– S [W] is a relation — at time T it contains all tuples

in window W applied to stream S up to T

– When W = , contains all tuples in stream S up to T

• Relation-to-stream– Istream(R) contains all (r,T ) where rR at time T

but rR at time T–1

– Dstream(R) contains all (r,T ) where rR at time T–1 but rR at time T

– Rstream(R) contains all (r,T ) where rR at time T

stanfordstreamdatamanager 32

Abstract SemanticsAbstract Semantics

• Take any relational query language

• Can reference streams in place of relations– But must convert to relations using any window

specification language ( default window = [] )

• Can convert relations to streams– For streamed results

– For windows over relations (note: converts back to relation)

stanfordstreamdatamanager 33

Query Result at Time Query Result at Time TT

• Use all relations at time T

• Use all streams up to T, converted to relations

• Compute relational result

• Convert result to streams if desired

stanfordstreamdatamanager 34

TimeTime

• Easiest: global system clock– Stream elements and relation updates

timestamped on entry to system

• Application-defined time– Streams and relation updates contain application

timestamps, may be out of order

– Application generates “heartbeat”• Or deduce heartbeat from parameters: stream

skew, scrambling, latency, and clock progress– Query results in application time

stanfordstreamdatamanager 35

Abstract Semantics – Example 1Abstract Semantics – Example 1Select F.clerk, Max(O.cost)From O [], F [Rows 1000]Where O.orderID = F.orderIDGroup By F.clerk

• Maximum-cost order fulfilled by each clerk in last 1000 fulfillments

stanfordstreamdatamanager 36

Abstract Semantics – Example 1Abstract Semantics – Example 1Select F.clerk, Max(O.cost)From O [], F [Rows 1000]Where O.orderID = F.orderIDGroup By F.clerk

• At time T: entire stream O and last 1000 tuples of F as relations

• Evaluate query, update result relation at T

stanfordstreamdatamanager 37

Abstract Semantics – Example 1Abstract Semantics – Example 1Select IstreamIstream(F.clerk, Max(O.cost))From O [], F [Rows 1000]Where O.orderID = F.orderIDGroup By F.clerk

• At time T: entire stream O and last 1000 tuples of F as relations

• Evaluate query, update result relation at T

• Streamed result:Streamed result: New element (<clerk,max>,T) whenever <clerk,max> changes from T–1

stanfordstreamdatamanager 38

Abstract Semantics – Example 2Abstract Semantics – Example 2

Relation CurPrice(stock, price)

Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock

• Average price over last day for each stock

stanfordstreamdatamanager 39

Abstract Semantics – Example 2Abstract Semantics – Example 2

Relation CurPrice(stock, price)

Select stock, Avg(price) From Istream(CurPrice) [Range 1 Day] Group By stock

• Istream provides history of CurPrice

• Window on history, back to relation, group and aggregate

stanfordstreamdatamanager 40

Concrete Language – CQLConcrete Language – CQL• Relational query language: SQL

• Window spec. language derived from SQL-99– Tuple-based, time-based, partitioned

• Syntactic shortcuts and defaultsSyntactic shortcuts and defaults– So easy queries are easy to write and simple

queries do what you expect

• EquivalencesEquivalences– Basis for query-rewrite optimizations

– Includes all relational equivalences, plus new stream-based ones

stanfordstreamdatamanager 41

Two Extremely Simple Two Extremely Simple CQL ExamplesCQL Examples

Select From Strm• Had better return Strm (It does)

– Default window for Strm– Default Istream for result

Select From Strm, Rel Where Strm.A = Rel.B• Often want “NOW” window for Strm

• But may not want as default

stanfordstreamdatamanager 42

Query ExecutionQuery Execution

• When a continuous query is registered, generation a query planquery plan– Users can also register plans directly

• Plans composed of three main components:– OperatorsOperators (as in most conventional DBMS’s)

– Inter-operator QueuesQueues (as in many conventional DBMS’s)

– State (synopses)State (synopses)

• Global schedulerscheduler for plan execution

stanfordstreamdatamanager 43

Operators and StateOperators and State

• State (synopsessynopses)– Summarize tuples seen so far (exact or

approximate) for operators requiring history

– To implement windows

• Example: synopsis join– Sliding-window join

– Approximation of full join

State1 State2⋈

stanfordstreamdatamanager 44

Simple Query PlanSimple Query Plan

State4⋈State3

Stream1 Stream2

Stream3

Q1 Q2

State1 State2⋈

Scheduler

stanfordstreamdatamanager 45

Some Issues in Query Plan Generation Some Issues in Query Plan Generation

• Compatibility and conversions for streams and relations (+/- streams+/- streams)

• State sharing, incremental computation

• Windowed joinsWindowed joins: Multiway versus 2-way

• Windows in general: push down, pull up, split, merge, …

• Time coordination, operator-level heartbeats

stanfordstreamdatamanager 46

Memory Overhead in Query ProcessingMemory Overhead in Query Processing

• Queues + State

• Continuous queries keep state indefinitely

• Online requirements suggest using memory rather than disk– But we realize this assumption is shaky

stanfordstreamdatamanager 47

Memory Overhead in Query ProcessingMemory Overhead in Query Processing

• Queues + State

• Continuous queries keep state indefinitely

• Online requirements suggest using memory rather than disk– But we realize this assumption is shaky

• Goal: minimize memory use while providing Goal: minimize memory use while providing timely, accurate answerstimely, accurate answers

stanfordstreamdatamanager 48

Reducing Memory OverheadReducing Memory Overhead

Two main techniques to date

1) Exploit constraints on streamsconstraints on streams to reduce statereduce state

2) Clever operator schedulingoperator scheduling to reduce queue sizesreduce queue sizes

stanfordstreamdatamanager 49

Exploiting Stream ConstraintsExploiting Stream Constraints

• For most queries, unbounded memory is required for arbitrary streams [PODS ’01]

stanfordstreamdatamanager 50

Exploiting Stream ConstraintsExploiting Stream Constraints

• For most queries, unbounded memory is required for arbitraryarbitrary streams [PODS ’01]

• But streams may exhibit constraints that reduce, bound, or even eliminate state

stanfordstreamdatamanager 51

Exploiting Stream ConstraintsExploiting Stream Constraints

• For most queries, unbounded memory is required for arbitrary streams [PODS ’01]

• But streams may exhibit constraints that reduce, bound, or even eliminate state

• Conventional database constraints– Redefined for streams

– “Relaxed” for stream environment

stanfordstreamdatamanager 52

Stream ConstraintsStream Constraints

• Each constraint type defines adherence adherence parameterparameter k

• Clustered(k) for attribute S.A

• Ordered(k) for attribute S.A

• Referential-Integrity(k) for join S1 S2

stanfordstreamdatamanager 53

Algorithm for Exploiting ConstraintsAlgorithm for Exploiting Constraints

• Input– Any Select-Project-Join query over streams

– Any set of k-constraints

• Output– Query execution plan that reduces or eliminates

state based on k-constraints

– If constraints violated, get approximate resultIf constraints violated, get approximate result

stanfordstreamdatamanager 54

Constraint ExamplesConstraint ExamplesOrders (orderID, cost)Fulfillments (orderID, portion, clerk)

Query: Many-one join F O

stanfordstreamdatamanager 55

Constraint ExamplesConstraint ExamplesOrders (orderID, cost)Fulfillments (orderID, portion, clerk)

Query: Many-one join F O

• Clustered(k) on F.orderID Matched O tuples discarded after k arrivals of non-

matching F’s

stanfordstreamdatamanager 56

Constraint ExamplesConstraint ExamplesOrders (orderID, cost)Fulfillments (orderID, portion, clerk)

Query: Many-one join F O

• Clustered(k) on F.orderID Matched O tuples discarded after k arrivals of non-

matching F’s

• Referential-Integrity(k) F tuples retained for at most k arrivals of O tuples

stanfordstreamdatamanager 57

Operator SchedulingOperator Scheduling• Global scheduler invokes run method of query

plan operators with “timeslice” parameter

• Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, …

– First scheduler: round-robin

Second scheduler: minimize queue sizes

– Third scheduler: minimize combination of queue sizes and latency

stanfordstreamdatamanager 58

SchedulingScheduling Algorithm Algorithm• Goal: minimize total queue size for

unpredictable, bursty stream arrival patterns

• Proven: within constant factor of any “clairvoyant” strategy for some queries

• Empirical results: large savings over naive strategies for many queries

• But minimizing queue sizes is at odds with minimizing latency

stanfordstreamdatamanager 59

Scheduling Algorithm (contd.)Scheduling Algorithm (contd.)

• Always schedule ready chainready chain with steepest downslope

• Plan progress chartprogress chart: memory delta per unit time

• Independent scheduling of operators makes mistakes

• Consider chainschains of operators

Mem

ory

1 3 4 8

10.9

2

0.2

Op1Op2

Op3

Op4

Computation time

stanfordstreamdatamanager 60

ApproximationApproximation

• Why approximate?– Memory requirement too high, even with

constraints and clever scheduling

– Can’t process streams fast enough for query load

stanfordstreamdatamanager 61

Approximation (cont’d)Approximation (cont’d)

• Static:Static: rewrite queries to add (or shrink) sampling or windows+ User can participate, predictable behavior

– Doesn’t consider dynamic conditions

• Dynamic:Dynamic: modify query plan – insert sampling operators, shrink windows, load shedding+ Adapts to current resource availability

– How to convey to user? (major open issue)(major open issue)

stanfordstreamdatamanager 62

Resource Allocation to Resource Allocation to Maximize PrecisionMaximize Precision

• Query plan: tree of operators

• Each operator has function from resource function from resource allocation allocation RR to precision to precision PP

• Simple (FP,FN) precision model– FP = probability of answer being false positive

– FN = # false negatives per correct answer

stanfordstreamdatamanager 63

Precision of an OperatorPrecision of an Operator

State1 State2⋈

Approx.Answer

ExactAnswer

Precision:Precision: FP = False PositivesFP = False Positives FN = False NegativesFN = False Negatives

stanfordstreamdatamanager 64

Precision of Two OperatorsPrecision of Two Operators

(FP1,FN1)(FP1,FN1)

State2State1 ⋈

Approx.Answer

ExactAnswer

State3

Approx.Answer

ExactAnswer (FP2,FN2)(FP2,FN2)

stanfordstreamdatamanager 65

The Problem StatementThe Problem Statement

• Given a plan, precision function for each operator in the plan, and R total resources, allocate R to operators to maximize result precision

• Solution– For each operator type: formula for calculating For each operator type: formula for calculating

output precision given input precision and operator output precision given input precision and operator resource allocationresource allocation

– Assume <0,0> precision for input streams

– Becomes optimization problem

stanfordstreamdatamanager 66

The Holy GrailThe Holy Grail

• Given:– Declarative query

– Resources

– Constraints on streams

Generate plan and resource allocation that takes advantage of constraints and maximizes precision

• Do it for multiple (weighted) queries, dynamically and adaptively, and convey what’s happening to the user

stanfordstreamdatamanager 67

The Stream Systems LandscapeThe Stream Systems Landscape

• (At least) three general-purpose DSMS prototypes underway– STREAMSTREAM (Stanford)

– AuroraAurora (Brown, Brandeis, MIT)

– TelegraphCQ TelegraphCQ (Berkeley)

• All will be demo’d at SIGMOD ’03

• Cooperating to develop stream system stream system benchmarkbenchmarkGoal: demonstrate that conventional systems are

far inferior for data stream applications

stanfordstreamdatamanager 68

http://www-db.stanford.edu/streamhttp://www-db.stanford.edu/stream

Contributors: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Justin Rosenstein, Rohit Varma