Query Processing, Resource Management, and Approximation in a Data Stream Management System
Jennifer Widom
Stanford University
stanfordstreamdatamanager

Formula for a Database Research Project
• Pick a simple but fundamental assumption underlying traditional database systems
– Drop it
• Reconsider all aspects of data management and query processing
– Many Ph.D. theses
– Prototype from scratch

Following the Formula
• We followed this formula once before
– The LORE project
– Dropped assumption: Data has a fixed schema declared in advance
– Semistructured data
• The STREAM Project
– Dropped assumption: First load data, then index it, then run queries
– Continuous data streams (+ continuous queries)

Data Streams
• Continuous, unbounded, rapid, time-varying streams of data elements
• Occur in a variety of modern applications
– Network monitoring and traffic engineering
– Sensor networks
– Telecom call records
– Financial applications
– Web logs and click-streams
– Manufacturing processes
• DSMS = Data Stream Management System

DBMS versus DSMS
DBMS:
• Persistent relations
• One-time queries
• Random access
• Access plan determined by query processor and physical DB design
DSMS:
• Transient streams (and persistent relations)
• Continuous queries
• Sequential access
• Unpredictable data characteristics and arrival patterns

The STREAM System
• Data streams and stored relations
• Declarative language for registering continuous queries
• Flexible query plans
• Designed to cope with high data rates and query workloads
– Graceful approximation when needed
– Careful resource allocation and usage
• Relational, centralized (for now)

Contributions to Date
• Semantics for continuous queries
• Query plans
• Exploiting stream constraints
• Operator scheduling
• Approximation techniques
• Resource allocation to maximize precision
• Initial running prototype

The (Simplified) Big Picture
[Diagram: input streams flow into the DSMS, which has a scratch store. Users register queries and receive streamed results or stored results. The DSMS is also connected to an archive and to stored relations.]

(Simplified) Network Monitoring
[Diagram: network measurements and packet traces flow into the DSMS, which has a scratch store. Monitoring queries are registered; outputs are intrusion warnings and online performance metrics. The DSMS is also connected to an archive and to lookup tables.]

Using Conventional DBMS
• Data streams as relation inserts, continuous queries as triggers or materialized views
• Problems with this approach
– Inserts are typically batched, high overhead
– Expressiveness: simple conditions (triggers), no built-in notion of sequence (views)
– No notion of approximation, resource allocation
– Current systems don’t scale to large # of triggers
– Views don’t provide streamed results
• But we (and others) plan to compare

Declarative Language for Continuous Queries
• A distinction between STREAM and the Aurora project
– Aurora users directly manipulate one large execution plan
– STREAM compiles declarative queries into individual plans, system may merge plans
– STREAM also supports direct entry of plans
• Syntax based on SQL, additional constructs for sliding windows and sampling

Example Query 1
Two streams, contrived for ease of examples:
Orders (orderID, customer, cost)
Fulfillments (orderID, clerk)
Total cost of orders fulfilled over the last day by clerk “Sue” for customer “Joe”
Select Sum(O.cost)
From Orders O, Fulfillments F [Range 1 Day]
Where O.orderID = F.orderID And F.clerk = “Sue” And O.customer = “Joe”
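To make the windowed join in Example Query 1 concrete, here is a minimal brute-force sketch in Python. It is not how STREAM evaluates the query. The class names, the `ts` field, integer timestamps in seconds, and the assumption that orderID is unique within Orders are all illustrative assumptions that the slide does not state.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # one day, assuming timestamps are in seconds


@dataclass(frozen=True)
class Order:
    ts: int          # arrival timestamp (assumed)
    order_id: int
    customer: str
    cost: float


@dataclass(frozen=True)
class Fulfillment:
    ts: int          # arrival timestamp (assumed)
    order_id: int
    clerk: str


def query1(orders, fulfillments, now):
    """Evaluate Example Query 1 at time `now` by recomputing from scratch."""
    # Orders carries no window, so every order seen up to `now` is visible.
    visible = {o.order_id: o for o in orders if o.ts <= now}
    # Fulfillments [Range 1 Day]: only fulfillments from the last day are visible.
    recent = [f for f in fulfillments if now - DAY < f.ts <= now]
    return sum(
        visible[f.order_id].cost
        for f in recent
        if f.order_id in visible
        and f.clerk == "Sue"
        and visible[f.order_id].customer == "Joe"
    )
```

Calling `query1` at two different times shows the continuous-query behaviour: a fulfillment contributes to the sum while it is inside the one-day window and stops contributing once it slides out.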

Example Query 2
Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost
Select F.clerk, Max(O.cost)
From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample
Where O.orderID = F.orderID
Group By F.clerk
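The partitioned window in Example Query 2 can be pictured as one small tuple-based window per clerk. The Python sketch below illustrates only that window, leaving out the 10% sample and the join. The function name and the tuple layout are assumptions made for the example.

```python
from collections import defaultdict, deque


def partitioned_rows(stream, key, n):
    """[Partition By key Rows n]: keep the n most recent elements per key value."""
    # One bounded deque per partition; appending to a full deque drops the oldest element.
    windows = defaultdict(lambda: deque(maxlen=n))
    for element in stream:
        windows[key(element)].append(element)
    return {k: list(w) for k, w in windows.items()}
```

For instance, `partitioned_rows(fulfillments, key=lambda f: f[0], n=5)` would keep the five most recent fulfillments for each clerk if the clerk is the first field of each tuple.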

Semantics of Database Languages
• An often neglected topic
• Traditional relational databases are in reasonable shape
– Relational algebra and SQL
• But triggers were a mess
• The semantics of an innocent-looking continuous query over data streams may not be obvious

A Nonobvious Continuous Query
• Stream of stock quotes: Stocks(ticker,price)
• Monitor last 10 minutes of quotes:
Select * From Stocks [Range 10 minutes]
• Is result a relation, a stream, or something else?
• If a relation, what exactly does it contain?
• If a stream, how does query differ from:
Select * From Stocks [Range 1 minute]
or Select * From Stocks [∞]

Our Semantics and Language for Continuous Queries
• Abstract: interpretation for CQs based on certain “black boxes”
• Concrete: SQL-based instantiation for our system; includes syntactic shortcuts, defaults, equivalences
• Goals
– CQs over multiple streams and relations
– Exploit relational semantics to the extent possible
– Easy queries should be easy to write, simple queries should do what you expect

Relations and Streams
• Assume global, discrete, ordered time domain (more on this later)
• Relation
– Maps time T to set-of-tuples R
• Stream
– Set of (tuple,timestamp) elements

Conversions
[Diagram: conversions between streams and relations]
• Streams to relations: window specification
• Relations to streams: special operators Istream, Dstream, Rstream
• Relations to relations: any relational query language

Conversion Definitions
• Stream-to-relation
– S [W] is a relation: at time T it contains all tuples in window W applied to stream S up to T
– When W = ∞, contains all tuples in stream S up to T
• Relation-to-stream
– Istream(R) contains all (r,T) where r ∈ R at time T but r ∉ R at time T–1
– Dstream(R) contains all (r,T) where r ∈ R at time T–1 but r ∉ R at time T
– Rstream(R) contains all (r,T) where r ∈ R at time T
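The conversion definitions above translate almost directly into set operations. The following Python sketch is an illustration under stated assumptions, not the STREAM implementation: time is modeled as integers, a relation is a function from a time to a set of tuples, relations use set (not multiset) semantics, and the function names are made up for the example.

```python
def window_rows(stream, t, n=None):
    """S [Rows n] at time t; n=None stands for the unbounded window."""
    # A stream is a collection of (tuple, timestamp) elements.
    seen = [tup for (tup, ts) in sorted(stream, key=lambda e: e[1]) if ts <= t]
    if n is None:
        return set(seen)
    return set(seen[max(0, len(seen) - n):])


def istream(rel, times):
    """All (r, T) where r is in R at time T but not at time T-1."""
    return {(r, t) for t in times for r in rel(t) - rel(t - 1)}


def dstream(rel, times):
    """All (r, T) where r is in R at time T-1 but not at time T."""
    return {(r, t) for t in times for r in rel(t - 1) - rel(t)}


def rstream(rel, times):
    """All (r, T) where r is in R at time T."""
    return {(r, t) for t in times for r in rel(t)}
```

With `stream = [("a", 1), ("b", 2), ("c", 3)]` and `rel = lambda t: window_rows(stream, t, 2)`, applying `istream` over times 0 through 3 gives back the original stream, while `dstream` reports that "a" leaves the two-row window at time 3.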

Abstract Semantics
• Take any relational query language
• Can reference streams in place of relations
– But must convert to relations using any window specification language (default window = [∞])
• Can convert relations to streams
– For streamed results
– For windows over relations (note: converts back to relation)

Query Result at Time T
• Use all relations at time T
• Use all streams up to T, converted to relations
• Compute relational result
• Convert result to streams if desired

Time
• Easiest: global system clock
– Stream elements and relation updates timestamped on entry to system
• Application-defined time
– Streams and relation updates contain application timestamps, may be out of order
– Application generates “heartbeat”
  • Or deduce heartbeat from parameters: stream skew, scrambling, latency, and clock progress
– Query results in application time
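One way to picture how heartbeats tame out-of-order application timestamps is a reorder buffer that holds arrivals until a heartbeat says they are safe to release. The Python sketch below is only an illustration: the class name is hypothetical, and it assumes a heartbeat with value ts promises that no later arrival will carry a timestamp of ts or less.

```python
import heapq


class ReorderBuffer:
    """Buffer out-of-order elements and release them in timestamp order."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so payloads are never compared with each other

    def insert(self, ts, tup):
        heapq.heappush(self._heap, (ts, self._seq, tup))
        self._seq += 1

    def heartbeat(self, ts):
        """Release every buffered element with timestamp <= ts, oldest first."""
        out = []
        while self._heap and self._heap[0][0] <= ts:
            t, _, tup = heapq.heappop(self._heap)
            out.append((tup, t))
        return out
```

Elements inserted with timestamps 5, 3, and 7 come out as 3 then 5 on `heartbeat(5)`, and the element stamped 7 stays buffered until a later heartbeat covers it.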

Abstract Semantics – Example 1
Select F.clerk, Max(O.cost)
From O [∞], F [Rows 1000]
Where O.orderID = F.orderID
Group By F.clerk
• Maximum-cost order fulfilled by each clerk in last 1000 fulfillments
• At time T: entire stream O and last 1000 tuples of F as relations
• Evaluate query, update result relation at T

Abstract Semantics – Example 1
Select Istream(F.clerk, Max(O.cost))
From O [∞], F [Rows 1000]
Where O.orderID = F.orderID
Group By F.clerk
• At time T: entire stream O and last 1000 tuples of F as relations
• Evaluate query, update result relation at T
• Streamed result: New element (<clerk,max>,T) whenever <clerk,max> changes from T–1

Abstract Semantics – Example 2
Relation CurPrice(stock, price)
Select stock, Avg(price)
From Istream(CurPrice) [Range 1 Day]
Group By stock
• Average price over last day for each stock
• Istream provides history of CurPrice
• Window on history, back to relation, group and aggregate

Concrete Language – CQL
• Relational query language: SQL
• Window spec. language derived from SQL-99
– Tuple-based, time-based, partitioned
• Syntactic shortcuts and defaults
– So easy queries are easy to write and simple queries do what you expect
• Equivalences
– Basis for query-rewrite optimizations
– Includes all relational equivalences, plus new stream-based ones

Two Extremely Simple CQL Examples
Select * From Strm
• Had better return Strm (It does)
– Default [∞] window for Strm
– Default Istream for result
Select * From Strm, Rel Where Strm.A = Rel.B
• Often want “NOW” window for Strm
• But may not want as default

Query Execution
• When a continuous query is registered, generate a query plan
– Users can also register plans directly
• Plans composed of three main components:
– Operators (as in most conventional DBMS’s)
– Inter-operator Queues (as in many conventional DBMS’s)
– State (synopses)
• Global scheduler for plan execution

Operators and State
• State (synopses)
– Summarize tuples seen so far (exact or approximate) for operators requiring history
– To implement windows
• Example: synopsis join
– Sliding-window join
– Approximation of full join
[Diagram: join operator ⋈ with synopses State1 and State2]
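A synopsis join can be sketched as a symmetric join in which each input keeps a bounded synopsis of recent tuples. The Python below is a minimal illustration, assuming tuple-based windows of equal size on both inputs and an equality join on a single key. The class and method names are hypothetical, not STREAM operator names.

```python
from collections import deque


class WindowJoin:
    """Join two streams on a key, remembering only the last `rows` tuples per input."""

    def __init__(self, rows):
        # State1 and State2: one bounded synopsis per input.
        self.state = (deque(maxlen=rows), deque(maxlen=rows))

    def push(self, side, key, payload):
        """Process one arrival on input `side` (0 or 1) and return the new join results."""
        other = self.state[1 - side]
        # Probe the opposite synopsis, keeping the left input's payload first.
        matches = [(payload, p) if side == 0 else (p, payload)
                   for (k, p) in other if k == key]
        # Then remember this arrival; a full deque silently evicts its oldest entry.
        self.state[side].append((key, payload))
        return matches
```

Because evicted tuples can no longer match, the output is a subset of what the full join would produce, which is the sense in which a sliding-window join approximates the full join.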
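
Simple Query Plan
[Diagram: a query plan over Stream1, Stream2, and Stream3 producing results for queries Q1 and Q2. It contains two join operators (⋈), one with synopses State1 and State2 and one with synopses State3 and State4, plus a scheduler.]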

Some Issues in Query Plan Generation
• Compatibility and conversions for streams and relations (+/- streams)
• State sharing, incremental computation
• Windowed joins: Multiway versus 2-way
• Windows in general: push down, pull up, split, merge, …
• Time coordination, operator-level heartbeats

Memory Overhead in Query Processing
• Queues + State
• Continuous queries keep state indefinitely
• Online requirements suggest using memory rather than disk
– But we realize this assumption is shaky
• Goal: minimize memory use while providing timely, accurate answers

Reducing Memory Overhead
Two main techniques to date
1) Exploit constraints on streams to reduce state
2) Clever operator scheduling to reduce queue sizes

Exploiting Stream Constraints
• For most queries, unbounded memory is required for arbitrary streams [PODS ’01]
• But streams may exhibit constraints that reduce, bound, or even eliminate state
• Conventional database constraints
– Redefined for streams
– “Relaxed” for stream environment

Stream Constraints
• Each constraint type defines adherence parameter k
• Clustered(k) for attribute S.A
• Ordered(k) for attribute S.A
• Referential-Integrity(k) for join S1 → S2

Algorithm for Exploiting Constraints
• Input
– Any Select-Project-Join query over streams
– Any set of k-constraints
• Output
– Query execution plan that reduces or eliminates state based on k-constraints
– If constraints violated, get approximate result

Constraint Examples
Orders (orderID, cost)
Fulfillments (orderID, portion, clerk)
Query: Many-one join F → O
• Clustered(k) on F.orderID
– Matched O tuples discarded after k arrivals of non-matching F’s
• Referential-Integrity(k)
– F tuples retained for at most k arrivals of O tuples
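The Clustered(k) rule above can be sketched as a join that counts non-matching Fulfillments arrivals against each matched order. This Python sketch is an illustration under several assumptions that the slides do not spell out: every order arrives before its fulfillments (so no F state is needed), a matched order is dropped once more than k non-matching F tuples have arrived since its last match (the exact off-by-one depends on the precise definition of Clustered(k)), and the class and method names are hypothetical.

```python
class ClusteredJoin:
    """Many-one join of Fulfillments with Orders that prunes O state using Clustered(k)."""

    def __init__(self, k):
        self.k = k
        self.orders = {}   # orderID -> cost: the synopsis for stream O
        self.misses = {}   # orderID -> non-matching F arrivals since its last match

    def on_order(self, order_id, cost):
        self.orders[order_id] = cost

    def on_fulfillment(self, order_id, portion, clerk):
        # Charge this arrival to every other matched order and drop expired ones.
        for oid in list(self.misses):
            if oid != order_id:
                self.misses[oid] += 1
                if self.misses[oid] > self.k:
                    del self.misses[oid]
                    del self.orders[oid]
        if order_id in self.orders:
            self.misses[order_id] = 0  # a match restarts the count
            return (order_id, self.orders[order_id], portion, clerk)
        return None
```

Orders that never match stay in the synopsis in this sketch. If the stream violates the constraint, a late fulfillment finds its order already discarded and the join result is approximate, as the algorithm slide notes.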

Operator Scheduling
• Global scheduler invokes run method of query plan operators with “timeslice” parameter
• Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, …
– First scheduler: round-robin
– Second scheduler: minimize queue sizes
– Third scheduler: minimize combination of queue sizes and latency

Scheduling Algorithm
• Goal: minimize total queue size for unpredictable, bursty stream arrival patterns
• Proven: within constant factor of any “clairvoyant” strategy for some queries
• Empirical results: large savings over naive strategies for many queries
• But minimizing queue sizes is at odds with minimizing latency

Scheduling Algorithm (contd.)
• Always schedule ready chain with steepest downslope
• Plan progress chart: memory delta per unit time
• Independent scheduling of operators makes mistakes
• Consider chains of operators
[Figure: progress chart plotting memory against computation time for operators Op1, Op2, Op3, and Op4]
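The chain idea can be sketched in a few lines: walk the progress chart, repeatedly take the steepest downward segment from the current point to any later point, and treat the operators under that segment as one chain. The Python below is a hedged illustration; the function names, the chart values used in the usage note, and the simplification that a chain is ready when its first operator has queued input are all assumptions rather than details taken from the slides.

```python
def chains(points):
    """Split a progress chart into chains along its lower envelope.

    points[i] = (cumulative computation time, memory) after operator i,
    with points[0] describing the plan input. Operators are numbered from 1.
    Returns a list of (first_op, last_op, slope).
    """
    result, i = [], 0
    while i < len(points) - 1:
        t0, m0 = points[i]
        # Steepest (most negative) slope from point i to any later point.
        j = min(range(i + 1, len(points)),
                key=lambda n: (points[n][1] - m0) / (points[n][0] - t0))
        slope = (points[j][1] - m0) / (points[j][0] - t0)
        result.append((i + 1, j, slope))
        i = j
    return result


def pick_chain(chain_list, ready):
    """Among chains whose first operator has queued input, pick the steepest downslope."""
    candidates = [c for c in chain_list if c[0] in ready]
    return min(candidates, key=lambda c: c[2]) if candidates else None
```

With a made-up chart such as `[(0, 1.0), (1, 0.9), (3, 0.2), (4, 0.1), (8, 0.0)]`, Op1 and Op2 fall into one chain because the drop from the input to the output of Op2 is steeper than the drop across Op1 alone. Scheduling Op1 purely on its own slope would undervalue it, which is the mistake that independent operator scheduling makes.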

Approximation
• Why approximate?
– Memory requirement too high, even with constraints and clever scheduling
– Can’t process streams fast enough for query load

Approximation (cont’d)
• Static: rewrite queries to add (or shrink) sampling or windows
+ User can participate, predictable behavior
– Doesn’t consider dynamic conditions
• Dynamic: modify query plan – insert sampling operators, shrink windows, load shedding
+ Adapts to current resource availability
– How to convey to user? (major open issue)
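As a small illustration of what inserting a sampling operator means, the Python sketch below drops each element independently at random and scales a count back up to compensate. The function names and the scale-up estimator are assumptions chosen for the example; the slides do not say which sampling scheme STREAM uses.

```python
import random


def shed(stream, keep_fraction, rng):
    """Sampling operator: pass each element through with probability keep_fraction."""
    for element in stream:
        if rng.random() < keep_fraction:
            yield element


def scaled_count(stream, keep_fraction, rng):
    """Estimate the number of input elements from the sampled stream."""
    kept = sum(1 for _ in shed(stream, keep_fraction, rng))
    return kept / keep_fraction
```

A dynamic approach would lower `keep_fraction` when resources run short and raise it again when load drops. The estimate becomes noisier as the fraction shrinks, which is part of why conveying the approximation to the user is hard.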

Resource Allocation to Maximize Precision
• Query plan: tree of operators
• Each operator has function from resource allocation R to precision P
• Simple (FP,FN) precision model
– FP = probability of answer being false positive
– FN = # false negatives per correct answer

Precision of an Operator
[Diagram: join operator ⋈ with synopses State1 and State2; its approximate answer is compared with the exact answer]
Precision: FP = False Positives, FN = False Negatives

Precision of Two Operators
[Diagram: a join operator ⋈ with synopses State1 and State2 has precision (FP1,FN1); a second operator with State3 has precision (FP2,FN2). At each operator the approximate answer is compared with the exact answer.]

The Problem Statement
• Given a plan, precision function for each operator in the plan, and R total resources, allocate R to operators to maximize result precision
• Solution
– For each operator type: formula for calculating output precision given input precision and operator resource allocation
– Assume <0,0> precision for input streams
– Becomes optimization problem

The Holy Grail
• Given:
– Declarative query
– Resources
– Constraints on streams
Generate plan and resource allocation that takes advantage of constraints and maximizes precision
• Do it for multiple (weighted) queries, dynamically and adaptively, and convey what’s happening to the user

The Stream Systems Landscape
• (At least) three general-purpose DSMS prototypes underway
– STREAM (Stanford)
– Aurora (Brown, Brandeis, MIT)
– TelegraphCQ (Berkeley)
• All will be demo’d at SIGMOD ’03
• Cooperating to develop stream system benchmark
– Goal: demonstrate that conventional systems are far inferior for data stream applications