query optimization and intro to transactions r&g, chapter 15 chapter 16 lecture 18

Query Optimization and Intro to Transactions

R&G, Chapter 15 and Chapter 16
Lecture 18

Administrivia

• Homework 3 due next Tuesday, March 20, by end of class period

• Homework 4 available on class website
– Implement nested loops and hash join operators for (new!) minibase
– Due date: April 10 (after Spring Break)

• Midterm 2 is 3/22, 1 week from today
– In class, covers lectures 10-17
– Review will be held Tuesday 3/20, 7-9 pm, 306 Soda Hall

• Internships at Google this summer…
– See http://www.postgresql.org/developer/summerofcode
– Booth at the UCB TechExpo this Thursday: http://csba.berkeley.edu/tech_expo.html
– Contact [email protected]

Review

DBMS layers, top to bottom:
– Query Optimization and Execution (we are here)
– Relational Operators
– Files and Access Methods
– Buffer Management
– Disk Space Management
– DB

• A query plan is a tree of operators that computes the result of a query
• Optimization is the process of picking the best plan
• Execution is the process of executing the plan

Example

Sailors:
– Tuples 50 bytes long, 80 tuples/page, 500 pages
– Unclustered B+ and Hash on sid (key)
– 40,000 sid values
– 10 ratings

Reserves:
– Tuples 40 bytes long, 100 tuples/page, 1000 pages
– Clustered B+ tree on bid (key), Unclustered B+ on sid
– 100 distinct bid values

Boats:
– Tuples 50 bytes long, 80 tuples/page, 100 pages
– Unclustered B+, Clustered Hash on color
– 50 distinct color values

SELECT S.sid, COUNT(*) AS number
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red'
GROUP BY S.sid

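[Figure: plan tree for the query. Leaves: Sailors, Reserves, Boats (with selection color=red); joins: sid=sid and bid=bid; above them GROUP BY sid and the projection sid, COUNT(*) AS number.]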

Pass 1: Best access method for each relation

• Sailors
– No predicates, so File Scan is best

• Reserves
– No predicates, so File Scan is best
– What about Index Scan with B+ index on bid for Reserves?
– Keep it in mind… tuples will come out in join order… might come in handy later for a join on bid…


Pass 1: Best access method for each relation

• Boats
– File Scan: 100 I/Os
– B+ Index Scan on color: 2-3 + 80*100/50 = 3 + 160 I/Os = 163 I/Os
– Hash Index Scan on color: 1.2 + (80*100/50)/80 = 1.2 + 2 I/Os = 3.2 I/Os


• Hash Index Scan: cheapest
• File Scan: always keep around just in case
• B+ Index Scan: book says to keep it because of interesting order
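The three Boats costs above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the slide's lookup overheads (about 3 I/Os for the B+ tree descent, 1.2 I/Os for the hash probe) and uniformly distributed colors; the constant and dictionary names are made up for illustration.

```python
# Statistics for Boats, taken from the example schema.
PAGES = 100
TUPLES_PER_PAGE = 80
DISTINCT_COLORS = 50

# Assumed lookup overheads, as used on the slide.
BTREE_LOOKUP = 3      # I/Os to descend the B+ tree (slide says 2-3)
HASH_LOOKUP = 1.2     # I/Os to probe the hash index

total_tuples = PAGES * TUPLES_PER_PAGE               # 8000 boats
matching_tuples = total_tuples // DISTINCT_COLORS    # 160 red boats (uniformity assumption)

costs = {
    # Read every page of the relation.
    "file scan": PAGES,
    # Unclustered index: each matching tuple may sit on a different page.
    "B+ index scan (unclustered)": BTREE_LOOKUP + matching_tuples,
    # Clustered index: matching tuples are packed onto consecutive pages.
    "hash index scan (clustered)": HASH_LOOKUP + matching_tuples / TUPLES_PER_PAGE,
}

cheapest = min(costs, key=costs.get)

for plan, io in costs.items():
    print(f"{plan}: {io} I/Os")
print("cheapest:", cheapest)
```

The unclustered B+ scan ends up more expensive than the file scan, which is why it survives pass 1 only for its ordering, not for its cost.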

Pass 2

• For each of the plans in pass 1:
– generate left-deep plans for joins
– consider different orders and join methods

[Figure: four candidate two-relation join trees: Reserves ⋈ Boats and Boats ⋈ Reserves on bid=bid (with selection color=red on Boats), and Sailors ⋈ Reserves and Reserves ⋈ Sailors on sid=sid.]

• Question: what about Sailors × Boats?
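One way to see which two-relation joins pass 2 generates is to enumerate ordered (outer, inner) pairs and keep only those connected by a join predicate. This is a sketch under the assumption that cross products are pruned, so Sailors × Boats never appears; `RELATIONS`, `JOIN_PREDICATES`, and `two_way_joins` are hypothetical names, not part of any optimizer.

```python
from itertools import permutations

RELATIONS = ["Sailors", "Reserves", "Boats"]

# Join predicates from the WHERE clause, keyed by the unordered pair they connect.
JOIN_PREDICATES = {
    frozenset({"Sailors", "Reserves"}): "S.sid = R.sid",
    frozenset({"Reserves", "Boats"}): "R.bid = B.bid",
}


def two_way_joins(relations, predicates):
    """Return (outer, inner, predicate) for every ordered pair that shares a join predicate."""
    plans = []
    for outer, inner in permutations(relations, 2):
        pred = predicates.get(frozenset({outer, inner}))
        if pred is not None:  # no predicate would mean a cross product, so skip the pair
            plans.append((outer, inner, pred))
    return plans


pass2 = two_way_joins(RELATIONS, JOIN_PREDICATES)
for outer, inner, pred in pass2:
    print(f"{outer} (outer) JOIN {inner} (inner) ON {pred}")
```

A real optimizer would then cost every join method and every retained access path for each of these pairs; the sketch only covers the enumeration step.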

Pass 2

• First consider which pass 1 plans to use for access path

[Figure: join trees for Reserves ⋈ Boats and Boats ⋈ Reserves on bid=bid, with selection color=red on Boats.]

Boats as outer:
1. Hash index(color) Boats, File Scan Reserves
2. B+(color) Boats, File Scan Reserves

Reserves as outer:
• File Scan Reserves, File Scan Boats

Plans retained from pass 1:
– Boats: B+ tree(color), Hash(color), File Scan
– Sailors: File Scan
– Reserves: File Scan

Pass 2

• First consider which pass 1 plans to use for access path

[Figure: join trees for Sailors ⋈ Reserves and Reserves ⋈ Sailors on sid=sid.]

Sailors as outer:
• File Scan Sailors, File Scan Reserves
• B+(sid), File Scan Reserves

Reserves as outer:
• File Scan Reserves, File Scan Sailors

Plans retained from pass 1:
– Boats: B+ tree(color), Hash(color), File Scan
– Sailors: File Scan
– Reserves: File Scan

• Note: book also includes these plans:
– File Scan Sailors (outer) with Boats (inner)
– Boats hash on color with Sailors (inner)
– Boats B-tree on color with Sailors (inner)

• Would you agree these should be considered?

Pass 2

• Now consider join methods and consider all access paths for inner

[Figure: join trees for Reserves ⋈ Boats and Boats ⋈ Reserves on bid=bid, with selection color=red on Boats.]

Boats as outer:
1. Hash index(color) Boats, File Scan Reserves
2. B+(color) Boats, File Scan Reserves

Reserves as outer:
• File Scan Reserves, File Scan Boats

Available indexes:
– Reserves: Clustered B+ tree on bid (key), Unclustered B+ on sid
– Boats: Unclustered B+, Clustered Hash on color, 50 distinct color values

If you replace the file scan of Reserves with a B+ index scan, Index Nested Loops is a good choice!

Sort-Merge could also be a good choice because tuples come out in bid order.

Pass 2

• Now consider join methods

[Figure: join trees for Sailors ⋈ Reserves and Reserves ⋈ Sailors on sid=sid.]

Sailors as outer:
• File Scan Sailors, File Scan Reserves
• B+(sid), File Scan Reserves

Reserves as outer:
• File Scan Reserves, File Scan Sailors

Exercise: Which plans would you keep for this set of joins?

Pass 3 and beyond

• For each of the plans retained from Pass 2, taken as the outer, generate plans for the next join
– For example, let's take the sort-merge plan for Boats ⋈ Reserves and add in Sailors:

[Figure: left-deep plan. Lower join (from pass 2): Sort-merge on bid=bid between Boats (color=red, access label "Hash index(color) Boats") and Reserves (access label "B+ index(bid) Reserves"). Upper join: Sort-merge on sid=sid with Sailors (access label "B+ index(sid) Reserves").]

Note that this plan will produce tuples in sid order, an interesting order for the upcoming GROUP BY.

• GROUP BY, ORDER BY, AGGREGATES are all considered after the join plans are chosen.

Nested Queries (Subqueries)

Correlated:
SELECT S.sname
FROM Sailors S
WHERE EXISTS (SELECT *
              FROM Reserves R
              WHERE R.bid=103 AND R.sid=S.sid)

Uncorrelated:
SELECT S.sname
FROM Sailors S
WHERE EXISTS (SELECT *
              FROM Reserves R
              WHERE R.day = '01/31/07')

• Nested queries are parsed into their own 'query block' and optimized separately.

• Two kinds: Uncorrelated and Correlated

• An uncorrelated subquery can be computed once per query.
-> Optimizer can choose a plan for the subquery and add a temp operator to 'cache' the subquery results for use in the rest of the query

[Figure: Subquery plan → TEMP → Plan for outer query]

Nested Queries

Correlated:

• Optimizer chooses a plan for the subquery, and treats it as a subroutine to be invoked once per tuple produced by the outer query block.

Nested block to optimize:
SELECT *
FROM Reserves R
WHERE R.bid=103 AND R.sid = outer value

[Figure: Plan for outer query → Subquery plan, invoked once per tuple produced by outer query block]
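The difference between the two evaluation strategies shows up if we count how often each subquery runs. The sketch below uses plain Python lists in place of relations; the rows and the `calls` counter are invented for illustration and do not come from the lecture.

```python
# Made-up sample data: Sailors(sid, sname) and Reserves(sid, bid, day).
sailors = [(22, "Dustin"), (31, "Lubber"), (58, "Rusty")]
reserves = [(22, 103, "01/31/07"), (58, 103, "02/14/07"), (31, 101, "01/31/07")]

calls = {"uncorrelated": 0, "correlated": 0}


def uncorrelated_subquery():
    """EXISTS (SELECT * FROM Reserves R WHERE R.day = '01/31/07')"""
    calls["uncorrelated"] += 1
    return any(day == "01/31/07" for _, _, day in reserves)


def correlated_subquery(outer_sid):
    """EXISTS (SELECT * FROM Reserves R WHERE R.bid = 103 AND R.sid = outer value)"""
    calls["correlated"] += 1
    return any(bid == 103 and sid == outer_sid for sid, bid, _ in reserves)


# Uncorrelated: compute once, keep the answer (the TEMP step), reuse it for every outer tuple.
cached = uncorrelated_subquery()
uncorrelated_result = [sname for _, sname in sailors if cached]

# Correlated: the subquery is a subroutine invoked once per outer tuple.
correlated_result = [sname for sid, sname in sailors if correlated_subquery(sid)]

print(uncorrelated_result, calls["uncorrelated"])
print(correlated_result, calls["correlated"])
```

With three sailors the correlated subquery runs three times while the uncorrelated one runs once, which is the cost difference the optimizer has to account for.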

Nested Queries

• Sometimes correlated subqueries can be rewritten as a non-correlated query
– When rewriting, need to be careful to preserve semantics

Equivalent non-nested query:
SELECT DISTINCT S.sname
FROM Sailors S, Reserves R
WHERE S.sid=R.sid AND R.bid=103

The correlated query returns at most 1 tuple per sailor; need to add DISTINCT to ensure the same semantics in the rewritten query.

• Rewritten query gives the optimizer more choices for optimization, and it is more likely to choose a good plan
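The role of DISTINCT in the rewrite can be checked on a small in-memory database. This is a sketch using Python's standard `sqlite3` module with invented rows; one sailor reserves boat 103 twice so that the join without DISTINCT visibly differs from the EXISTS form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sailors (sid INTEGER PRIMARY KEY, sname TEXT);
    CREATE TABLE Reserves (sid INTEGER, bid INTEGER, day TEXT);
    INSERT INTO Sailors VALUES (22, 'Dustin'), (31, 'Lubber'), (58, 'Rusty');
    -- Dustin reserves boat 103 twice, so a plain join produces him twice.
    INSERT INTO Reserves VALUES (22, 103, '01/31/07'), (22, 103, '02/01/07'),
                                (58, 103, '01/31/07'), (31, 101, '01/31/07');
""")

NESTED = """
    SELECT S.sname FROM Sailors S
    WHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid = 103 AND R.sid = S.sid)
"""
JOIN_WITHOUT_DISTINCT = """
    SELECT S.sname FROM Sailors S, Reserves R
    WHERE S.sid = R.sid AND R.bid = 103
"""
JOIN_WITH_DISTINCT = """
    SELECT DISTINCT S.sname FROM Sailors S, Reserves R
    WHERE S.sid = R.sid AND R.bid = 103
"""


def names(sql):
    """Run a query and return the sname column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


nested_rows = names(NESTED)
plain_join_rows = names(JOIN_WITHOUT_DISTINCT)
distinct_join_rows = names(JOIN_WITH_DISTINCT)
print(nested_rows, plain_join_rows, distinct_join_rows)
```

The sample names are unique on purpose. If two different sailors shared a name, DISTINCT on sname would collapse them while the EXISTS form would return both, which is one more reason the rewrite needs care.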

Taking it one step further…

• My area of research is data integration. This is a very common case in the world today:

[Figure: the DBMS layer stack (Query Optimization and Execution, Relational Operators, Files and Access Methods, Buffer Management, Disk Space Management) on top of several stores labeled DB, DB21, DB22, Oracle, Postgres, Other DBs, Files, Spreadsheets…]

Query over multiple DBMS's: What if I could use the query planning and execution of a DBMS to query over multiple, different DBMSs, and other data sources as well?

Taking it one step further…

• Heterogeneous Data Integration leads to all sorts of fun with optimization…
– Compensation becomes possible…
  • You can do complex SQL queries over simple data sources… even if a spreadsheet can't do joins, the DBMS optimizing the query can!
– Knowing what other sources can do becomes a factor
  • Oracle, DB2, Postgres all support slightly different versions of SQL
  • If the data source isn't a DBMS, how do I know what operations it can perform (e.g., can a spreadsheet do projects? Filters? Joins?)
– Where to do the work becomes a factor…
  • e.g., Should I always push as much of the query as possible to another DBMS?
– Network cost becomes a factor…
  • When do we ship tuples from the original source to the DBMS processing the complete query?

Summary

• Query optimization is an important task in a relational DBMS.

• Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).

• Two parts to optimizing a query:
– Consider a set of alternative plans.
  • Must prune search space; typically, left-deep plans only.
– Must estimate cost of each plan that is considered.
  • Must estimate size of result and cost for each plan node.
  • Key issues: Statistics, indexes, operator implementations.

Transaction Management Overview

R & G Chapter 16

There are three side effects of acid. Enhanced long term memory, decreased short term memory, and I forget the third.

- Timothy Leary

Components of a DBMS

DBMS: a set of cooperating software modules

[Figure: architecture diagram with the labels query, transaction, Data Definition, Query Compiler, Execution Engine, Transaction Manager, Logging/Recovery, Concurrency Control, LOCK TABLE, Schema Manager, Buffer Manager, BUFFER POOL, BUFFERS, Storage Manager.]

Concurrency Control & Recovery

• Very valuable properties of DBMSs
– without these, DBMSs would be much less useful

• Based on concept of transactions with ACID properties (yep… they're baa-aack…)

Statement of Problem

• Concurrent execution of independent transactions
– utilization/throughput ("hide" waiting for I/Os)
– response time
– fairness

• Example:

       T1                     T2
t0:    tmp1 := read(X)
t1:                           tmp2 := read(X)
t2:    tmp1 := tmp1 - 20
t3:                           tmp2 := tmp2 + 10
t4:    write tmp1 into X
t5:                           write tmp2 into X

T2 wins, but if there was a slight delay, maybe T1 would win
-> DBMS wants to ensure a predictable outcome for concurrent users
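The race in this example can be replayed deterministically by treating a schedule as an explicit list of steps. A minimal sketch, assuming X starts at 100 (the slide gives no initial value); the `run` helper and the step names are hypothetical.

```python
def run(schedule, x):
    """Execute a list of (transaction, step) pairs against a single item X."""
    db = {"X": x}
    tmp = {}                           # each transaction's private variable
    delta = {"T1": -20, "T2": +10}     # T1 subtracts 20, T2 adds 10
    for txn, step in schedule:
        if step == "read":
            tmp[txn] = db["X"]
        elif step == "compute":
            tmp[txn] += delta[txn]
        elif step == "write":
            db["X"] = tmp[txn]
    return db["X"]


STEPS = ["read", "compute", "write"]

# Serial execution: all of T1, then all of T2.
serial = [("T1", s) for s in STEPS] + [("T2", s) for s in STEPS]

# The interleaving from the example: both read before either writes, T2 writes last.
t2_wins = [("T1", "read"), ("T2", "read"), ("T1", "compute"),
           ("T2", "compute"), ("T1", "write"), ("T2", "write")]

# Same interleaving with T2's write slightly earlier, so T1 writes last.
t1_wins = t2_wins[:4] + [("T2", "write"), ("T1", "write")]

print(run(serial, 100), run(t2_wins, 100), run(t1_wins, 100))
```

Starting from 100, the serial run ends at 90, while the two interleavings end at 110 and 80: in each case one transaction's update is silently lost, and which one depends only on timing.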

Statement of problem (cont.)

• Arbitrary interleaving can lead to
– Temporary inconsistency (ok, unavoidable)
– "Permanent" inconsistency (bad!)

• Need formal correctness criteria.

Definitions

• A program may carry out many operations on the data retrieved from the database

• However, the DBMS is only concerned about what data is read/written from/to the database.

• transaction - a sequence of read and write operations (read(A), write(B), …)
– DBMS's abstract view of a user program

Correctness criteria: The ACID properties

• Atomicity: All actions in the Xact happen, or none happen.

• Consistency: If each Xact is consistent, and the DB starts consistent, it ends up consistent.

• Isolation: Execution of one Xact is isolated from that of other Xacts.

• Durability: If a Xact commits, its effects persist.

Atomicity of Transactions

• Two possible outcomes of executing a transaction:
– Xact might commit after completing all its actions
– or it could abort (or be aborted by the DBMS) after executing some actions.

• DBMS guarantees that Xacts are atomic.
– From user's point of view: Xact always either executes all its actions, or executes no actions at all.

Mechanisms for Ensuring Atomicity

• Main approach: LOGGING
– DBMS logs all actions so that it can undo the actions of aborted transactions.

• Logging used by modern systems, because of need for audit trail and for efficiency reasons.
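The idea of logging for undo can be illustrated with a toy in-memory store that records the old value before every write. This is only a sketch of the principle, not how a real recovery manager works (no disk, no concurrency, one active transaction); `TinyDB` and its methods are hypothetical names.

```python
class TinyDB:
    """Toy key-value store with an undo log for a single active transaction."""

    def __init__(self, data):
        self.data = dict(data)
        self.undo_log = []              # (key, old_value) records

    def write(self, key, value):
        # Log the old value before overwriting it, so the write can be undone.
        # Assumes the key already exists in the store.
        self.undo_log.append((key, self.data[key]))
        self.data[key] = value

    def commit(self):
        # The transaction's effects stay; the undo records are no longer needed.
        self.undo_log.clear()

    def abort(self):
        # Undo in reverse order so each item ends at its oldest logged value.
        while self.undo_log:
            key, old_value = self.undo_log.pop()
            self.data[key] = old_value


db = TinyDB({"A": 1000, "B": 1000})

# Aborted transfer: A is credited, then the transaction aborts before debiting B.
db.write("A", db.data["A"] + 100)
db.abort()
state_after_abort = dict(db.data)

# Committed transfer: both writes happen and persist.
db.write("A", db.data["A"] + 100)
db.write("B", db.data["B"] - 100)
db.commit()
state_after_commit = dict(db.data)

print(state_after_abort, state_after_commit)
```

After the abort the store looks as if the transaction had never run, which is the "executes no actions at all" half of atomicity.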

Transaction Consistency

• "Consistency" - data in DBMS is accurate in modeling real world and follows integrity constraints

• User must ensure that the transaction is consistent by itself
– I.e., if the DBMS is consistent before the Xact, it will be after also

• Key point:

[Figure: consistent database S1 → transaction T → consistent database S2]

Transaction Consistency (cont.)

• Recall: Integrity constraints
– must be true for DB to be considered consistent
– Examples:
  1. FOREIGN KEY R.sid REFERENCES S
  2. ACCT-BAL >= 0

• System checks ICs and if they fail, the transaction rolls back (i.e., is aborted).
– Beyond this, DBMS does not understand the semantics of the data.
– e.g., it does not understand how interest on a bank account is computed
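The ACCT-BAL >= 0 constraint and the resulting rollback can be tried out with Python's standard `sqlite3` module. A sketch under a few assumptions: the `accounts` table and `transfer` helper are invented here, and in SQLite a failed CHECK only aborts the offending statement, so the rollback of the whole transaction is performed by the connection's context manager when the exception escapes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, bal INTEGER CHECK (bal >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('A', 1000), ('B', 50)")
conn.commit()


def transfer(src, dst, amount):
    """Move amount from src to dst; return False if an integrity constraint fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.execute("UPDATE accounts SET bal = bal + ? WHERE name = ?", (amount, dst))
            conn.execute("UPDATE accounts SET bal = bal - ? WHERE name = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    return dict(conn.execute("SELECT name, bal FROM accounts"))


# B only has 50, so debiting 100 violates the CHECK; the credit to A is undone too.
failed_ok = transfer("B", "A", 100)
after_failed = balances()

# A has enough funds, so this transfer commits.
second_ok = transfer("A", "B", 100)
after_second = balances()

print(failed_ok, after_failed, second_ok, after_second)
```

Note that the database only enforces the declared constraint. Whether 100 was the right amount to transfer is still the transaction's responsibility.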

Isolation of Transactions

• Users submit transactions, and
• Each transaction executes as if it was running by itself.
– Concurrency is achieved by DBMS, which interleaves actions (reads/writes of DB objects) of various transactions.

• Many techniques have been developed. Fall into two basic categories:
– Pessimistic: don't let problems arise in the first place
– Optimistic: assume conflicts are rare, deal with them after they happen.

Example

• Consider two transactions (Xacts):

T1: BEGIN A=A+100, B=B-100 END
T2: BEGIN A=1.06*A, B=1.06*B END

• 1st xact transfers $100 from B's account to A's
• 2nd credits both accounts with 6% interest.
• Assume at first A and B each have $1000. What are the legal outcomes of running T1 and T2?
– $2000 * 1.06 = $2120
• There is no guarantee that T1 will execute before T2 or vice-versa, if both are submitted together. But, the net effect must be equivalent to these two transactions running serially in some order.

Example (Contd.)

• Legal outcomes: A=1166, B=954 or A=1160, B=960
• Consider a possible interleaved schedule:

T1: A=A+100,              B=B-100
T2:            A=1.06*A,             B=1.06*B

This is OK (same as T1;T2). But what about:

T1: A=A+100,                         B=B-100
T2:            A=1.06*A, B=1.06*B

• Result: A=1166, B=960; A+B = 2126, bank loses $6

• The DBMS's view of the second schedule:

T1: R(A), W(A),                         R(B), W(B)
T2:             R(A), W(A), R(B), W(B)
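The numbers in this example are easy to verify by running each schedule as a list of actions on a starting state of A = B = 1000. A minimal sketch; the function names are made up, and the 6% interest is computed with integer arithmetic to avoid floating-point rounding.

```python
def t1_credit_a(db):
    db["A"] += 100                      # T1: A = A + 100


def t1_debit_b(db):
    db["B"] -= 100                      # T1: B = B - 100


def t2_interest_a(db):
    db["A"] = db["A"] * 106 // 100      # T2: A = 1.06 * A


def t2_interest_b(db):
    db["B"] = db["B"] * 106 // 100      # T2: B = 1.06 * B


def run(schedule):
    """Apply the actions in order, starting from A = B = 1000."""
    db = {"A": 1000, "B": 1000}
    for action in schedule:
        action(db)
    return db["A"], db["B"]


serial_t1_t2 = run([t1_credit_a, t1_debit_b, t2_interest_a, t2_interest_b])
serial_t2_t1 = run([t2_interest_a, t2_interest_b, t1_credit_a, t1_debit_b])

# First interleaved schedule: same result as T1;T2.
good = run([t1_credit_a, t2_interest_a, t1_debit_b, t2_interest_b])

# Second interleaved schedule: T2 sees the new A but the old B.
bad = run([t1_credit_a, t2_interest_a, t2_interest_b, t1_debit_b])

print(serial_t1_t2, serial_t2_t1, good, bad, sum(bad))
```

Both serial orders total $2120; the second interleaving totals $2126 and matches neither serial result, so it is not a legal outcome.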

Formal Properties of Schedules

• Serial schedule: Schedule that does not interleave the actions of different transactions.

• Equivalent schedules: For any database state, the effect of executing the first schedule is identical to the effect of executing the second schedule.

• Serializable schedule: A schedule that is equivalent to some serial execution of the transactions.

(Note: If each transaction preserves consistency, every serializable schedule preserves consistency. )
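To make these definitions concrete, the sketch below enumerates every interleaving of the two bank transactions from the example and checks which ones end in the same state as some serial order. It is only an approximation of the definition: equivalence is required for any database state, while this check uses a single starting state (A = B = 1000), so passing it is necessary but not sufficient. All names are hypothetical.

```python
from itertools import combinations

# Each action is (item, update function); integer arithmetic stands in for 6% interest.
T1 = [("A", lambda v: v + 100), ("B", lambda v: v - 100)]
T2 = [("A", lambda v: v * 106 // 100), ("B", lambda v: v * 106 // 100)]


def interleavings(t1, t2):
    """Yield every merge of t1 and t2 that keeps each transaction's own action order."""
    n = len(t1) + len(t2)
    for slots in combinations(range(n), len(t1)):   # positions taken by t1's actions
        i1, i2 = iter(t1), iter(t2)
        yield [next(i1) if pos in slots else next(i2) for pos in range(n)]


def run(schedule, initial):
    db = dict(initial)
    for item, update in schedule:
        db[item] = update(db[item])
    return db


def matches_a_serial_order(schedule, initial):
    """True if the schedule's final state equals that of T1;T2 or T2;T1."""
    serial_results = [run(T1 + T2, initial), run(T2 + T1, initial)]
    return run(schedule, initial) in serial_results


INITIAL = {"A": 1000, "B": 1000}
all_schedules = list(interleavings(T1, T2))
passing = [s for s in all_schedules if matches_a_serial_order(s, INITIAL)]

print(len(all_schedules), "schedules,", len(passing), "match a serial order")
```

Of the six possible schedules, the two serial ones and two interleaved ones pass; the remaining two mix T1-first on one account with T2-first on the other and produce totals no serial order can.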

