10.1.1.48.3822

7/28/2019 10.1.1.48.3822


An Algebraic Framework for Physical OODB Design

satisfying all the design requirements. Then, the database implementor specifies the physical design in such a way that the performance of the resulting system is acceptable for the needs of this application. This person is also responsible for tuning the database to cope with new performance requirements. Finally, the application programmer submits a logical query against the database without any knowledge of the physical design. The query translator translates the query into a physical plan that reflects the physical design and ideally runs faster than any other equivalent plan. The query evaluator executes this plan and returns the result to the application programmer.

Query translation in our framework is purely algebraic and can be easily validated for correctness. In our framework, the physical database design has an internal schema that specifies the structure of the internal database state, an abstraction function [11] that maps the internal schema into the conceptual schema, and a set of constraints that capture the alternative access paths (such as secondary indices, materialized functions and views). The abstraction function is a logical view of the physical database. This function always exists, since otherwise there would be some semantic information lost when the conceptual database is mapped into the physical storage. Given the conceptual schema of an OODB and a set of physical design directives, we have an automated method for generating the physical schema, the abstraction function, and the plan transformers (this is the optimizer generation component in Figure 1). This method is the focus of the paper. It is expressed in rule form, requiring only one rule per physical design directive, and allows extensions to more complex physical design methods.

Our physical design framework requires that both conceptual and physical data structures, as well as the operations upon them, be defined in the same language. The language used in this paper is called the monoid comprehension calculus [9, 10] because it is based on monoids and monoid comprehensions. Logical collection types, such as sets, lists, and bags, as well as physical data types, such as B-trees and hash tables, can be captured as monoids.

Logical queries are equivalent to queries against the conceptual database built from the internal database via the abstraction function. That is, any logical query can be transformed to a program that manipulates the physical database if we replace all references to the conceptual database state in the query with the logical view of the physical database state. The query translation process in our framework consists of substituting R(DB) for all occurrences of db in a logical query and normalizing the resulting program, where db is the conceptual database state, DB is the physical database state, and R is the abstraction function (this is the composition component in Figure 1).

We give a normalization algorithm that removes all the unnecessary intermediate logical structures, in such a way that the resulting normalized program does not actually materialize any part of the conceptual database. The resulting program (the physical plan in Figure 1) is thus a query that directly manipulates the physical database. That is, if the abstraction function is expressed in the monoid calculus, then any query in the monoid calculus that manipulates the conceptual database can be efficiently translated into a query that manipulates only the physical database. Even though the abstraction function builds the entire conceptual database from the physical database, no part of this construction will actually take place if we normalize the resulting query. The normalization algorithm is purely algebraic, simple, and efficient.

Access path selection is achieved by substituting C_i(DB) for DB in the derived physical plan, where C_i is a plan transformer, and then normalizing the resulting program (this step is the plan generation component in Figure 1). This phase can be combined with the application of commutativity and associativity rules for monoid comprehensions. There is no need to use a rewrite system for these transformations, since we only use three types of rules: an application of a plan transformer, associativity, and commutativity. In fact, an optimizer based on dynamic programming, such as the one for System R [19], would be sufficient for our purpose. In that case, the costing component in Figure 1 could be combined with the plan generation component.

In addition to query translation, in this paper we report an automated method for translating database updates against the conceptual database state into updates against the physical database.

The contributions of this paper are twofold. First, we present a declarative language for specifying physical design directives for an OODB management system that captures many recent proposals for OODB physical design. Second, we present a method for translating these directives into a form that facilitates an automated translation of logical queries and updates. The program translation as well as the elimination of the intermediate logical structures in the resulting program is based on a formal model.

5th International Workshop on Database Programming Languages, Gubbio, Italy, 1995



[Figure 1 (diagram not reproduced). Component labels: conceptual schema, physical design, optimizer generation, abstraction function R, plan transformers C_1, ..., C_n, query, composition, normalization, physical plan, plan generation, alternative physical plans, costing, best plan.]

Figure 1: The Query Translation Architecture

2 Background

Queries in our framework are transformed into physical plans by a number of refinement steps. Thus, they need to be compiled into an algebraic form that captures both logical and physical operators. More importantly, the algebraic forms derived after query translation need to be normalized in a way that no intermediate logical structures are constructed during the evaluation of these forms. In this section we give a brief overview of the monoid comprehension calculus, which fulfills these two requirements. For a complete formal description of the calculus, which includes advanced data structures such as vectors, matrices and object identity, the reader is referred to our previous work [9, 10].

2.1 The Monoid Comprehension Calculus

A data type $T$ in our calculus is expressed as a monoid $M$ with a unit function:

$$M = (T,\ \mathit{zero},\ \mathit{unit},\ \mathit{merge})$$

where the function $\mathit{merge}$, of type $T \times T \to T$, is associative with left and right identity $\mathit{zero}$. If in addition $\mathit{merge}$ is commutative (idempotent, i.e., $\forall x: \mathit{merge}(x, x) = x$), then the monoid is commutative (idempotent). For example, $(\mathrm{set}(\alpha),\ \{\},\ f,\ \cup)$, where $f(x) = \{x\}$, is a commutative and idempotent monoid, while $(\mathrm{int},\ 0,\ g,\ +)$, where $g(x) = x$, is a commutative monoid. When necessary to distinguish the components of a particular monoid $M$, we qualify them as $\mathit{zero}^M$, $\mathit{unit}^M$, and $\mathit{merge}^M$.
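The definition above can be illustrated with a minimal Python sketch. The `Monoid` class, the `fold` helper, and the encodings of set, sum, and list are assumptions made for illustration; they are not part of the calculus or of any implementation described in the paper.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable


@dataclass(frozen=True)
class Monoid:
    """Hypothetical encoding of M = (T, zero, unit, merge)."""
    zero: Any
    unit: Callable[[Any], Any]
    merge: Callable[[Any, Any], Any]
    commutative: bool = False
    idempotent: bool = False


# (set(alpha), {}, f, union) with f(x) = {x}: commutative and idempotent.
SET = Monoid(frozenset(), lambda x: frozenset([x]), lambda a, b: a | b, True, True)
# (int, 0, g, +) with g(x) = x: commutative but not idempotent.
SUM = Monoid(0, lambda x: x, lambda a, b: a + b, True, False)
# Lists under append: neither commutative nor idempotent.
LIST = Monoid((), lambda x: (x,), lambda a, b: a + b)


def fold(m: Monoid, xs):
    """Merge the units of all elements, starting from the monoid's zero."""
    return reduce(m.merge, (m.unit(x) for x in xs), m.zero)
```

For instance, `fold(SET, [1, 2, 1])` collapses the duplicate because set union is idempotent, while `fold(LIST, [1, 2, 1])` keeps all three elements in order.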




Rules 2 and 3 reduce a comprehension in which the leftmost qualifier is a filter, while Rules 4-6 reduce a comprehension in which the leftmost qualifier is a generator.

This definition of a comprehension provides an equational theory that allows us to prove the soundness of various transformations, including the translation of comprehensions into efficient joins.

The monoid comprehension is the only form of bulk manipulation of collection types supported in our calculus. But monoid comprehensions are very expressive. In fact, a small subset of these forms, namely the monoid comprehensions from sets to sets, captures precisely the nested relational algebra (since they are equivalent to the set monad comprehensions [6]). For example, the nesting operator for nested relations is

$$\mathrm{nest}(k)\,x = \mathrm{set}\{\, \langle \mathrm{KEY} = k(e),\ P = \mathrm{set}\{\, a \mid a \leftarrow x,\ k(e) = k(a) \,\} \rangle \mid e \leftarrow x \,\}$$

Similarly, the unnesting operator is

$$\mathrm{unnest}(x) = \mathrm{set}\{\, e \mid s \leftarrow x,\ e \leftarrow s.P \,\}$$

The last comprehension is an example of a dependent join in which the value of the second collection $s.P$ depends on the value of $s$, an element of the first relation $x$. Dependent joins are a convenient way of traversing nested collections.
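As a rough illustration, the two operators can be mimicked with Python set comprehensions. This is only a sketch: the record ⟨KEY, P⟩ is encoded as a `(key, frozenset)` tuple, which is an assumption of this example and not the paper's representation.

```python
def nest(k, x):
    """set{ <KEY = k(e), P = set{ a | a <- x, k(e) = k(a) }> | e <- x }."""
    return frozenset(
        (k(e), frozenset(a for a in x if k(e) == k(a)))
        for e in x
    )


def unnest(x):
    """set{ e | s <- x, e <- s.P }: a dependent join, since the inner
    generator ranges over the P component of the outer element s."""
    return frozenset(e for (_key, part) in x for e in part)


rows = frozenset({("a", 1), ("a", 2), ("b", 3)})
nested = nest(lambda r: r[0], rows)
```

Here `unnest(nested)` gives back `rows`, because every element lands in exactly one group.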

But monoid comprehensions go beyond the nested relational algebra to capture operations over multiple collection types, such as the join of a list with a bag that returns a set, plus predicates and aggregates. For example,

$$\mathrm{set}\{\, (x, y) \mid x \leftarrow [1, 2],\ y \leftarrow \{\!\{3, 4, 3\}\!\} \,\} = \{ (1, 3), (1, 4), (2, 3), (2, 4) \}$$

Another example is $\mathrm{sum}\{\, a \mid a \leftarrow [1, 2, 3],\ a \ge 2 \,\}$, which returns 5, the sum of all list elements greater than or equal to 2. They can also capture physical algorithms, such as the merge join:

$$\mathrm{sorted}[f]\{\, a \mid a \leftarrow x,\ b \leftarrow y,\ f(a) = g(b) \,\}$$

where $x$ is an instance of a $\mathrm{sorted}[f]$ monoid and $y$ of a $\mathrm{sorted}[g]$ monoid ($f$ and $g$ are not necessarily the same). That is, this comprehension behaves exactly like a merge join: it receives two sorted lists as input and it generates a sorted list as output. Even though the naive interpretation of this program derived from the comprehension definition (Rules 1 through 6) is quadratic, we will see later that there are some effective ways of assigning specialized execution algorithms to these programs. In that case, the program will be a real merge join. This assignment to efficient execution algorithms is possible by examining the types of the generator domains in a comprehension.
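The difference between the naive reading and a specialized algorithm can be sketched in Python. Both functions below are illustrative assumptions: sorted sequences are modeled as plain lists that keep duplicates, and `merge_join` is one standard way to write a merge join, not the execution algorithm of the paper.

```python
def naive_join(x, y, f, g):
    """Direct reading of sorted[f]{ a | a <- x, b <- y, f(a) = g(b) }:
    a nested loop, quadratic in the input sizes."""
    return [a for a in x for b in y if f(a) == g(b)]


def merge_join(x, y, f, g):
    """Same result, assuming x is sorted by f and y is sorted by g.
    The cursor j only moves forward, so keys below f(a) are never rescanned."""
    out, j = [], 0
    for a in x:
        # Skip y elements whose key is smaller than the current key of x.
        while j < len(y) and g(y[j]) < f(a):
            j += 1
        # Emit a once per matching y element without consuming the matches,
        # so a later x element with the same key still sees them.
        k = j
        while k < len(y) and g(y[k]) == f(a):
            out.append(a)
            k += 1
    return out


x = [1, 2, 2, 4]
y = [(2, "p"), (2, "q"), (3, "r"), (4, "s")]
ident = lambda a: a
first = lambda b: b[0]
```

On these inputs both functions return the same list, which is the property the type-directed assignment of algorithms relies on.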

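The following are some more examples of comprehensions:

$\mathrm{filter}(p)(x) = \mathrm{set}\{\, e \mid e \leftarrow x,\ p(e) \,\}$
$\mathrm{flatten}(x) = \mathrm{set}\{\, e \mid s \leftarrow x,\ e \leftarrow s \,\}$
$x \cap y = \mathrm{set}\{\, e \mid e \leftarrow x,\ e \in y \,\}$
$\mathrm{length}(x) = \mathrm{sum}\{\, 1 \mid e \leftarrow x \,\}$
$\mathrm{sum}(x) = \mathrm{sum}\{\, e \mid e \leftarrow x \,\}$
$\mathrm{count}(x, a) = \mathrm{sum}\{\, 1 \mid e \leftarrow x,\ e = a \,\}$
$\exists a \in x: e = \mathrm{some}\{\, e \mid a \leftarrow x \,\}$
$\forall a \in x: e = \mathrm{all}\{\, e \mid a \leftarrow x \,\}$
$a \in x = \mathrm{some}\{\, a = e \mid e \leftarrow x \,\}$

The expression $\mathrm{sum}(x)$ adds the elements of any non-idempotent monoid $x$, e.g., $\mathrm{sum}([1, 2, 3]) = 6$. The expression $\mathrm{count}(x, a)$ counts the number of occurrences of $a$ in the bag $x$, e.g., $\mathrm{count}(\{\!\{1, 2, 1\}\!\}, 1) = 2$.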
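Several of these definitions have close counterparts in Python comprehensions, which gives a quick way to check the worked values. The function names below are chosen for this sketch only, and bags are modeled as Python lists.

```python
def filter_(p, x):
    return {e for e in x if p(e)}            # set{ e | e <- x, p(e) }


def flatten(x):
    return {e for s in x for e in s}         # set{ e | s <- x, e <- s }


def intersect(x, y):
    return {e for e in x if e in y}          # set{ e | e <- x, e in y }


def length(x):
    return sum(1 for _ in x)                 # sum{ 1 | e <- x }


def count(x, a):
    return sum(1 for e in x if e == a)       # sum{ 1 | e <- x, e = a }


def member(a, x):
    return any(a == e for e in x)            # some{ a = e | e <- x }


# Join of a list with a bag that returns a set.
pairs = {(x, y) for x in [1, 2] for y in [3, 4, 3]}
# Sum of all list elements greater than or equal to 2.
total = sum(a for a in [1, 2, 3] if a >= 2)
```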

The calculus has a semantic well-formedness requirement that a comprehension be over an idempotent or commutative monoid if any of its generators are over idempotent or commutative monoids. For example, $\mathrm{list}\{\, x \mid x \leftarrow \{1, 2\} \,\}$ is not a valid monoid comprehension, since it maps a set (which is both commutative and idempotent) to a list (which is neither commutative nor idempotent), while $\mathrm{sum}\{\, x \mid x \leftarrow \{\!\{1, 2\}\!\} \,\}$ is valid (since both bag and sum are commutative). This requirement can be easily checked during compile time [9].
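One way to read this requirement is property by property: every property (commutativity, idempotence) held by a generator's monoid must also hold for the output monoid. The sketch below encodes that reading; the property table and the function name are assumptions of this example, and the actual compile-time check of [9] may be formulated differently.

```python
# "C" = commutative, "I" = idempotent (assumed property table).
PROPS = {
    "list": set(),
    "bag": {"C"},
    "sum": {"C"},
    "set": {"C", "I"},
    "some": {"C", "I"},
    "all": {"C", "I"},
}


def well_formed(output, generators):
    """True if the output monoid has every property of each generator monoid."""
    return all(PROPS[g] <= PROPS[output] for g in generators)
```

Under this reading, `well_formed("list", ["set"])` is rejected and `well_formed("sum", ["bag"])` is accepted, matching the two examples in the text.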

We will use the following convention to represent variable bindings in a comprehension:

$$M\{\, e \mid r,\ x \equiv u,\ s \,\} \longrightarrow M\{\, e[u/x] \mid r,\ s[u/x] \,\} \qquad (7)$$

where $e[u/x]$ is the expression $e$ with $u$ substituted for all the free occurrences of $x$ (i.e., $e[u/x]$ is equivalent to let $x = u$ in $e$). A term of the form $x \equiv u$ is called a binding since it binds the variable $x$ to the expression $u$. For example,

$$\mathrm{set}\{\, b.D \mid a \leftarrow x,\ b \equiv y,\ a.B = b.C \,\} = \mathrm{set}\{\, y.D \mid a \leftarrow x,\ a.B = y.C \,\}$$




2.2 Program Normalization

The monoid calculus can be put into a canonical form by an efficient rewrite algorithm, called the normalization algorithm (described in detail elsewhere [10]). The evaluation of these canonical forms generally produces fewer intermediate data structures than the initial unnormalized programs. Moreover, the normalization algorithm improves program performance in many cases. The normalization algorithm will be used as a prephase to our query evaluator since canonical forms are a convenient program representation that facilitates program transformation. The physical design framework described in Section 3 uses this algorithm to eliminate value coercions introduced when mapping logical queries into physical programs.

The normalization algorithm is a pattern-based rewriting algorithm. One example of a rewriting rule that this algorithm uses is unnesting nested comprehensions (i.e., comprehensions that contain a generator whose domain is another comprehension):

$$M\{\, e \mid r,\ v \leftarrow N\{\, e' \mid t \,\},\ s \,\} \longrightarrow M\{\, e \mid r,\ t,\ v \equiv e',\ s \,\} \qquad (8)$$

Rules 7 and 8 are the most complex rules of the normalization algorithm. The other rules include trivial reductions, such as reducing a projection over a tuple construction to the corresponding tuple component. Rule 8 may require some variable renaming to avoid name conflicts. The following is an example of a program normalization that requires variable renaming. The program filter(p)(filter(q) x) is computed by

set{ a | a ← set{ a | a ← x, q(a) }, p(a) }
= set{ a | a ← set{ b | b ← x, q(b) }, p(a) }

(by renaming variable a to b) and is normalized into

→ set{ a | b ← x, q(b), a ≡ b, p(a) }    (by Rule 8)
→ set{ b | b ← x, q(b), p(b) }    (by Rule 7)
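The effect of this normalization can be checked on concrete data with a small Python sketch. The two functions are hand-written counterparts of the unnormalized and normalized programs; they are assumptions of this example and do not implement the rewriting algorithm itself.

```python
def nested_filter(x, p, q):
    """filter(p)(filter(q) x) as written: the inner set is materialized first."""
    inner = {b for b in x if q(b)}           # intermediate structure
    return {a for a in inner if p(a)}


def fused_filter(x, p, q):
    """Normalized form set{ b | b <- x, q(b), p(b) }: a single pass over x
    with no intermediate set."""
    return {b for b in x if q(b) and p(b)}


data = set(range(20))
is_even = lambda n: n % 2 == 0
is_big = lambda n: n > 10
```

Both functions compute the same set; only the fused one avoids building `inner`.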

A path $\mathit{path}$ is a $\mathit{name}$ (the identifier of a bound variable, or the identifier of a persistent variable, or the name of a class extent) or an expression $\mathit{path}'.\mathit{name}$ (where $\mathit{name}$ is an attribute name of a record and $\mathit{path}'$ is a path). If the generator domains in a comprehension (i.e., expressions $e$ in $v \leftarrow e$) do not contain any non-commutative merges (such as the list append), then these domains can be normalized into paths [10]. In the next section we will use the following shorthand: a path expression (as it is defined in [12]) is an expression of the form $db.\mathit{pth}_1.\mathit{pth}_2.\,\cdots\,.\mathit{pth}_{n+1}$, where each $\mathit{pth}_i$ is a path and $db$ is the conceptual database state, and whose interpretation in our calculus is

$$\mathrm{set}\{\, v_n.\mathit{pth}_{n+1} \mid v_1 \leftarrow db.\mathit{pth}_1,\ v_2 \leftarrow v_1.\mathit{pth}_2,\ \ldots,\ v_n \leftarrow v_{n-1}.\mathit{pth}_n \,\}$$

In addition to the normalization rules, there are other important program transformations that exploit the commutativity properties of monoids. In particular, if $M$ is a commutative monoid, then we have the following join commutativity rule:

$$M\{\, e \mid r,\ v_1 \leftarrow e_1,\ v_2 \leftarrow e_2,\ s \,\} \longrightarrow M\{\, e \mid r,\ v_2 \leftarrow e_2,\ v_1 \leftarrow e_1,\ s \,\}$$

which holds only when term $e_2$ does not depend on $v_1$. The following transformation, which is valid for any monoid $M$, pushes a selection before a join if $\mathit{pred}$ does not depend on $v$:

$$M\{\, e \mid r,\ v \leftarrow e_1,\ \mathit{pred},\ s \,\} \longrightarrow M\{\, e \mid r,\ \mathit{pred},\ v \leftarrow e_1,\ s \,\}$$
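Both transformations can be observed on small inputs in Python. The sketch below is illustrative only: lists stand for a non-commutative monoid, sets and sums for commutative ones, and the call counter is a device of this example to show what pushing a selection saves.

```python
xs, ys = [1, 2], [10, 20]

# Swapping two independent generators changes a list result ...
list_xy = [a + b for a in xs for b in ys]
list_yx = [a + b for b in ys for a in xs]
# ... but not a result over a commutative monoid such as set or sum.
set_xy = {a + b for a in xs for b in ys}
set_yx = {a + b for b in ys for a in xs}


def run(push_early):
    """Evaluate the same join with the predicate after or before the
    generator b <- ys, counting how often the predicate is evaluated."""
    calls = 0

    def pred(a):                  # depends on a only, not on b
        nonlocal calls
        calls += 1
        return a > 1

    if push_early:
        out = [(a, b) for a in xs if pred(a) for b in ys]
    else:
        out = [(a, b) for a in xs for b in ys if pred(a)]
    return out, calls
```

The result is identical in both orders, but the pushed-down form evaluates the predicate once per element of `xs` instead of once per pair.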

3 Physical Design

In this section we show how to translate queries against the conceptual database into queries against the physical database in a way that reflects a user-specified physical design. The translation process is described through examples that illustrate the basic idea. The physical design language is presented in Section 4, while the rules for generating the query translator from a physical design are presented in Section 5. In the first example we normalize a nested relation. We intentionally kept this example simple so that one can easily express the abstraction function and the plan transformers by simply observing the conceptual and the physical schema. These observations will help us understand how these programs are generated automatically by the optimizer-generation component of our translator. We use these programs to translate a logical query into a physical plan and to derive alternative plans. The second example is more complex. It is based on a conceptual OODB schema with a complex physical design. The purpose of this example is to support our claim that the same theory can be easily scaled up to capture more complex designs.

3.1 Example 1: Mapping Nested Relations into Flat Relations

Consider the following NF² conceptual database schema:

db: set( ⟨ A: int, B: set( ⟨ C: int, D: int ⟩ ), E: int ⟩ )

Suppose that we want to implement this schema using flat table structures. The standard approach is to normalize the nested collection into two tables T1 and T2: table T1 holds the outer set while table T2 holds the union of all the inner sets. Then, whenever a query manipulates the initial nested collection, this nested collection is reconstructed via an implicit join. Furthermore, suppose that we want to implement the set as a B-tree indexed by A and we want to add a secondary index (also implemented as a B-tree) indexed by E. Using our physical design language (which will be described in detail in Section 4), this specification is expressed by the following physical design directives:

directives = { implement( db, sorted[A] ),    (1)
               normalize( db.B ),             (2)
               secondary( db, E ) }           (3)

Directive (1) indicates that the outer set be implemented as a B-tree indexed by A. Directive (2) indicates that the nested set (reached by the path expression db.B) be normalized. Directive (3) indicates that there will be a secondary index attached to the outer set. One possible internal (physical) schema that captures this design is the following:

DB: ⟨ T1: sorted[A]( ⟨ A: int, B: ⟨ ⟩, E: int ⟩ ),
      T2: sorted[#]( ⟨ #: TID, INFO: ⟨ C: int, D: int ⟩ ⟩ ),
      T3: sorted[E]( ⟨ #: TID, E: int ⟩ ) ⟩

where ⟨ ⟩ is the empty record, which indicates that the B attribute in T1 is of no interest, since the inner set in the conceptual database is normalized into T2. Each record in the physical schema is associated with a tuple identifier (of type TID) that holds the actual location of this record on disk. The tuple identifier of a record x is accessed by @x. The # attributes in T2 and T3 hold tuple identifiers. Sequence T1 is implemented as a sequence sorted by A, that is, ∀x, y ∈ T1: @x ≤ @y ⇒ x.A ≤ y.A. A similar equation holds for the secondary index T3. Sequence T2 is indexed by the # attribute, that is, ∀x, y ∈ T2: @x ≤ @y ⇒ x.# ≤ y.#. If x ∈ T2 is a child of y ∈ T1, then x.# = @y. The inner set of the conceptual database is implemented as a sorted[#] sequence so that the join between T1 and T2 over the join predicate x.# = @y, which reconstructs the nested set, can be performed as a merge join. Similarly, for each x ∈ T1 there is y ∈ T3 such that y.# = @x and y.E = x.E.

Let R be the abstraction function that maps the physical schema DBtype to the conceptual schema dbtype. That is, if db of type dbtype is the database state as a user sees it and DB of type DBtype is the actual database state as it is stored on disk, then db = R(DB). For our example, we have:

R(DB) = set{ ⟨ A = a.A,
               B = set{ ⟨ C=b.INFO.C, D=b.INFO.D ⟩ | b ← DB.T2, b.#=@a },
               E = a.E ⟩
           | a ← DB.T1 }

In addition, there is a relationship between the table T1 and its secondary index T3. This relationship can be captured by the function C (a plan transformer), which represents a referential integrity constraint on the physical schema:

C(DB) = ⟨ @=@DB,
          T1 = sorted[A]{ ⟨ @=@a, A=a.A, B=a.B, E=b.E ⟩ | a ← DB.T1, b ← DB.T3, b.#=@a },
          T2 = DB.T2, T3 = DB.T3 ⟩




The equation C(DB) = DB is true for any database instance DB because of the information redundancy introduced by the secondary index. This equation indicates that the values stored in table T1 can also be retrieved by joining T1 with T3. That is, if a tuple b of the secondary index T3 is located (e.g., by providing the value b.E), then the associated tuple a of T1 is located by the equijoin. The tuple identifier @ of the resulting tuples in T1 is set to @a so that the tuples in T1 have the same tuple identifiers as those generated by the comprehension. That is, the TID @ is handled as a record attribute, even though it does not occupy any physical space. This function makes the tuple identifiers of all the records in DB equal to the tuple identifiers generated by the expression in the C definition.

An abstract query is a function f over the conceptual database db. For example:

f(db) = sum{ y.C | x ← db, y ← x.B, x.A=10, y.D > 5 }

The implementation of f(db) is F(DB) = f(R(DB)):

F(DB) = sum{ y.C | x ← R(DB), y ← x.B, x.A=10, y.D > 5 }
      = sum{ y.C | x ← set{ ⟨ A=a.A,
                             B=set{ ⟨ C=b.INFO.C, D=b.INFO.D ⟩ | b ← DB.T2, b.#=@a },
                             E=a.E ⟩
                           | a ← DB.T1 },
                   y ← x.B, x.A=10, y.D > 5 }

If we normalize this expression using our normalization algorithm, we get:

→ sum{ y.C | a ← DB.T1,    (by Rule 8)
             x ≡ ⟨ A=a.A, B=set{ ⟨ C=b.INFO.C, D=b.INFO.D ⟩ | b ← DB.T2, b.#=@a }, E=a.E ⟩,
             y ← x.B, x.A=10, y.D > 5 }

→ sum{ y.C | a ← DB.T1,    (by Rule 7)
             y ← set{ ⟨ C=b.INFO.C, D=b.INFO.D ⟩ | b ← DB.T2, b.#=@a },
             a.A=10, y.D > 5 }

→ sum{ y.C | a ← DB.T1, b ← DB.T2, b.#=@a,    (by Rule 8)
             y ≡ ⟨ C=b.INFO.C, D=b.INFO.D ⟩, a.A=10, y.D > 5 }

→ sum{ b.INFO.C | a ← DB.T1, b ← DB.T2,
                  b.#=@a, a.A=10, b.INFO.D > 5 }    (by Rule 7)

We see that the initial dependent join, which was over a nested collection, is flattened into a 1NF join. Notice that DB.T1 is sorted by both @ and A attributes while DB.T2 is sorted by @ and #. That is, the derived program has the functionality of a sort-merge join since the join predicate is b.#=@a. This functionality can be deduced directly from the types of the comprehension generators. In contrast to most query optimization approaches, the programs derived in our framework are guaranteed to be correct since our framework uses transformations that are purely algebraic and meaning preserving.
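The claim that the normalized plan computes the same answer as f(R(DB)) can be checked on a toy instance. Everything in the sketch below is an assumption made for illustration: the sample data, the encoding of records as dictionaries with the TID stored under the key "@", and the encoding of conceptual tuples as (A, B, E) triples. The empty B attribute of T1 is omitted, and the sample data contains no duplicate inner tuples, so the difference between a set and a table does not show.

```python
# Hypothetical physical database: T1 sorted by A, T2 by #, T3 by E.
DB = {
    "T1": [{"@": 1, "A": 5, "E": 30}, {"@": 2, "A": 10, "E": 20}],
    "T2": [
        {"#": 1, "INFO": {"C": 1, "D": 9}},
        {"#": 2, "INFO": {"C": 3, "D": 4}},
        {"#": 2, "INFO": {"C": 7, "D": 8}},
    ],
    "T3": [{"#": 2, "E": 20}, {"#": 1, "E": 30}],
}


def R(DB):
    """Abstraction function: rebuilds the nested conceptual database."""
    return frozenset(
        (a["A"],
         frozenset((b["INFO"]["C"], b["INFO"]["D"])
                   for b in DB["T2"] if b["#"] == a["@"]),
         a["E"])
        for a in DB["T1"]
    )


def f(db):
    """Logical query: sum{ y.C | x <- db, y <- x.B, x.A=10, y.D > 5 }."""
    return sum(C for (A, B, E) in db for (C, D) in B if A == 10 and D > 5)


def F(DB):
    """Normalized physical plan: a flat join of T1 and T2, no nested set built."""
    return sum(b["INFO"]["C"]
               for a in DB["T1"] for b in DB["T2"]
               if b["#"] == a["@"] and a["A"] == 10 and b["INFO"]["D"] > 5)
```

Here `f(R(DB))` materializes the conceptual database before querying it, while `F(DB)` touches only the two flat tables.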

The alternative access path of using the secondary index T3 can be derived from the equation F′(DB) = F(C(DB)):

F′(DB) = sum{ b.INFO.C | a ← C(DB).T1, b ← C(DB).T2,
                         b.#=@a, a.A=10, b.INFO.D > 5 }
       = sum{ b.INFO.C | a ← sorted[A]{ ⟨ @=@c, A=c.A, B=c.B, E=d.E ⟩
                                        | c ← DB.T1, d ← DB.T3, d.#=@c },
                         b ← DB.T2, b.#=@a, a.A=10, b.INFO.D > 5 }    (by C def)

→ sum{ b.INFO.C | c ← DB.T1, d ← DB.T3, d.#=@c,
                  a ≡ ⟨ @=@c, A=c.A, B=c.B, E=d.E ⟩,    (by Rule 8)
                  b ← DB.T2, b.#=@a, a.A=10, b.INFO.D > 5 }

→ sum{ b.INFO.C | c ← DB.T1, d ← DB.T3, b ← DB.T2,    (by Rule 7)
                  d.#=@c, b.#=@c, c.A=10, b.INFO.D > 5 }




The resulting program is an alternative plan to evaluate the initial logical query. It is a 3-way sort-merge join that corresponds to the alternative access path associated with the secondary index T3. Both programs F′(DB) and F(DB) should be considered by the query optimizer for costing. If there were many integrity constraints because of multiple access paths, then an optimization step would consist of selecting one of the plan transformers C, substituting C(DB) for DB in the current program, and normalizing the resulting program. The optimization process consists of the exploration of all the alternative programs generated by applying this optimization step multiple times as well as of using the commutativity and associativity properties of monoids.
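A single optimization step of this kind can be imitated in Python on a toy instance. The data, the dictionary encoding of records (TID under the key "@", the empty B attribute omitted), and the function names are assumptions of this sketch; `F_alt` is a hand-normalized counterpart of F′, not the output of an actual normalizer.

```python
# Hypothetical physical database: T1 sorted by A, T2 by #, T3 by E.
DB = {
    "T1": [{"@": 1, "A": 5, "E": 30}, {"@": 2, "A": 10, "E": 20}],
    "T2": [
        {"#": 1, "INFO": {"C": 1, "D": 9}},
        {"#": 2, "INFO": {"C": 3, "D": 4}},
        {"#": 2, "INFO": {"C": 7, "D": 8}},
    ],
    "T3": [{"#": 2, "E": 20}, {"#": 1, "E": 30}],
}


def C(DB):
    """Plan transformer: T1 is rebuilt by joining T1 with its secondary index T3."""
    t1 = [{"@": a["@"], "A": a["A"], "E": b["E"]}
          for a in DB["T1"] for b in DB["T3"] if b["#"] == a["@"]]
    return {"T1": t1, "T2": DB["T2"], "T3": DB["T3"]}


def F(DB):
    """Physical plan that joins T1 and T2 only."""
    return sum(b["INFO"]["C"]
               for a in DB["T1"] for b in DB["T2"]
               if b["#"] == a["@"] and a["A"] == 10 and b["INFO"]["D"] > 5)


def F_alt(DB):
    """Alternative plan: the 3-way join of T1, T3, and T2."""
    return sum(b["INFO"]["C"]
               for c in DB["T1"] for d in DB["T3"] for b in DB["T2"]
               if d["#"] == c["@"] and b["#"] == c["@"]
               and c["A"] == 10 and b["INFO"]["D"] > 5)
```

Because `C(DB)` equals `DB` on a consistent instance, `F(C(DB))`, `F_alt(DB)`, and `F(DB)` all agree, and the choice between them is purely a matter of cost.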

3.2 Example 2: OODB Physical Design

The example presented here translates an OODB query into a physical plan that reflects an OODB physical design. The conceptual database schema is the following:

class hotel = ⟨ name: string, address: string, facilities: set(string),
                rooms: set( ⟨ beds: int, price: int ⟩ ) ⟩
  extent: hotels;

class city = ⟨ name: string, hotels: bag(hotel),
               places_to_visit: list( ⟨ name: string, address: string ⟩ ) ⟩
  extent: cities;

where the extent name is a collection of all instances of a class. The database schema db associated with this specification is the aggregation of all class extents along with a number of persistent variables. To make our examples short, though, we will assume that there are no persistent variables. In that case, db has type:

⟨ hotels: set(hotel), cities: set(city) ⟩

As we mentioned earlier, physical design in our framework consists of a set of physical design directives specified by the database implementor. In order to reduce the number of required physical directives, we assume a default implementation for the database. Then the physical design directives are commands to change these defaults.

In the default implementation, objects from two different classes are not clustered together. That is, the hotels extent will be stored in a different storage collection than the cities extent, while each cities.hotels bag will be a bag of OIDs¹ that reference hotels. But the database implementor can cluster cities and hotels together by stating the right physical directive. The default implementation for a nested collection, such as hotels.rooms, is the direct storage model [23]: all hierarchical object structures are stored in preorder form. For example, hotels and hotels.rooms are clustered together, with the rooms of a hotel stored adjacent to the hotel.

The following is an example of physical design directives specified by the database implementor during the physical design of the previous OODB example:

directives = { implement( cities, sorted[name] ),    (1)
               implement( hotels, sorted[name] ),    (2)
               secondary( hotels, address ),         (3)
               normalize( cities.hotels ),           (4)
               join_index( hotels.rooms ) }          (5)

Directives (1) and (2) indicate that both cities and hotels will be implemented as B-trees indexed by name. Directive (3) indicates that a secondary index on attribute address will be attached to hotels. Directive (4) indicates that cities.hotels will be normalized. The conceptual nested collection is reconstructed by a join. Directive (5) requests a binary join index for hotels.rooms. This directive implies that hotels.rooms be normalized and that there will be an additional index for accelerating the join between the normalized tables.

According to these physical design directives, the physical schema DB for our OODB example is the following (it is automatically generated by a program described in Section 5):

⟨ hotels: sorted[name]( ⟨ name: string, address: string,
                          facilities: list(string), rooms: ⟨ ⟩ ⟩ ),
  cities: sorted[name]( ⟨ name: string, hotels: ⟨ ⟩,
                          places_to_visit: list( ⟨ name: string, address: string ⟩ ) ⟩ ),
  cities_hotels: sorted[#]( ⟨ #: TID, INFO: TID ⟩ ),
  hotels_rooms: sorted[#]( ⟨ #: TID, INFO: ⟨ beds: int, price: int ⟩ ⟩ ),
  hotels_rooms_JI: sorted[FROM]( ⟨ FROM: TID, TO: TID ⟩ ),
  hotels_address: sorted[address]( ⟨ #: TID, address: string ⟩ ) ⟩

¹ We decided to capture OIDs as tuple identifiers only to make the algorithms and examples easier to understand. A better alternative for OIDs might be to use surrogates, i.e., system generated unique numbers.

The abstraction function R(DB), which is also generated automatically, is the following:

R(DB) = ⟨ hotels = set{ ⟨ name = b.name, address = b.address,
                          facilities = set{ x | x ← b.facilities },
                          rooms = set{ ⟨ beds=r.INFO.beds, price=r.INFO.price ⟩
                                       | i ← DB.hotels_rooms_JI, r ← DB.hotels_rooms,
                                         i.FROM=@b, i.TO=@r } ⟩
                        | b ← DB.hotels },
          cities = set{ ⟨ name = a.name,
                          hotels = bag{ @x | b ← DB.cities_hotels, x ← DB.hotels,
                                             b.#=@a, @x=b.INFO },
                          places_to_visit = list{ ⟨ name=c.name, address=c.address ⟩
                                                  | c ← a.places_to_visit } ⟩
                        | a ← DB.cities } ⟩

That is, the set of rooms in a hotel b is reconstructed by joining the normalized table hotels_rooms with the join index hotels_rooms_JI. The set of all hotel references cities.hotels in a city a is reconstructed by joining the normalized table cities_hotels with the hotels extent.

The plan transformer generated (because of the secondary index) is the following:

C(DB) = ⟨ @=@DB,
          hotels = sorted[address]{ ⟨ @=@x, name=x.name, address=y.address,
                                      facilities=x.facilities, rooms=x.rooms ⟩
                                    | x ← DB.hotels, y ← DB.hotels_address, y.#=@x },
          cities=DB.cities, cities_hotels=DB.cities_hotels, hotels_rooms=DB.hotels_rooms,
          hotels_rooms_JI=DB.hotels_rooms_JI, hotels_address=DB.hotels_address ⟩

We now translate a logical query against our OODB schema into a physical plan:

set{ h.name | c ← db.cities, h ← c.hotels, p ← c.places_to_visit,
              c.name="Portland", h.name=p.name }

This query finds all hotels in Portland that are also interesting places to visit. It is translated into

set{ h.name | c ← R(DB).cities, h ← c.hotels, p ← c.places_to_visit,
              c.name="Portland", h.name=p.name }

which, when normalized by Rules 8 and 7, becomes

set{ x.name | a ← DB.cities, c ← a.places_to_visit, b ← DB.cities_hotels,
              x ← DB.hotels, @x=b.INFO, b.#=@a,
              a.name="Portland", x.name=c.name }

Observe that this query is purely in terms of physical storage structures and has no nested comprehensions; hence it is not reconstructing any of the structures in the conceptual database. The resulting program is still a dependent join since c is derived from a.places_to_visit. But the collection DB.cities.places_to_visit is not normalized. Therefore, all places_to_visit are clustered together with the cities. Hence, when a city a is retrieved, all places to visit in a are retrieved as well.

If we use the secondary index secondary(hotels, address), the previous program becomes

set{ x.name | a ← C(DB).cities, c ← a.places_to_visit, b ← C(DB).cities_hotels,
              x ← C(DB).hotels, @x=b.INFO, b.#=@a,
              a.name="Portland", x.name=c.name }

which, when normalized by Rules 8 and 7, becomes

set{ y.name | a ← DB.cities, c ← a.places_to_visit,
              b ← DB.cities_hotels, y ← DB.hotels, z ← DB.hotels_address,
              z.#=@y, @y=b.INFO, b.#=@a, a.name="Portland", y.name=c.name }

4 Physical Design Specification

The following is the detailed description of the physical design directives. This description is by no means a complete list. It can be easily extended to incorporate new physical design techniques, new storage structures, and new physical algorithms. Such extensions are easy to incorporate because, as we will see next, each design technique can be expressed in a declarative way, in the form of a rule that is independent of the other rules. We have been experimenting with vertical partition of collections, hierarchical join indices [23], implementation of OIDs with surrogates, materialized functions and views, and denormalization [17] (where two collections that are not nested together are stored as a nested collection), but we decided not to include them here to simplify the exposition of the translation algorithms. The physical design directives are the following:

implement(path, M): sets the implementation of the collection reached by the path expression path to M. (The monoid M represents a storage structure, such as an ordered list, a hash table, etc.)

secondary(path, attrb): attaches a secondary index on attribute attrb to the collection reached by path (in addition to the possible primary index specified by the implement directive). The secondary index may be attached to a deeply nested collection.

normalize(path): normalizes the nested collections reached by path into one collection. Each element of this collection contains a reference (a TID) to its owner object. The original nested collection can be reconstructed by joining the path with this collection.

join_index(path): is like normalize(path) but it also creates a binary join index to speed up the join between the path and the normalized collection.

cluster(path): path should be either a reference to a class or a collection of class references (such as set(person)). It clusters the class instances reached by path together with the path (instead of storing these instances into the class extent).

partition(path, f): specifies a horizontal partition of the collection reached by path. Function f is the partition function. Two elements x and y of the collection belong to the same partition if f(x) = f(y). If the collection e (an instance of M) is reached by path, then the horizontal partitions are computed as follows:

$$\mathrm{sorted}[\mathrm{KEY}]\{\, \langle \mathrm{KEY} = f(x),\ \mathrm{PARTITION} = M\{\, a \mid a \leftarrow e,\ f(a) = f(x) \,\} \rangle \mid x \leftarrow e \,\}$$
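The partition formula can be sketched in Python for the case where M is a list. Two assumptions are made here and are not stated in the text: elements that share a key collapse into a single KEY entry of the sorted[KEY] result, and `unpartition` follows the shape of the partition rule for generating the abstraction function, M{ y | x ← e1, y ← x.PARTITION, f(y) = x.KEY }.

```python
def partition(e, f):
    """One record per distinct key, ordered by KEY; each PARTITION keeps
    the elements of e with that key in their original order."""
    keys = sorted({f(x) for x in e})
    return [{"KEY": k, "PARTITION": [a for a in e if f(a) == k]} for k in keys]


def unpartition(parts, f):
    """Rebuilds the logical collection from its horizontal partitions."""
    return [y for x in parts for y in x["PARTITION"] if f(y) == x["KEY"]]


values = [5, 12, 7, 18, 3]
by_decade = lambda v: v // 10     # hypothetical partition function
parts = partition(values, by_decade)
```

Rebuilding from `parts` returns the same elements as `values`, although grouped by partition rather than in the original order.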

5 The Optimizer Generator

The following algorithms generate the physical schema, the abstraction function, and the semantic constraints from the conceptual schema and from the physical design directives. To make the algorithms simple, we assumed that the physical design directives have been checked for semantic correctness and for possible conflicts before they are fed to these algorithms (e.g., all expression paths in the directives are valid within the conceptual database schema).




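$$
\begin{aligned}
&\mathcal{E}([\![\mathit{basic\_type}]\!],\, \mathit{path},\, e_1,\, e_2) \;\rightarrow\; e_1 \\[4pt]
&\mathcal{E}([\![\mathit{class\_name}]\!],\, \mathit{path},\, e_1,\, e_2) : \mathbf{cluster}(\mathit{path}) \\
&\qquad\rightarrow\; @(\mathcal{E}([\![\mathit{type}(\mathit{db}.\mathit{class\_extent})]\!],\, \mathit{db}.\mathit{class\_extent},\, e_1,\, e_2)) \\[4pt]
&\mathcal{E}([\![\mathit{class\_name}]\!],\, \mathit{path},\, e_1,\, e_2) \\
&\qquad\rightarrow\; \mathrm{pick}\{\, @(\mathcal{E}([\![\mathit{type}(\mathit{db}.\mathit{class\_extent})]\!],\, \mathit{db}.\mathit{class\_extent},\, x,\, x)) \mid x \leftarrow \mathit{DB}.\mathit{class\_extent},\ @x = e_1 \,\} \\[4pt]
&\mathcal{E}([\![\langle A_1 : t_1, \ldots, A_n : t_n \rangle]\!],\, \mathit{path},\, e_1,\, e_2) \\
&\qquad\rightarrow\; \langle\, A_1 = \mathcal{E}([\![t_1]\!],\, \mathit{path}.A_1,\, e_1.A_1,\, e_2),\ \ldots,\ A_n = \mathcal{E}([\![t_n]\!],\, \mathit{path}.A_n,\, e_1.A_n,\, e_2) \,\rangle \\[4pt]
&\mathcal{E}([\![M(t)]\!],\, \mathit{path},\, e_1,\, e_2) : \mathbf{join\_index}(\mathit{path}) \\
&\qquad\rightarrow\; M\{\, \mathcal{E}([\![t]\!],\, \mathit{path},\, r.\mathrm{INFO},\, r) \mid i \leftarrow \mathit{DB}.\mathit{path}\_\mathrm{JI},\ r \leftarrow \mathit{DB}.\mathit{path},\ i.\mathrm{FROM} = @e_2,\ i.\mathrm{TO} = @r \,\} \\[4pt]
&\mathcal{E}([\![M(t)]\!],\, \mathit{path},\, e_1,\, e_2) : \mathbf{normalize}(\mathit{path}) \\
&\qquad\rightarrow\; M\{\, \mathcal{E}([\![t]\!],\, \mathit{path},\, x.\mathrm{INFO},\, x) \mid x \leftarrow \mathit{DB}.\mathit{path},\ x.\# = @e_2 \,\} \\[4pt]
&\mathcal{E}([\![M(t)]\!],\, \mathit{path},\, e_1,\, e_2) : \mathbf{partition}(\mathit{path}, f) \\
&\qquad\rightarrow\; M\{\, y \mid x \leftarrow e_1,\ y \leftarrow \mathcal{E}([\![M(t)]\!],\, \mathit{path},\, x.\mathrm{PARTITION},\, x.\mathrm{PARTITION}),\ f(y) = x.\mathrm{KEY} \,\} \\[4pt]
&\mathcal{E}([\![M(t)]\!],\, \mathit{path},\, e_1,\, e_2) \;\rightarrow\; M\{\, \mathcal{E}([\![t]\!],\, \mathit{path},\, x,\, x) \mid x \leftarrow e_1 \,\}
\end{aligned}
$$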

Figure 3: Generation of the Abstraction Function

the piece of the abstraction function that corresponds to this type. All free variable names that appear in a rule action need to be made unique to avoid the variable capture problem. The entire abstraction function is generated by $\mathcal{E}([\![\mathit{dbtype}]\!],\, \mathit{db},\, \mathit{DB},\, \mathit{DB})$.
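To make the normalize rule of Figure 3 concrete, the sketch below rebuilds a nested collection from a normalized one. It rests on stated assumptions: a simplified, hypothetical schema in which hotels are stored inline rather than as class references, an explicit `tid` field standing in for the TID operator @, and dictionary keys `#` and `INFO` mirroring the field names used in the rules.

```python
# Hypothetical physical state: every stored city carries an explicit TID.
DB = {
    "cities": [
        {"tid": 1, "name": "Portland"},
        {"tid": 2, "name": "Salem"},
    ],
    # Normalized collection: "#" holds the TID of the owner city.
    "cities_hotels": [
        {"#": 1, "INFO": {"name": "Benson"}},
        {"#": 2, "INFO": {"name": "Grand"}},
        {"#": 1, "INFO": {"name": "Hilton"}},
    ],
}


def abstract_hotels(physical, city):
    # list{ x.INFO | x <- DB.cities_hotels, x.# = @city }
    return [x["INFO"] for x in physical["cities_hotels"] if x["#"] == city["tid"]]


def abstraction(physical):
    """Logical view of the physical state: hotels nested inside their city."""
    return {
        "cities": [
            {"name": c["name"], "hotels": abstract_hotels(physical, c)}
            for c in physical["cities"]
        ]
    }


db = abstraction(DB)
```

The logical view is never stored; it is a function of the physical state, which is why a query written against `db` can be rewritten into one against `DB` by unfolding this function.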

The primitive monoid pick in the third rule is over tuple identifiers. Its zero value is null, its unit function is the identity function, and its merge function satisfies $\mathit{merge}_{\mathrm{pick}}(\mathit{null}, x) = x$, otherwise $\mathit{merge}_{\mathrm{pick}}(x, y) = x$. For example, pick{ @x | x ← DB.hotels, @x = h } dereferences a hotel from the class extent DB.hotels using the TID h. If there is no such hotel, then it returns null. If there is more than one such hotel (this never happens, since TIDs are unique), then it returns the first one.
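The behaviour of pick can be sketched as a left fold with null as the zero element. This is an illustrative sketch only: `None` stands in for null, and the class extent is modelled as a hypothetical list of (TID, object) pairs.

```python
from functools import reduce


def merge_pick(x, y):
    # merge_pick(null, x) = x; otherwise merge_pick(x, y) = x
    return y if x is None else x


def pick(values):
    """Fold the values with merge_pick, starting from the zero element null."""
    return reduce(merge_pick, values, None)


# Hypothetical class extent: (TID, hotel) pairs.
hotels = [(101, "Benson"), (102, "Hilton")]


def deref(tid):
    # Keep the first hotel whose TID matches, or None if there is none.
    return pick(h for t, h in hotels if t == tid)
```

With these definitions, `deref(101)` yields the Benson entry and `deref(999)` yields `None`, matching the two cases described in the text.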

The $f(y) = x.\mathrm{KEY}$ predicate in the next-to-last rule in Figure 3, which checks for a partition, is redundant because of the way this partition was constructed. But if there were a generator $v \leftarrow e$ in a comprehension, where e is partitioned by f, and a predicate $f(v) = \mathit{constant}$, then it is translated into $x \leftarrow e,\ y \leftarrow x.\mathrm{PARTITION},\ f(y) = x.\mathrm{KEY},\ f(y) = \mathit{constant}$, which implies $x.\mathrm{KEY} = \mathit{constant}$. That way, only the partition with the specified KEY is retrieved.
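The effect of the implied predicate can be illustrated by comparing a scan of every partition with a scan restricted by KEY. This is a sketch under assumptions: the partitioned collection is a hypothetical list of KEY/PARTITION records, and the function names are illustrative.

```python
def scan_all(partitions, f, constant):
    # x <- e, y <- x.PARTITION, f(y) = x.KEY, f(y) = constant
    return [
        y
        for x in partitions
        for y in x["PARTITION"]
        if f(y) == x["KEY"] and f(y) == constant
    ]


def scan_pruned(partitions, constant):
    # Uses the implied predicate x.KEY = constant to skip whole partitions.
    return [y for x in partitions if x["KEY"] == constant for y in x["PARTITION"]]


# Hypothetical rooms collection partitioned by the number of beds.
partitions = [
    {"KEY": 1, "PARTITION": [{"beds": 1, "price": 100}, {"beds": 1, "price": 120}]},
    {"KEY": 2, "PARTITION": [{"beds": 2, "price": 150}]},
]


def beds(room):
    return room["beds"]
```

Both functions return the same rooms, but `scan_pruned` never iterates over the elements of a partition whose KEY differs from the constant.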

Algorithm 3 (Generation of the Semantic Constraints) For each directive secondary(lpath, attrb), we generate the function

$$C_{\mathit{lpath}}(\mathit{DB}) = \mathcal{S}([\![\mathit{DBtype}]\!],\, \mathit{DB},\, \mathit{ppath})$$

where $\mathit{DBtype} = \mathcal{T}([\![\mathit{dbtype}]\!],\, \mathit{db})$ is the physical database type and ppath is the physical path expression that corresponds to the logical path expression lpath. Function $\mathcal{S}$ is defined as follows:

$$
\begin{aligned}
&\mathcal{S}([\![\langle A_1 : t_1, \ldots, A_n : t_n \rangle]\!],\, e,\, A_i.\mathit{path}) \\
&\qquad\rightarrow\; \langle\, @ = @e,\ A_1 = e.A_1,\ \ldots,\ A_i = \mathcal{S}([\![t_i]\!],\, e.A_i,\, \mathit{path}),\ \ldots,\ A_n = e.A_n \,\rangle \\[4pt]
&\mathcal{S}([\![M(\langle A_1 : t_1, \ldots, A_n : t_n \rangle)]\!],\, e,\, \emptyset) \\
&\qquad\rightarrow\; M\{\, \langle\, @ = @x,\ A_1 = x.A_1,\ \ldots,\ \mathit{attrb} = y.\mathit{attrb},\ \ldots,\ A_n = x.A_n \,\rangle \\
&\qquad\qquad \mid x \leftarrow e,\ y \leftarrow \mathit{DB}.\mathit{lpath}\_\mathit{attrb},\ y.\# = @x \,\} \\[4pt]
&\mathcal{S}([\![M(t)]\!],\, e,\, \mathit{path}) \;\rightarrow\; M\{\, \mathcal{S}([\![t]\!],\, x,\, \mathit{path}) \mid x \leftarrow e \,\}
\end{aligned}
$$




where $\emptyset$ denotes the empty path expression.

For example, if we had specified the directive

    secondary(cities.places_to_visit, name)

we would have the following constraint:

    C_c.ptv(DB) = ⟨ @ = @DB, hotels = DB.hotels,
                    cities = sorted[name]{ ⟨ @ = @a, name = a.name, hotels = a.hotels,
                                             places_to_visit = list{ ⟨ @ = @b, name = c.name, address = b.address ⟩
                                                                     | b ← a.places_to_visit,
                                                                       c ← DB.cities_places_to_visit_name, c.# = @b } ⟩
                                           | a ← DB.cities },
                    ... ⟩

This is a secondary index attached to a nested collection, i.e., we can access any place to visit by providing its name only, without having to go through the cities extent.
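The benefit of such an index can be sketched by comparing a lookup through the index with a traversal of the cities extent. The storage model below is an assumption made for illustration: places live in a hypothetical TID-keyed table, each city holds the TIDs of its places, and the index is a list of records with fields `#` (the TID of the place) and `name`. All data values are invented.

```python
# Hypothetical TID-addressed storage for the nested places.
places = {
    10: {"name": "Zoo", "address": "1 A St"},
    11: {"name": "Rose Garden", "address": "2 B St"},
    12: {"name": "Capitol", "address": "3 C St"},
}

DB = {
    "cities": [
        {"name": "Portland", "places_to_visit": [10, 11]},
        {"name": "Salem", "places_to_visit": [12]},
    ],
    # Secondary index on name: "#" is the TID of the indexed place.
    "cities_places_to_visit_name": [
        {"#": 10, "name": "Zoo"},
        {"#": 11, "name": "Rose Garden"},
        {"#": 12, "name": "Capitol"},
    ],
}


def lookup_by_index(name):
    # Goes straight from the index entry to the place via its TID.
    return [places[c["#"]] for c in DB["cities_places_to_visit_name"] if c["name"] == name]


def lookup_by_scan(name):
    # Reaches the same places only by walking every city first.
    return [
        places[tid]
        for a in DB["cities"]
        for tid in a["places_to_visit"]
        if places[tid]["name"] == name
    ]
```

The two lookups agree on every name, which is what the constraint expresses: the index is redundant with respect to the nested data, so the optimizer may use either access path.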

6 Translation of Updates

In this section we are concerned with the translation of user-level database updates over the conceptual database into updates over the internal database. For example, if a secondary index is attached to a table, then, when we insert an item into this table, we would like the secondary index to be updated as well.

Database updates can be captured by extending the definition of monoid comprehensions with the following comprehension qualifiers: the qualifier path := u destructively replaces the value stored at path with u, the qualifier path += u merges the singleton u with path, and the qualifier path -= u deletes all elements in the collection reached by path that are equal to u.
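The three update qualifiers can be sketched as in-place operations on ordinary Python values. This is a rough illustration under assumptions: collections are modelled as lists, records as dictionaries, and the helper names (`assign`, `insert`, `delete`, `raise_price`) are hypothetical. The boolean result mimics a some-comprehension, which is true when at least one update is performed.

```python
def assign(record, field, u):
    """path := u  (destructively replace the value stored at record[field])."""
    record[field] = u


def insert(collection, u):
    """path += u  (merge the singleton u into a list-like collection)."""
    collection.append(u)


def delete(collection, u):
    """path -= u  (remove every element equal to u)."""
    collection[:] = [v for v in collection if v != u]


def raise_price(rooms, beds, amount):
    """Add amount to the price of every room with the given number of beds."""
    fired = []
    for r in rooms:
        if r["beds"] == beds:
            # On a numeric value, += is modelled here as addition.
            assign(r, "price", r["price"] + amount)
            fired.append(True)
    return any(fired)  # true iff at least one update was performed


rooms = [{"beds": 1, "price": 100}, {"beds": 2, "price": 150}]
```

The list of flags is built completely before `any` is applied, so every qualifying room is updated even though the overall result only reports whether some update happened.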

For example, if the abstract database db is of type set(int), then

    some{ true | a ← db, a > 10, a += 1 }

increments every database element greater than 10 by one. It returns true if at least one update is performed. A more complex example related to the previous OODB schema is the following:

    some{ true | c ← db.cities, c.name = "Portland", h ← c.hotels, h.name = "Benson",
                 r ← h.rooms, r.beds = 1, r.price += 100 }

It increases the price of a single room in Portland's Benson hotel by $100.

If database updates modify primitive values only, then the query translation process described in Section 3 is sufficient for update translation too (since a conceptual path that reaches a primitive value is always translated into a physical path, while a conceptual path that reaches a collection may be translated into a complex comprehension). For example, if we substitute R(DB) for db in the last comprehension and normalize, we get:

    some{ true | a ← DB.cities, a.name = "Portland", b ← DB.cities_hotels,
                 x ← DB.hotels, x.name = "Benson", @x = b.INFO, b.# = @a,
                 i ← DB.hotels_rooms_JI, s ← DB.hotels_rooms,
                 i.FROM = @x, i.TO = @s, s.INFO.beds = 1, s.INFO.price += 100 }

Notice that the update s.INFO.price += 100 is over the physical database.

The difficult case is when we have an update over a collection type, such as the insertion of a new hotel:

    some{ true | c ← db.cities, c.name = "Portland",
                 c.hotels += ⟨ name = "Hilton", address = "Park Ave", facilities = { },
                               rooms = { ⟨ beds = 1, price = 100 ⟩, ⟨ beds = 2, price = 150 ⟩ } ⟩ }




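$$
\begin{aligned}
&\mathcal{U}([\![\mathit{class\_name}]\!],\, \mathit{path},\, \mathit{from},\, \mathit{to}) : \mathbf{cluster}(\mathit{path}) \\
&\qquad\rightarrow\; \mathcal{U}([\![\mathit{type}(\mathit{db}.\mathit{class\_extent})]\!],\, \mathit{db}.\mathit{class\_extent},\, \mathit{from},\, \mathit{to}) \\[4pt]
&\mathcal{U}([\![\mathit{class\_name}]\!],\, \mathit{path},\, \mathit{from},\, \mathit{to}) \;\rightarrow\; [\, \mathit{DB}.\mathit{class\_extent} \mathrel{+\!=} @1\ \mathit{to} \,] \\[4pt]
&\mathcal{U}([\![\langle A_1 : t_1, \ldots, A_n : t_n \rangle]\!],\, p,\, \mathit{from},\, \mathit{to}) \\
&\qquad\rightarrow\; \mathcal{U}([\![t_1]\!],\, p.A_1,\, \mathit{from}.A_1,\, \mathit{to}.A_1) \mathbin{+\!+} \cdots \mathbin{+\!+} \mathcal{U}([\![t_n]\!],\, p.A_n,\, \mathit{from}.A_n,\, \mathit{to}.A_n) \\[4pt]
&\mathcal{U}([\![M(t)]\!],\, p,\, \mathit{from},\, \mathit{to}) \\
&\qquad\rightarrow\; [\, x \leftarrow \mathcal{B}([\![\mathcal{T}([\![M(t)]\!],\, p)]\!],\, \mathit{from}) \,] \mathbin{+\!+} \mathcal{I}([\![M(t)]\!],\, p,\, x,\, \mathit{to}) \mathbin{+\!+} \mathcal{U}([\![t]\!],\, p,\, x,\, \mathit{to}) \\[4pt]
&\mathcal{U}([\![t]\!],\, \mathit{path},\, \mathit{from},\, \mathit{to}) \;\rightarrow\; [\ ] \\[8pt]
&\mathcal{I}([\![M(t)]\!],\, \mathit{path},\, \mathit{from},\, \mathit{to}) : \mathbf{normalize}(\mathit{path}) \\
&\qquad\rightarrow\; [\, \mathit{DB}.\mathit{path} \mathrel{+\!=} @2\ \langle\, \# = @1,\ \mathrm{INFO} = \mathcal{B}([\![\mathcal{T}([\![t]\!],\, \mathit{path})]\!],\, \mathit{from}) \,\rangle \,] \\[4pt]
&\mathcal{I}([\![M(t)]\!],\, \mathit{path},\, \mathit{from},\, \mathit{to}) : \mathbf{join\_index}(\mathit{path}) \\
&\qquad\rightarrow\; [\, \mathit{DB}.\mathit{path}\_\mathrm{JI} \mathrel{+\!=} \langle\, \mathrm{FROM} = @1,\ \mathrm{TO} = @2 \,\rangle \,] \\[4pt]
&\mathcal{I}([\![M(t)]\!],\, \mathit{path},\, \mathit{from},\, \mathit{to}) : \mathbf{secondary}(\mathit{path}, \mathit{attrb}) \\
&\qquad\rightarrow\; [\, \mathit{DB}.\mathit{path}\_\mathit{attrb} \mathrel{+\!=} \langle\, \# = @1,\ \mathit{attrb} = \mathcal{B}([\![\mathcal{T}([\![t]\!],\, \mathit{path})]\!],\, \mathit{from}).\mathit{attrb} \,\rangle \,] \\[4pt]
&\mathcal{I}([\![M(t)]\!],\, \mathit{path},\, \mathit{from},\, \mathit{to}) : \mathbf{partition}(\mathit{path}, f) \\
&\qquad\rightarrow\; [\, x \leftarrow \mathit{to},\ x.\mathrm{KEY} = f(\mathit{from}),\ x.\mathrm{PARTITION} \mathrel{+\!=} \mathcal{B}([\![\mathcal{T}([\![t]\!],\, \mathit{path})]\!],\, \mathit{from}) \,] \\[8pt]
&\mathcal{B}([\![\langle A_1 : t_1, \ldots, A_n : t_n \rangle]\!],\, e) \;\rightarrow\; \langle\, A_1 = \mathcal{B}([\![t_1]\!],\, e.A_1),\ \ldots,\ A_n = \mathcal{B}([\![t_n]\!],\, e.A_n) \,\rangle \\[4pt]
&\mathcal{B}([\![M(t)]\!],\, e) \;\rightarrow\; M\{\, \mathcal{B}([\![t]\!],\, x) \mid x \leftarrow e \,\} \\[4pt]
&\mathcal{B}([\![t]\!],\, e) \;\rightarrow\; e
\end{aligned}
$$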

Figure 4: Update Generation

This conceptual update needs to be translated into the following internal update:

    sum{ 1 | a ← DB.cities, a.name = "Portland",
             DB.hotels +=@1 ⟨ name = "Hilton", address = "Park Ave", facilities = [ ], rooms = ⟨ ⟩ ⟩,
             DB.cities_hotels += ⟨ # = @a, INFO = @1 ⟩,
             DB.hotels_address += ⟨ # = @1, address = "Park Ave" ⟩,
             DB.hotels_rooms +=@2 ⟨ # = @1, INFO = ⟨ beds = 1, price = 100 ⟩ ⟩,
             DB.hotels_rooms_JI += ⟨ FROM = @1, TO = @2 ⟩,
             DB.hotels_rooms +=@2 ⟨ # = @1, INFO = ⟨ beds = 2, price = 150 ⟩ ⟩,
             DB.hotels_rooms_JI += ⟨ FROM = @1, TO = @2 ⟩ }

That is, we may need to perform multiple internal updates for a single conceptual update. Insertions to a collection in the internal database may be tagged by a natural number n: path +=@n u. The update path +=@n u inserts u into the collection reached by path, but it also binds the memory register numbered n to the TID of the newly inserted tuple. The value of this register can be retrieved by evaluating @n. Our physical design language requires only two registers: @1 and @2.
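The register mechanism can be sketched with a small in-memory store that replays the internal update shown above. Everything here is an assumption made for illustration: the `PhysicalDB` class, its sequential integer TIDs, and its method names are hypothetical and are not part of the framework.

```python
class PhysicalDB:
    """Toy store: tables map TIDs to records; registers remember recent TIDs."""

    def __init__(self):
        self.tables = {}
        self.registers = {}
        self._next_tid = 1

    def insert(self, table, record, register=None):
        """path +=@n u: insert u and bind register n to the new tuple's TID."""
        tid = self._next_tid
        self._next_tid += 1
        self.tables.setdefault(table, {})[tid] = record
        if register is not None:
            self.registers[register] = tid
        return tid

    def reg(self, n):
        """Evaluate @n."""
        return self.registers[n]


db = PhysicalDB()
city_tid = db.insert("cities", {"name": "Portland"})

# DB.hotels +=@1 <...>, followed by the dependent updates that read @1.
db.insert("hotels", {"name": "Hilton", "address": "Park Ave",
                     "facilities": [], "rooms": {}}, register=1)
db.insert("cities_hotels", {"#": city_tid, "INFO": db.reg(1)})
db.insert("hotels_address", {"#": db.reg(1), "address": "Park Ave"})
for room in [{"beds": 1, "price": 100}, {"beds": 2, "price": 150}]:
    # DB.hotels_rooms +=@2 <...> rebinds @2 for each inserted room.
    db.insert("hotels_rooms", {"#": db.reg(1), "INFO": room}, register=2)
    db.insert("hotels_rooms_JI", {"FROM": db.reg(1), "TO": db.reg(2)})
```

Register @1 stays bound to the hotel for the whole sequence, while @2 is rebound on every room insertion, which is why two registers suffice for this pattern.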

Algorithm 4 (Update Generation) For each conceptual database update of the form path += e, where path is an M(T) collection, $\mathcal{U}([\![\mathit{type}]\!],\, \mathit{ppath},\, \mathit{path},\, e)$ generates a list of qualifiers that update the physical database (ppath is the logical path expression that corresponds to path, e.g., if path = s.price then ppath = db.hotels.rooms.price, and type is the type of ppath). The algorithm is given in Figure 4. It uses the following support functions:

$\mathcal{I}([\![M(t)]\!],\, \mathit{path},\, \mathit{from},\, \mathit{to})$: generates additional updates for normalized tables, join indices, secondary indices, etc. All applicable rules are executed and the generated qualifier lists are appended.

$\mathcal{B}([\![t]\!],\, e)$: translates the logical expression e into a physical expression that reflects the physical type t. For example,

    B( ⟨ name: string, address: string, facilities: list(string), rooms: ⟨ ⟩ ⟩,
       ⟨ name = "Hilton", address = "Park Ave", facilities = { },
         rooms = { ⟨ beds = 1, price = 100 ⟩, ... } ⟩ )
    = ⟨ name = "Hilton", address = "Park Ave", facilities = [ ], rooms = ⟨ ⟩ ⟩
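The three rules for B in Figure 4 amount to a recursive walk over the physical type. The sketch below follows them under stated assumptions: types are encoded as hypothetical tagged tuples (`("record", {...})` and `("list", t)`), any other type is treated as primitive, and the function name `translate` is illustrative.

```python
def translate(t, e):
    """Shape the logical value e according to the physical type t."""
    if isinstance(t, tuple) and t[0] == "record":
        # B(<A1:t1,...,An:tn>, e) -> <A1 = B(t1, e.A1), ..., An = B(tn, e.An)>
        return {a: translate(ta, e[a]) for a, ta in t[1].items()}
    if isinstance(t, tuple) and t[0] == "list":
        # B(M(t), e) -> M{ B(t, x) | x <- e }, with M taken to be list
        return [translate(t[1], x) for x in e]
    # B(t, e) -> e for every remaining type
    return e


# Physical hotel type: rooms were normalized away, leaving an empty record.
hotel_type = ("record", {
    "name": "string",
    "address": "string",
    "facilities": ("list", "string"),
    "rooms": ("record", {}),
})

logical_hotel = {
    "name": "Hilton",
    "address": "Park Ave",
    "facilities": set(),
    "rooms": [{"beds": 1, "price": 100}, {"beds": 2, "price": 150}],
}

physical_hotel = translate(hotel_type, logical_hotel)
```

Because the physical type of rooms is the empty record, the room data is dropped at this point; in the framework it is the job of the I rules to emit the separate insertions that store the rooms.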

For example, the update generation algorithm generates the following list of qualifiers for the conceptual update x.hotels += e:

    [ DB.hotels +=@1 ⟨ name = e.name, address = e.address, facilities = e.facilities, rooms = ⟨ ⟩ ⟩,
      DB.cities_hotels += ⟨ # = @x, INFO = tid(@1) ⟩,
      DB.hotels_address += ⟨ # = tid(@1), address = e.address ⟩,
      c ← e.rooms,
      DB.hotels_rooms +=@2 ⟨ # = tid(@1), INFO = ⟨ beds = c.beds, price = c.price ⟩ ⟩,
      DB.hotels_rooms_JI += ⟨ FROM = tid(@1), TO = tid(@2) ⟩ ]

Database deletions can be handled in the same way as insertions (by substituting -= for +=). Updates of the form path := e, where path is a collection, can be translated into:

    some{ true | x ← path, path -= x, y ← e, path += y }

7 Related Work

Our framework is based on monoid homomorphisms, which were first introduced as an effective way to capture database queries by V. Tannen and P. Buneman [5, 7, 6]. Their form of monoid homomorphism (also called structural recursion over the union presentation, SRU) is more expressive than our calculus. Operations of the SRU form, though, require the validation of the associativity, commutativity, and idempotence properties of the monoid associated with the output of this operation. These properties are hard to check by a compiler [7], which makes the SRU operation impractical. They first recognized that there are some special cases where these conditions are automatically satisfied, such as for the ext(f)(A) operation. In our view, SRU is too expressive, since inconsistent programs cannot always be detected in that form. To our knowledge, there is no normalization algorithm for SRU forms in general (i.e., SRU forms cannot be put in canonical form). On the other hand, ext(f) is not expressive enough, since it does not capture operations that involve different collection types and it cannot express predicates and aggregates. We believe that our monoid comprehension calculus is the most expressive subset of SRU where inconsistencies can always be detected at compile time and, more importantly, where all programs can be put in canonical form.

Monad comprehensions were first introduced by P. Wadler [24] as a generalization of list comprehensions (which already exist in some functional languages). Monoid comprehensions are related to monad comprehensions, but they are considerably more expressive. In particular, monoid comprehensions can mix inputs from different collection types and may return output of a different type. This mixing of types is not possible for monad comprehensions, since they restrict the inputs and the output of a comprehension to be of the same type. Monad comprehensions were first proposed as a convenient and practical database language by P. Trinder [21, 20], who also presented many algebraic transformations over these forms as well as methods for converting comprehensions into joins. The monad comprehension syntax was also adopted by P. Buneman and V. Tannen [8] as an alternative syntax to monoid homomorphisms. The comprehension syntax was used for capturing operations that involve collections of the same type, while structural recursion was used for expressing the rest of the operations (such as converting one collection type to another, predicates, and aggregates). Our normalization algorithm is highly influenced by L. Wong's work on normalization of monad comprehensions [25]. He presented some powerful rules for flattening nested comprehensions into canonical comprehensions whose generators are over simple paths. These canonical forms are equivalent to our canonical forms for monoid homomorphisms.

Our schema transformation technique is influenced by the Genesis extensible database management system [2, 3]. Genesis introduced a technology that enables customized database management systems to be developed rapidly, using user-defined modules as building blocks. A transformation model is used to map abstract models to concrete implementations. This mapping is done with possibly more than one level of conceptual-to-internal mappings, transferring abstract models to more implementation-oriented ones, until a primitive layer is reached. For each type transformer, the database implementor is responsible for writing the program transformers that translate abstract schemas into concrete schemas, and the operation expanders that translate any operation on an abstract type to a sequence of operations on the concrete type. This framework is more general than ours, since it allows any mapping from abstract to concrete schemas, while ours is guided by the physical design directives. We believe that our approach of using design directives to guide the mapping leaves little room for errors and can be easily modified and extended.

A similar technique for mapping conceptual schemas into internal schemas was used by M. Scholl [17, 18]. More specifically, he considered the problems of clustering and denormalization in a relational database system, that is, mapping flat tables into nested structures in which related objects are clustered together. He also used abstraction functions, called conceptual-to-internal mappings, to capture the schema transformation, but he required these functions to be invertible. He used normalization techniques for obtaining efficient nested queries from the conceptual flat queries, which were based on the algebraic equivalences between the NF² expressions. He recognized that these algebraic transformations can only be effective if they are combined with a redundancy elimination phase where all redundant joins are removed. Even though our physical design framework has different objectives, our approach is very similar to this approach. Our proposed system is more automated, since most of the query translation work is done when compiling the design directives.

Another approach for physical OODB independence was proposed for the PIOS system [1, 16]. PIOS includes a language, called SDL (a storage definition language), that allows one to specify the mapping from the logical to the physical schema in a form similar to our physical design directives. The mappings supported are vertical and horizontal partitioning of classes and object clustering. The physical schema is computed automatically from these specifications and logical operations are mapped to physical operations. Other approaches for physical OODB design include Lanzelotte's work on OODB query optimization [13], which is based on a graph physical design language, and the GMAP system [22], which uses a search-based algorithm to match applicable access paths in a query.

8 Conclusion

Object-oriented database systems have long been criticized for not supporting sufficient levels of data independence. The main reason for this criticism is that early OODB systems used simple pointer chasing to perform object traversals, which did not allow many opportunities for optimization. There are many recent system proposals, though, such as GemStone, O2, and ODMG, that use more sophisticated methods for object traversals. These systems support a declarative language to express queries, and advanced physical structures and alternative access paths to speed up the bulk manipulation of objects. Since object models are more complex than the relational model, most OODB systems lack a formal theory for query translation and optimization that could capture the new advanced physical design proposals that are necessary to speed up object queries.

In this paper we presented a formal framework for achieving complete data independence in an OODB system. The physical design process in this framework consists of the specification of a set of physical design directives that describe in declarative form the physical design of parts of the logical database schema. We use these directives to generate a program (the abstraction function) that automatically transforms any logical query or update into a physical program. These transformations are purely algebraic and can be easily validated for correctness, since they are based on a formal framework. The generation of the abstraction function itself is achieved by a rule-based system, which can be easily extended to incorporate more advanced physical design directives.




9 Acknowledgements

The authors are grateful to Tim Sheard for helpful comments on the paper. This work is supported by the Advanced Research Projects Agency, ARPA order number 18, monitored by the US Army Research Laboratory under contract DAAB-07-91-C-Q518, and by NSF Grants IRI 9118360 and IRI 9509955.

References

[1] N. Aloia, S. Barneva, and F. Rabitti. Supporting Physical Independence in an Object Database Server. In Workshop on Object-Oriented Programming ECOOP '92, pp. 396-412, September 1992. LNCS 615.

[2] D. Batory, J. Barnett, J. Garza, K. Smith, K. Tsukuda, B. Twichell, and T. Wise. Genesis: An Extensible Database Management System. IEEE Transactions on Software Engineering, 14(11):1711-1729, November 1988.

[3] D. S. Batory, J. R. Barnett, J. F. Garza, K. P. Smith, K. Tsukuda, B. C. Twichell, and T. Wise. Genesis: A Reconfigurable Database Management System. Technical report TR-86-07, Department of Computer Science, University of Texas at Austin, March 1986.

[4] E. Bertino. A Survey of Indexing Techniques for Object-Oriented Database Management Systems. In J. Freytag, D. Maier, and G. Vossen, editors, Query Processing for Advanced Database Systems, pp. 384-418. Morgan Kaufmann, 1994.

[5] V. Breazu-Tannen, P. Buneman, and S. Naqvi. Structural Recursion as a Query Language. In Proceedings of the Third International Workshop on Database Programming Languages: Bulk Types and Persistent Data, Nafplion, Greece, pp. 9-19. Morgan Kaufmann Publishers, Inc., August 1991.

[6] V. Breazu-Tannen, P. Buneman, and L. Wong. Naturally Embedded Query Languages. In 4th International Conference on Database Theory, Berlin, Germany, pp. 140-154. Springer-Verlag, October 1992. LNCS 646.

[7] V. Breazu-Tannen and R. Subrahmanyam. Logical and Computational Aspects of Programming with Sets/Bags/Lists. In 18th International Colloquium on Automata, Languages and Programming, Madrid, Spain, pp. 60-75. Springer-Verlag, July 1991. LNCS 510.

[8] P. Buneman, L. Libkin, D. Suciu, V. Tannen, and L. Wong. Comprehension Syntax. SIGMOD Record, 23(1):87-96, March 1994.

[9] L. Fegaras. A Uniform Calculus for Collection Types. Oregon Graduate Institute Technical Report 94-030. Available by anonymous ftp from cse.ogi.edu:/pub/crml/tapos.ps.Z.

[10] L. Fegaras and D. Maier. Towards an Effective Calculus for Object Query Languages. ACM SIGMOD International Conference on Management of Data, San Jose, California, May 1995. Available by anonymous ftp from cse.ogi.edu:/pub/crml/sigmod95.ps.Z.

[11] L. Fegaras and D. Stemple. Using Type Transformation in Database System Implementation. In Proceedings of the Third International Workshop on Database Programming Languages: Bulk Types and Persistent Data, Nafplion, Greece, pp. 337-353. Morgan Kaufmann Publishers, Inc., August 1991.

[12] A. Kemper and G. Moerkotte. Advanced Query Processing in Object Bases Using Access Support Relations. In Proceedings of the Sixteenth International Conference on Very Large Databases, Brisbane, Australia, pp. 290-301. Morgan Kaufmann Publishers, Inc., August 1990.

[13] R. Lanzelotte, P. Valduriez, and J. Ziane, M. Cheiney. Optimization of Nonrecursive Queries in OODBs. Deductive and Object-Oriented Databases, Munich, Germany, pp. 1-21, 1991.

