
Upload: laurence-marshall

Post on 02-Jan-2016


Distributed DBMS University of Shanghai for Science and Technology Page 2.1

Distributed Database Design

A Framework for Distributed Database Design (overview)

The Design of Database Fragmentation (fragmentation design)

The Allocation of Fragments (allocation design)

The three dimensions of distributed system design

Level of sharing: no sharing, data sharing, or data plus program sharing.

Access pattern behavior: static or dynamic (this affects distributed database design and query processing).

Level of knowledge of access patterns: the access patterns are either completely known or only partially known.

[Figure: Dimensions of the Problem. Axes: sharing (data, data + program), access pattern (static, dynamic), level of knowledge (partial information, complete information).]

Centralized database design

1. Designing the "conceptual schema," which describes the integrated database.

2. Designing the "physical database," i.e., mapping the conceptual schema to storage areas and determining appropriate access methods.

Additional requirements of distributed database design

3. Designing the fragmentation.

4. Designing the allocation of fragments, i.e., how fragments are mapped to physical images; the replication of fragments is also determined in this step.

Notes on fragmentation and allocation

Fragmentation design has been partially analyzed in centralized systems with multiple storage devices.

The allocation problem has been studied as the "file allocation problem."

The distinction between the two problems is conceptually relevant: one deals with the "logical criteria" which motivate the fragmentation of a global relation; the other deals with the "physical" placement of data at the various sites.

The two problems are usually interrelated: the optimal fragmentation and allocation cannot be determined by solving them independently.

Application considerations

Distributed database design includes the design of the distributed database and the design of the corresponding distributed applications. For each application we need to know:

1. The site from which the application is issued (site of origin of the application).

2. The frequency of activation of the application (i.e., the number of activations per unit time); for applications which can be issued at multiple sites, we need to know the frequency of activation of each application at each site.

3. The number, type, and statistical distribution of accesses made by each application to each required data "object."

Design objectives

Processing locality

Availability and reliability of distributed data (redundancy control)

Workload distribution

Storage costs and availability

Processing locality

Maximizing processing locality corresponds to the simple principle of placing data as close as possible to the applications which use them.

Maximizing processing locality (minimizing remote references) can be done by adding up the number of local and remote references corresponding to each candidate fragmentation and fragment allocation, and selecting the best solution among them.

The advantage of complete locality is not only the reduction of remote accesses, but also the increased simplicity in controlling the execution of the application.

Availability and reliability of distributed data

A high degree of availability for read-only applications is achieved by storing multiple copies of the same information; the system must be able to switch to an alternative copy when the one that should be accessed under normal conditions is not available.

Reliability is also achieved by storing multiple copies of the same information, since it is possible to recover from crashes or from the physical destruction of one of the copies by using the other, still available copies.

Workload distribution

Distributing the workload over the sites is an important feature of distributed computer systems.

It is done to take advantage of the different powers or utilizations of the computers at each site, and to maximize the degree of parallelism of execution of applications.

Workload distribution might negatively affect processing locality, so the trade-off between them has to be considered.

Storage costs and availability

Database distribution should reflect the cost and availability of storage at the different sites.

It is possible to have specialized sites in the network for data storage, or conversely to have sites which do not support mass storage at all.

The cost of storage is usually not very important compared with the costs of CPU, I/O, and network transmission.

Design approaches

Top-down approach

Bottom-up approach

Top-down approach

Given a database, the problem is how to partition the data and how to allocate the parts to the different sites. The process is as follows:

Start by designing the global schema.

Design the fragmentation of the database.

Allocate the fragments to the sites, creating the physical images.

The approach is completed by performing, at each site, the "physical design" of the data which are allocated to it.

[Figure: Top-Down Design. Diagram labels: Requirements Analysis, Objectives, User Input, Conceptual Design, View Design, View Integration, Access Information, GCS, ES's, Distribution Design, LCS's, Physical Design, LIS's.]

Characteristics of the top-down approach

Advantage: a rough outline of the whole design is visible early.

Problem: when the distributed database is developed as the aggregation of existing databases, it is not easy to follow the top-down approach. The global schema is often produced as a compromise between existing data descriptions.

Bottom-up approach

Existing databases are aggregated (they may also be heterogeneous or fully autonomous), so there is no design problem as such; the task is information integration.

The approach is based on the integration of existing schemata into a single, global schema.

Integration means the merging of common data definitions and the resolution of conflicts among different representations given to the same data.

Bottom-up approach (continued)

Horizontal fragments of the same global relation must have the same relation schema. This property is easily enforced in a top-down design, while it is difficult to "discover" it in a bottom-up design. The integration process should attempt to modify the definitions of local relations, so that they can be regarded as horizontal fragments of a common, global relation.

Bottom-up design requires (in the heterogeneous case):

The selection of a common database model for describing the global schema of the database.

The translation of each local schema into the common data model.

The integration of the local schemata into a common global schema.

The two problems of DDB design

Fragmentation: horizontal fragmentation and vertical fragmentation.

Allocation.

Fragmentation design and allocation design usually need to be considered together.

Horizontal Fragmentation

Primary horizontal fragmentation

Derived horizontal fragmentation

Rules for horizontal fragmentation

If a relation $R$ is fragmented into $F = \{F_1, F_2, \ldots, F_n\}$, then:

Completeness: for every tuple $t \in R$ there exists $F_i \in F$ such that $t \in F_i$.

Disjointness: for every $t \in F_i$ there is no $F_j$ with $i \neq j$ such that $t \in F_j$.

Reconstruction: the reconstruction operation is the union, $R = \bigcup \{F_1, F_2, \ldots, F_n\}$ (this rule can be omitted, because it is implied by completeness).
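The three rules can be checked mechanically on a small example. The following is a minimal sketch, not part of the lecture: it assumes tuples are hashable Python values and fragments are given as collections of tuples, and the function name `check_horizontal_fragmentation` is hypothetical.

```python
def check_horizontal_fragmentation(relation, fragments):
    """Return (complete, disjoint, reconstructible) for a candidate fragmentation.

    relation  -- iterable of hashable tuples (the global relation R)
    fragments -- list of iterables of tuples (F1, ..., Fn)
    """
    relation = set(relation)
    fragments = [set(f) for f in fragments]
    union = set().union(*fragments)

    # Completeness: every tuple of R appears in some fragment.
    complete = relation <= union
    # Disjointness: no tuple is counted twice across fragments.
    disjoint = sum(len(f) for f in fragments) == len(union)
    # Reconstruction: the union of the fragments gives back exactly R.
    reconstructible = union == relation
    return complete, disjoint, reconstructible


# Hypothetical sample relation with tuples of the form (id, location).
R = {(1, "Sa"), (2, "Sb"), (3, "Sa")}
GOOD = [{(1, "Sa"), (3, "Sa")}, {(2, "Sb")}]
OVERLAPPING = [{(1, "Sa"), (3, "Sa")}, {(2, "Sb"), (3, "Sa")}]
INCOMPLETE = [{(1, "Sa")}, {(2, "Sb")}]
```

Calling the function on `GOOD`, `OVERLAPPING`, and `INCOMPLETE` shows one fragmentation that satisfies all rules, one that violates disjointness, and one that violates completeness.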

Horizontal fragmentation: example

EMP (E#, NAME, DEPT, JOB, SAL, TEL, …), with DEPT = {1, 2} and JOB = {'P', '-P'}.

Assume that the applications most often query the employees who belong to department 1 and are programmers (the 80/20 rule).

Possible qualifications for the horizontal fragmentation are then:

P = {DEPT = 1}

P = {DEPT = 1, JOB = 'P'}

P = {DEPT = 1, JOB = 'P', SAL > 500}

How to guarantee the fragmentation rules

Check "by hand," e.g., $F_1 = \sigma_{loc = 'Sa'}(E)$; $F_2 = \sigma_{loc = 'Sb'}(E)$.

Generate predicates that satisfy the fragmentation rules by construction.

Some definitions

Predicate: the condition used by the selection operation that defines a fragment.

1. A simple predicate: Attribute = value, e.g., DEPT = 1.

2. A minterm predicate $y$: given a set of simple predicates $P = \{p_1, p_2, \ldots, p_n\}$,

$y = \bigwedge_{p_i \in P} p_i^*$, that is, $p_1^* \wedge p_2^* \wedge \cdots \wedge p_n^*$,

where ($p_i^* = p_i$ or $p_i^* = \text{NOT } p_i$) and $y \neq \text{false}$.

3. A fragment is the set of all tuples for which a minterm predicate holds.

Predicate generation procedure

Find the simple predicates (of the form Ai θ Value, where θ is a comparison operator) used by the frequent application (AP) queries, such as A < 10, A > 5, Loc = Sa, Loc = Sb.

Generate the minterm predicates.

Eliminate any useless predicates that may arise.

Example

Global relation EMP (EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM)

Assume that some important applications (APs) require information about employees who are members of a given department, while other important APs require only the data of employees who are programmers; these last APs can be issued at any site, and reference all programmers with the same probability.

Assume that there are only two departments, 1 and 2; thus DEPT = 1 implies DEPT ≠ 2, and vice versa.

Two simple predicates are DEPT = 1 and JOB = "P" (programmer). The minterm predicates for these two predicates are:

DEPT = 1 AND JOB = "P"

DEPT = 1 AND JOB ≠ "P"

DEPT ≠ 1 AND JOB = "P"

DEPT ≠ 1 AND JOB ≠ "P"
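The enumeration of minterm predicates can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's own procedure: the sample rows, the `DEPT` and `JOB` column names, and the helper names `minterm_labels` and `fragment_by_minterms` are hypothetical. Each simple predicate is taken either as is or negated, which gives 2^n candidate minterms for n simple predicates; grouping tuples by which predicates they satisfy yields only the non-empty fragments.

```python
from itertools import product

# Hypothetical sample rows; only the attributes used by the predicates are shown.
EMP = [
    {"EMPNUM": 1, "DEPT": 1, "JOB": "P"},
    {"EMPNUM": 2, "DEPT": 1, "JOB": "A"},
    {"EMPNUM": 3, "DEPT": 2, "JOB": "P"},
    {"EMPNUM": 4, "DEPT": 2, "JOB": "A"},
    {"EMPNUM": 5, "DEPT": 1, "JOB": "P"},
]

# Simple predicates: label -> test function on one tuple.
SIMPLE_PREDICATES = {
    'DEPT = 1': lambda t: t["DEPT"] == 1,
    'JOB = "P"': lambda t: t["JOB"] == "P",
}


def minterm_labels(predicates):
    """List every conjunction in which each predicate is kept or negated."""
    names = list(predicates)
    labels = []
    for polarity in product([True, False], repeat=len(names)):
        parts = [n if keep else f"NOT ({n})" for n, keep in zip(names, polarity)]
        labels.append(" AND ".join(parts))
    return labels


def fragment_by_minterms(relation, predicates):
    """Group tuples by the truth values of the simple predicates.

    Each key is a tuple of booleans identifying one minterm; minterms that
    no tuple satisfies simply do not appear in the result.
    """
    names = list(predicates)
    fragments = {}
    for t in relation:
        polarity = tuple(predicates[n](t) for n in names)
        fragments.setdefault(polarity, []).append(t)
    return fragments
```

Because every tuple satisfies exactly one minterm, the fragments produced this way are complete and disjoint by construction.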

Discussion

All the above simple predicates are relevant.

By contrast, SAL > 50, for example, is not a relevant predicate.

Complete and minimal

Let P = {p1, p2, … , pn} be a set of simple predicates.

For the fragmentation to be correct and efficient, P must be complete and minimal.

1. A set P of simple predicates is complete if any two tuples belonging to the same fragment are referenced with the same probability by any application.

2. P is minimal if all its predicates are relevant.

Example

P1 = {DEPT = 1} is not complete, since the applications reference tuples of programmers with a greater probability within each fragment produced by P1.

P2 = {DEPT = 1, JOB = "P"} is complete and minimal.

P3 = {DEPT = 1, JOB = "P", SAL > 50} is complete but not minimal, since SAL > 50 is not relevant.

Fragmentation Method

Basis: consider a predicate p1 which partitions the tuples of R into two parts which are referenced differently by at least one application. Let P = {p1}.

Method: consider a new simple predicate pi which partitions at least one fragment of P into two parts which are referenced in a different way by at least one application.

Set P ← P ∪ {pi}. Eliminate nonrelevant predicates from P. Repeat this step until the set of the minterm fragments of P is complete.

Example

Consider SAL > 50: if programmers have an average salary greater than 50, this predicate determines two sets of employees who are referenced differently by the applications. Set P1 = {SAL > 50}.

Consider DEPT = 1: this predicate is relevant and is added to the previous one, giving P2 = {SAL > 50, DEPT = 1}.

Consider JOB = "P": this predicate is relevant, so set P3 = {SAL > 50, DEPT = 1, JOB = "P"}.

SAL > 50 is then no longer relevant in P3; thus the final set is P4 = {DEPT = 1, JOB = "P"}, which is complete and minimal.


A "reasonable" way

1. Concentrating on a few important applications

2. Not distinguishing fragments whose features are very similar

DEPT (DEPTNUM, NAME, AREA, MGRNUM)

Important applications:

1. Administrative applications, issued only at sites 1 and 3; administrative applications about departments in the northern area are issued at site 1; those about departments in the southern area are issued at site 3.

2. Applications about work conducted at each department; they can be issued at any department, but they reference tuples of the departments which are closer to their site of origin with higher probability than the tuples of other departments.

Set of predicates:

P1: DEPTNUM < 10

P2: 10 < DEPTNUM < 20

P3: DEPTNUM > 20

P4: AREA = "North"

P5: AREA = "South"

Possible minterm predicates

Y1: DEPTNUM < 10 AND AREA = "North"

Y2: DEPTNUM < 10 AND AREA = "South"

Y3: 10 < DEPTNUM < 20 AND AREA = "North"

Y4: 10 < DEPTNUM < 20 AND AREA = "South"

Y5: DEPTNUM > 20 AND AREA = "North"

Y6: DEPTNUM > 20 AND AREA = "South"

Reduce the minterms using the implications between predicates, e.g., AREA = "North" implies that DEPTNUM > 20 is false:

y1: DEPTNUM < 10

y2: (10 < DEPTNUM < 20) AND (AREA = "North")

y3: (10 < DEPTNUM < 20) AND (AREA = "South")

y4: DEPTNUM > 20

Derived Horizontal Fragmentation

DHF: the fragmentation of a relation is derived from the attribute properties or the horizontal fragmentation of another relation.

Using DHF makes join operations between fragments easier.

DHF example

SC (S#, C#, GRADE)

S (S#, SNAME, AGE, SEX)

Fragmentation design:

Define fragment SC1 as
  Select SC.S#, C#, GRADE
  From SC, S
  Where SC.S# = S.S# and SEX = 'M'

Define fragment SC2 as
  Select SC.S#, C#, GRADE
  From SC, S
  Where SC.S# = S.S# and SEX = 'F'
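The same derivation can be expressed as a semi-join of SC with each fragment of S. The sketch below is only an illustration: the sample rows and the helper name `derive_fragments` are hypothetical, and rows are modeled as Python dictionaries keyed by attribute name.

```python
def derive_fragments(member, owner_fragments, key):
    """Derive fragments of `member` by semi-joining it with each owner fragment.

    member          -- list of rows (dicts) of the relation to be fragmented
    owner_fragments -- dict: derived fragment name -> list of owner rows
    key             -- name of the join attribute shared by both relations
    """
    derived = {}
    for name, owner_rows in owner_fragments.items():
        keys = {row[key] for row in owner_rows}
        # Keep the member rows whose join value appears in this owner fragment.
        derived[name] = [row for row in member if row[key] in keys]
    return derived


# Hypothetical sample data for S(S#, SNAME, AGE, SEX) and SC(S#, C#, GRADE).
S = [
    {"S#": 1, "SNAME": "Li", "AGE": 20, "SEX": "M"},
    {"S#": 2, "SNAME": "Wang", "AGE": 21, "SEX": "F"},
    {"S#": 3, "SNAME": "Zhao", "AGE": 19, "SEX": "M"},
]
SC = [
    {"S#": 1, "C#": "C1", "GRADE": 90},
    {"S#": 2, "C#": "C1", "GRADE": 85},
    {"S#": 3, "C#": "C2", "GRADE": 70},
    {"S#": 1, "C#": "C2", "GRADE": 60},
]

# S is first fragmented on SEX; SC1 and SC2 are then derived from those fragments.
S_FRAGMENTS = {
    "SC1": [s for s in S if s["SEX"] == "M"],
    "SC2": [s for s in S if s["SEX"] == "F"],
}
DERIVED = derive_fragments(SC, S_FRAGMENTS, "S#")
```

With this layout, joining SC with S only needs to pair SC1 with the male students and SC2 with the female students.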

Join operations in a distributed database

Distributed join

Join graphs: total, partitioned, simple

Join graph

[Figure: join graph between the fragments of R and S.]

Definition of a join graph:

Circle: a data fragment.

Undirected edge: the two fragments contain tuples with the same value of the join attribute.

Total join graph

[Figure: total join graph between the fragments of R and S.]

Definition: a join graph is total when it contains all possible edges between fragments of R and S.

Partitioned join graph

[Figure: partitioned join graph between the fragments of R and S.]

Definition: a reduced join graph is partitioned if the graph is composed of two or more subgraphs without edges between them.

Simple join graph

[Figure: simple join graph between the fragments of R and S.]

Definition: a reduced join graph is simple if it is partitioned and each subgraph has just one edge.
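The three definitions can be turned into a small classifier. This is a sketch under assumptions that are not in the slides: fragments are identified by name, an edge is a pair (R fragment, S fragment), isolated fragments count as subgraphs of their own, and the function name `classify_join_graph` is hypothetical.

```python
def classify_join_graph(r_frags, s_frags, edges):
    """Classify a join graph as total, simple, partitioned, or neither."""
    edges = set(edges)
    if edges == {(r, s) for r in r_frags for s in s_frags}:
        return "total"

    # Adjacency lists; nodes are tagged by side so R and S names cannot collide.
    adj = {("R", r): set() for r in r_frags}
    adj.update({("S", s): set() for s in s_frags})
    for r, s in edges:
        adj[("R", r)].add(("S", s))
        adj[("S", s)].add(("R", r))

    # Find the connected subgraphs with a depth-first search.
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)

    if len(components) < 2:
        return "reduced, not partitioned"
    # Every edge has exactly one R endpoint, so count edges per subgraph via it.
    edge_counts = [sum(1 for r, _ in edges if ("R", r) in comp) for comp in components]
    return "simple" if all(c == 1 for c in edge_counts) else "partitioned"
```

For instance, two fragments on each side joined pairwise (R1 with S1, R2 with S2) give a simple join graph, which is the case in which a distributed join can be executed fragment by fragment.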

General example (continued)

SUPPLY (SNUM, PNUM, DEPTNUM, QUAN)

SUPPLY is always used together with another relation.

Some applications require information about supplies of given suppliers; they join SUPPLY and SUPPLIER on the SNUM attribute.

The other applications require information about supplies at a given department; they join SUPPLY and DEPT on the DEPTNUM attribute.

DEPT is horizontally fragmented according to values taken by the attribute DEPTNUM.

SUPPLIER is horizontally fragmented according to values taken by the attribute SNUM.

There are two possible derived fragmentations of SUPPLY: one through the semi-join with SUPPLIER on SNUM, and one through the semi-join with DEPT on DEPTNUM. Both of them are correct.

The selection between these alternatives should take into account which one of the two corresponding joins is more used by applications.

Vertical Fragmentation

Vertical fragmentation

Vertical clustering

Objective: group together the attributes that are frequently used by the same application (AP); when there are several APs, the trade-offs between them sometimes have to be weighed.

Vertical Fragmentation

Fragmenting a global relation R vertically is not easy, because the number of possible fragmentations grows sharply as the number of attributes of R increases (the number of possible clusters is even larger).

Two heuristic approaches:

The split approach, in which global relations are progressively split into fragments.

The grouping approach, in which attributes are progressively aggregated to constitute fragments.

General example (continued)

EMP (EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM)

APP1: Administrative applications, concentrated at site 3, requiring NAME, SAL, and TAX of employees.

APP2: Applications about work conducted at each department, requiring NAME, MGRNUM, and DEPTNUM of employees; these applications are issued at all sites, and reference tuples of employees in the same group of departments with 80 percent probability.

Result

EMP1 (EMPNUM, NAME, TAX, SAL)

EMP2 (EMPNUM, NAME, MGRNUM, DEPTNUM)
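A vertical fragmentation is usable only if the global relation can be rebuilt by joining the fragments on the key, which is why EMPNUM appears in both fragments. The sketch below illustrates this with hypothetical sample rows and hypothetical helper names (`project`, `join_on_key`); it is not taken from the lecture.

```python
def project(relation, attrs):
    """Vertical fragment: keep only the listed attributes of every row."""
    return [{a: row[a] for a in attrs} for row in relation]


def join_on_key(left, right, key):
    """Rebuild rows by joining two vertical fragments on their common key."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


# Hypothetical sample rows of the global relation EMP.
EMP = [
    {"EMPNUM": 1, "NAME": "Ann", "SAL": 100, "TAX": 10, "MGRNUM": 7, "DEPTNUM": 1},
    {"EMPNUM": 2, "NAME": "Bob", "SAL": 80, "TAX": 8, "MGRNUM": 7, "DEPTNUM": 2},
]

# The two fragments from the slide; NAME is replicated in both.
EMP1 = project(EMP, ["EMPNUM", "NAME", "TAX", "SAL"])
EMP2 = project(EMP, ["EMPNUM", "NAME", "MGRNUM", "DEPTNUM"])
RECONSTRUCTED = join_on_key(EMP1, EMP2, "EMPNUM")
```

Replicating NAME lets both APP1 and APP2 run against a single fragment, at the price of updating NAME in two places.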

Mixed Fragmentation

The simplest ways:

1. Applying horizontal fragmentation to vertical fragments

2. Applying vertical fragmentation to horizontal fragments

THE ALLOCATION OF FRAGMENTS

Nonredundant allocation (easier): the simplest method is a "best-fit" approach; a measure is associated with each possible allocation, and the site with the best measure is selected.

Redundant allocation: replication introduces further complexity, for example the degree of replication and how retrieval and update are performed.

Discussion

When designing a redundant allocation, one usually first finds a solution of the nonredundant allocation problem and then derives the redundant allocation from it.

The "additional replication" method is a typical heuristic approach; with this method, it is possible to take into account that the increase in the degree of redundancy is progressively less beneficial.

Two methods (for redundant allocation):

1. Determine the set of all sites where the benefit of allocating one copy of the fragment is higher than the cost, and allocate a copy of the fragment to each element of this set; this method selects "all beneficial sites."

2. Determine first the solution of the nonreplicated problem, and then progressively introduce replicated copies starting from the most beneficial; the process is terminated when no "additional replication" is beneficial. With this method, the benefit gained decreases as the degree of redundancy increases.

How to measure the costs and benefits of fragment allocation

General Criteria for Fragment Allocation

$i$ is the fragment index

$j$ is the site index

$k$ is the application index

$f_{kj}$ is the frequency of application $k$ at site $j$

$r_{ki}$ is the number of retrieval references of application $k$ to fragment $i$

$u_{ki}$ is the number of update references of application $k$ to fragment $i$

$n_{ki} = r_{ki} + u_{ki}$

Horizontal fragmentation (nonredundant)

1. Using the "best-fit" approach for a nonreplicated allocation, we place $R_i$ at the site where the number of references to $R_i$ is maximum. The number of local references to $R_i$ at site $j$ is

$B_{ij} = \sum_k f_{kj} n_{ki}$

$R_i$ is allocated at site $j^*$ such that $B_{ij^*}$ is maximum.
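The best-fit rule is a direct evaluation of $B_{ij}$ for every site. The sketch below assumes the frequencies and reference counts are given as nested lists, `f[k][j]` and `n[k][i]`, with 0-based indices; the sample numbers and the function name `best_fit` are hypothetical.

```python
def best_fit(f, n, i):
    """Return (best_site, B) for fragment i, where B[j] = sum_k f[k][j] * n[k][i]."""
    apps = range(len(f))
    sites = range(len(f[0]))
    B = [sum(f[k][j] * n[k][i] for k in apps) for j in sites]
    # The fragment goes to the site with the largest number of local references.
    return max(sites, key=B.__getitem__), B


# Hypothetical input: 2 applications, 3 sites, 2 fragments.
F = [[10, 0, 5],   # f[k][j]: activations of application k at site j
     [2, 8, 0]]
N = [[3, 1],       # n[k][i]: references of application k to fragment i
     [1, 4]]

SITE_0, B_0 = best_fit(F, N, 0)   # B_0 == [32, 8, 15], so fragment 0 goes to site 0
SITE_1, B_1 = best_fit(F, N, 1)   # B_1 == [18, 32, 5], so fragment 1 goes to site 1
```

Ties are broken in favor of the lowest site index here; the slides do not say how ties should be resolved.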

Redundant allocation, approach I

2. Using the "all beneficial sites" method for replicated allocation, we place $R_i$ at all sites $j$ where the cost of retrieval references of applications is larger than the cost of update references to $R_i$ from applications at any other site. $B_{ij}$ is evaluated as the difference:

$B_{ij} = \sum_k f_{kj} r_{ki} - C \sum_k \sum_{j' \neq j} f_{kj'} u_{ki}$

$C$ is a constant which measures the ratio between the cost of an update and a retrieval access; typically, $C > 1$.

$R_i$ is allocated at all sites $j^*$ such that $B_{ij^*}$ is positive; when all $B_{ij}$ are negative, a single copy of $R_i$ is placed at the site such that $B_{ij}$ is maximum.
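The "all beneficial sites" rule can be sketched as follows. The data layout (`f[k][j]`, `r[k][i]`, `u[k][i]` as nested lists with 0-based indices), the sample numbers, the default value of `C`, and the function name `all_beneficial_sites` are assumptions made for the illustration. When no site has a positive measure, the sketch falls back to the single best site, as the rule above prescribes for the all-negative case.

```python
def all_beneficial_sites(f, r, u, i, C=2.0):
    """Return (chosen_sites, B) for fragment i.

    B[j] = sum_k f[k][j] * r[k][i]  -  C * sum_k sum_{j' != j} f[k][j'] * u[k][i]
    """
    apps = range(len(f))
    sites = range(len(f[0]))
    B = []
    for j in sites:
        # Retrievals that become local if a copy is stored at site j.
        retrieval = sum(f[k][j] * r[k][i] for k in apps)
        # Updates issued at the other sites, which must also reach this copy.
        update = sum(f[k][jp] * u[k][i] for k in apps for jp in sites if jp != j)
        B.append(retrieval - C * update)
    chosen = [j for j in sites if B[j] > 0]
    if not chosen:
        chosen = [max(sites, key=B.__getitem__)]
    return chosen, B


# Hypothetical input: 2 applications, 3 sites, 1 fragment.
F = [[10, 0, 5],
     [2, 8, 0]]
R_REFS = [[3], [4]]   # r[k][i]
U_REFS = [[0], [1]]   # u[k][i]

CHOSEN, B = all_beneficial_sites(F, R_REFS, U_REFS, 0)
```

With these numbers the measure is positive at sites 0 and 1 and negative at site 2, so two copies are allocated.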

Redundant allocation, approach II

3. Using the "additional replication" method, the benefit of introducing a new copy is also measured in terms of the increased reliability and availability of the system.

$d_i$: the degree of redundancy of $R_i$

$F_i$: the benefit of having $R_i$ fully replicated at each site

In [1]: $\beta(d_i) = (1 - 2^{1 - d_i}) F_i$. Note that $\beta(1) = 0$, $\beta(2) = F_i/2$, $\beta(3) = 3F_i/4$, and so on.

$B_{ij} = \sum_k f_{kj} r_{ki} - C \sum_k \sum_{j' \neq j} f_{kj'} u_{ki} + \beta(d_i)$

[1] V. Lum et al., "1978 New Orleans Data Base Design Workshop Report," IBM Report PJ2554(33154), 7/13/79, IBM Pres. Lab., San Jose, CA; part of this report is also published in the Fifth VLDB, Rio de Janeiro, 1979.
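The shape of $\beta(d_i)$ explains why each additional copy is less beneficial than the previous one. The short sketch below evaluates the formula; the value chosen for $F_i$ and the helper name `marginal_gains` are hypothetical.

```python
def beta(d, F):
    """Benefit of keeping d copies of a fragment: (1 - 2**(1 - d)) * F."""
    return (1 - 2 ** (1 - d)) * F


def marginal_gains(F, max_copies):
    """Extra benefit contributed by the 2nd, 3rd, ..., max_copies-th copy."""
    return [beta(d, F) - beta(d - 1, F) for d in range(2, max_copies + 1)]


F_I = 8.0                          # hypothetical benefit of full replication
GAINS = marginal_gains(F_I, 4)     # [4.0, 2.0, 1.0]
```

Each new copy contributes half as much as the one before it, so the additional replication process stops once this gain no longer outweighs the extra update cost.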

Vertical fragmentation

Assume that fragment $R_i$, allocated at site $r$, is partitioned vertically into $R_s$ and $R_t$, allocated at sites $s$ and $t$.

1. $A_s$ and $A_t$: sets of applications, issued at sites $s$ or $t$, which use only attributes of $R_s$ or $R_t$

2. $A_1$: set of applications local to $r$ which use only attributes of $R_s$ or $R_t$

3. $A_2$: set of applications local to $r$ which reference attributes of both $R_s$ and $R_t$

4. $A_3$: set of applications at sites different than $r$, $s$, or $t$

We evaluate the benefit of this partitioning as

$B_{ist} = BA_s + BA_t - BA_1 - BA_2 - BA_3$

$= \sum_{k \in A_s} f_{ks} n_{ki} + \sum_{k \in A_t} f_{kt} n_{ki} - \sum_{k \in A_1} f_{kr} n_{ki} - \sum_{k \in A_2} 2 f_{kr} n_{ki} - \sum_{k \in A_3} \sum_{j \neq r,s,t} f_{kj} n_{ki}$
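The benefit of a vertical split can be evaluated term by term. The sketch below is an illustration under assumptions: the application sets are passed in as lists of application indices, frequencies and reference counts are nested lists `f[k][j]` and `n[k][i]` with 0-based indices, the $A_1$ term uses the frequency at site $r$ because those applications are local to $r$, and the function name and sample numbers are hypothetical.

```python
def vertical_split_benefit(f, n, i, r, s, t, A_s, A_t, A_1, A_2, A_3):
    """Benefit of splitting fragment i (at site r) into parts stored at s and t."""
    sites = range(len(f[0]))
    # Applications at s and t that now find all their attributes locally.
    gain = sum(f[k][s] * n[k][i] for k in A_s)
    gain += sum(f[k][t] * n[k][i] for k in A_t)
    # Applications at r that now need one remote part.
    loss = sum(f[k][r] * n[k][i] for k in A_1)
    # Applications at r that now need both remote parts (counted twice).
    loss += sum(2 * f[k][r] * n[k][i] for k in A_2)
    # Applications at all other sites.
    loss += sum(f[k][j] * n[k][i] for k in A_3 for j in sites if j not in (r, s, t))
    return gain - loss


# Hypothetical input: 5 applications, 4 sites, 1 fragment; r = 0, s = 1, t = 2.
F = [[0, 5, 0, 0],
     [0, 0, 4, 0],
     [3, 0, 0, 0],
     [2, 0, 0, 0],
     [0, 0, 0, 6]]
N = [[2], [1], [1], [3], [1]]

BENEFIT = vertical_split_benefit(F, N, 0, 0, 1, 2,
                                 A_s=[0], A_t=[1], A_1=[2], A_2=[3], A_3=[4])
```

Here the gain is 10 + 4 = 14 and the loss is 3 + 12 + 6 = 21, so the split would not be worthwhile for this workload.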

The DATAID-D method

Phases of distributed database design:

Requirements analysis

Conceptual design

Distribution requirements design

Global logical design

Distribution design

Local logical design

Local physical design

The DATAID-D method (continued): design steps

Design data dictionary

Global data schema

Global operation schema

Simplified global schema

Logical access tables

Logical schema of each site

Access tables of each site

Local logical schema

Local physical schema


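The DATAID-D methodology (continued)

Distribution requirements analysis phase: collect information about the distribution, such as the partitioning predicates for horizontal fragmentation and the frequency with which each application is activated at each site.

Distribution design phase: starting from the global schema specification and the collected distribution requirements, produce the fragmentation schema of the global data and the allocation schema that gives the location of each fragment.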


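The DATAID-D methodology (continued)

Distribution requirements analysis phase:
• Frequency table: the number of activations of each application at each site
• Partitioning table: the potential horizontal fragmentation rules that can be applied to each entity in the schema
• Polarization table: the frequency with which a given application issued at a given site accesses a given fragment

Distribution design phase:
• Fragmentation design
• Non-redundant allocation
• Redundant allocation
• Reconstruction of the local schemas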
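One way to picture how the frequency table and the polarization table feed the allocation step is the Python sketch below. It assumes that a polarization entry is the share (between 0 and 1) of an application's accesses from one site that go to one fragment; the slides only call it a frequency, so this reading, the table layout, and all sample values are assumptions.

```python
from collections import defaultdict


def fragment_access_load(frequency, polarization):
    """Combine the two tables into an access load per (site, fragment).

    frequency:    {(application, site): number of activations}
    polarization: {(application, site): {fragment: share of accesses}}
    """
    load = defaultdict(float)
    for (app, site), activations in frequency.items():
        for fragment, share in polarization.get((app, site), {}).items():
            load[(site, fragment)] += activations * share
    return dict(load)


# Hypothetical tables for two applications and two sites.
frequency = {("reservation", "A"): 100, ("reservation", "B"): 50, ("check-in", "A"): 20}
polarization = {
    ("reservation", "A"): {"Flight 1": 0.75, "Flight 2": 0.25},
    ("reservation", "B"): {"Flight 2": 1.0},
    ("check-in", "A"): {"Flight 1": 1.0},
}
load = fragment_access_load(frequency, polarization)
print(load)
```

A fragment is a natural candidate for the site that generates the largest load on it, which is the intuition behind the non-redundant allocation step.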


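Case study: airline reservation system

Three applications:
• Reservation application
• Check-in application
• Departure application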


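[Figure: global schema of the airline reservation database (ER diagram). Entities: Airport, Flight, Passenger. Relationships: From, To, Reservation, Check-in. Attribute labels: departure time, arrival time, symbol, city, authority, area, safety rules, seat number, checked baggage, flight number, date, available seats, gate, seat map, delay, category, name, phone.]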


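[Figure: global operation schema for the Reservation application, which is activated when a passenger books a flight. The diagram annotates Flight with 2000 and 3, Airport with 40 and 2, and Passenger with 10000 and 1, connects them through From, To, and Reservation, and tags the attributes: date [k], departure time [k], symbol [k], arrival time [k], name [w], phone [w], available seats [o,w], category [w].]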


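Distribution results

Airport entity: horizontal fragmentation based on area: Airport 1, Airport 2, Airport 3

Flight entity: derived horizontal fragmentation based on the departure airport: Flight 1, Flight 2, Flight 3

Passenger entity: derived horizontal fragmentation based on the departures of all the flights booked by the passenger: Passenger 1, Passenger 2, Passenger 3, Passenger 4, Passenger 5, Passenger 6, Passenger 7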
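The seven passenger fragments correspond to the non-empty subsets of the three sites from which a passenger's booked flights depart. The Python sketch below enumerates them; the numbering (single sites first, then pairs, then all three) is an assumption chosen so that it agrees with the local schemas of sites 1 to 3, and the function names are hypothetical.

```python
from itertools import combinations

SITES = (1, 2, 3)


def passenger_fragments(sites=SITES):
    """Number every non-empty subset of departure sites, smallest subsets first."""
    subsets = [
        frozenset(combo)
        for size in range(1, len(sites) + 1)
        for combo in combinations(sites, size)
    ]
    return {index: subset for index, subset in enumerate(subsets, start=1)}


def fragments_at(site, fragments):
    """Passenger fragments that must be visible in the local schema of a site."""
    return [index for index, subset in fragments.items() if site in subset]


fragments = passenger_fragments()
for site in SITES:
    print(site, fragments_at(site, fragments))
```

Under this numbering site 1 holds Passenger 1, 4, 5, and 7, site 2 holds Passenger 2, 4, 6, and 7, and site 3 holds Passenger 3, 5, 6, and 7.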


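Distribution results (A)

[Figure: local schema of site 1. Labels in the diagram: Flight 1; From; To; Reservation; Check-in; To; Airport 1; Passenger 1 ∪ Passenger 4 ∪ Passenger 5 ∪ Passenger 7; BC.]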


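Distribution results (B)

[Figure: local schema of site 2. Labels in the diagram: Flight 2; From; To; Reservation; Check-in; To; Airport 2; Passenger 2 ∪ Passenger 4 ∪ Passenger 6 ∪ Passenger 7; AC.]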


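Distribution results (C)

[Figure: local schema of site 3. Labels in the diagram: Flight 3; From; To; Reservation; Check-in; To; Airport 3; Passenger 3 ∪ Passenger 5 ∪ Passenger 6 ∪ Passenger 7; AB.]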


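Bottom-up design

Integrate the various existing database schemas into a global schema.

Three problems:
• Select a common database model to describe the global schema of the database
• Translate the local schema at each site into the common data model
• Integrate the local data schemas of the sites into a common global schema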


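View merging

[Figure: merging two views of Flight. View 1: Flight with flight number, date, available seats, gate, seat map, delay. View 2: Flight with flight number, date, available seats, aircraft type, seat map. Merged view: Flight with subsets Flight 1 and Flight 2, carrying flight number, date, available seats, seat map, gate, delay, aircraft type.]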
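The merge of the two Flight views can be sketched as a set operation: shared attributes move up to the common entity, and the remaining ones stay with the subset for each view. The Python sketch below assumes that attributes with the same name mean the same thing in both views (no naming conflicts); the function name and the assignment of attributes to subsets are an illustration, not taken from the figure.

```python
def merge_views(name, view1_attrs, view2_attrs):
    """Merge two views of the same entity into a parent with two subsets."""
    common = [a for a in view1_attrs if a in view2_attrs]
    return {
        name: common,                                                  # shared attributes
        f"{name} 1": [a for a in view1_attrs if a not in common],      # only in view 1
        f"{name} 2": [a for a in view2_attrs if a not in common],      # only in view 2
    }


view1 = ["flight number", "date", "available seats", "gate", "seat map", "delay"]
view2 = ["flight number", "date", "available seats", "aircraft type", "seat map"]
merged = merge_views("Flight", view1, view2)
for entity, attrs in merged.items():
    print(entity, attrs)
```

The attribute order of the result (shared attributes, then gate and delay, then aircraft type) matches the order in which the merged view lists them.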


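Bottom-up design (continued)

Identifying similarities: if different sites run similar applications, each using a copy of the data in its own database, then the two sites have some points of similarity.

Identifying conflicts:
• Naming conflicts: synonyms (same object, different names) and homonyms (different objects, same name)
• Domain differences: scale differences, different units of measurement
• Structural differences: the same object is described as an entity in one schema and as an attribute in another

Handling data that is inconsistent during operation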


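Examples

[Figure: three view-merging cases.
• View1: Technical staff; View2: Engineer => Technical staff with Engineer as a subset (Is-A)
• View1: Staff; View2: Student => cannot be merged
• View1: Engineer; View2: Clerk => Employee, with Engineer and Clerk as subsets]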


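Examples (continued)

[Figure: merging views that contain a relationship.
• View1: Technical staff, Works-on, Project (cardinality labels 1, n)
• View2: Engineer, Works-on, Project (cardinality labels n, 1)
• => Personnel with subsets Technician and Engineer; Works-on relationship to Project (cardinality labels 1, n)]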


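[Figure: conceptual schema of system A (ER diagram). Entities: Airport, Flight, Passenger. Relationships: From, To, Reservation, Check-in. Attribute labels: departure time, arrival time, symbol, city, authority, area, safety rules, seat number, checked baggage, flight number, date, available seats, gate, seat map, delay, category, name, phone.]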


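[Figure: conceptual schema of system B (ER diagram). Labels in the diagram: Flight; Reservation; Passenger; identifier; departure; departure time; seat map; available seats; category; name; phone; arrival; arrival time.]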


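[Figure: global schema built after integration (ER diagram). Labels in the diagram: Flight with subsets Flight B and Flight A; flight identifier (flight number); date; (1,3); available seats; seat map; gate; Check-in; Reservation; From; To; Airport; arrival time; arrival airport; departure time; departure airport; departure time; arrival time; seat number; checked baggage; Passenger; category; name; phone.]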


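Data integration

[Figure: layered integration architecture. Data source 1, Data source 2, and Data source 3 each sit behind a wrapper; the wrappers feed a mediator, which serves the user applications.]

• XML

• Ontology

• View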
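The wrapper and mediator layers of the data integration architecture can be illustrated with a minimal Python sketch. Everything here is hypothetical: the class names, the in-memory sources, the attribute mappings, and the flight numbers are invented for illustration, and a real mediator would also decompose queries and reconcile conflicting values.

```python
class Wrapper:
    """Exposes one data source through the global attribute names."""

    def __init__(self, rows, mapping):
        self.rows = rows        # local records, one dict per record
        self.mapping = mapping  # local attribute name -> global attribute name

    def scan(self):
        for row in self.rows:
            # Rename attributes; drop the ones the global schema does not know.
            yield {self.mapping[k]: v for k, v in row.items() if k in self.mapping}


class Mediator:
    """Answers a query over the global schema by asking every wrapper."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, predicate=lambda row: True):
        return [row for w in self.wrappers for row in w.scan() if predicate(row)]


# Two sources that name the same attributes differently (a naming conflict).
source_a = Wrapper(
    [{"number": "CA101", "free": 12}],
    {"number": "flight_number", "free": "available_seats"},
)
source_b = Wrapper(
    [{"identifier": "MU202", "seats_left": 0}],
    {"identifier": "flight_number", "seats_left": "available_seats"},
)
mediator = Mediator([source_a, source_b])
bookable = mediator.query(lambda row: row["available_seats"] > 0)
print(bookable)
```

The user application only sees the global names flight_number and available_seats; each wrapper hides the local naming of its source.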


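Exercise

Consider the following two fragment allocations:
A> R1 at Site1, R2 at Site2, R3 at Site3.
B> R1 and R2 at Site1, R2 and R3 at Site3.

The following applications are also given (all applications have the same frequency):
A1: issued at Site1, reads 5 records of R1 and 5 records of R2
A2: issued at Site3, reads 5 records of R3 and 5 records of R2
A3: issued at Site2, reads 10 records of R2

Questions:

1. If maximizing local processing is the main design goal, which allocation is better?

2. Suppose A3 is changed to update 10 records of R2, and local processing remains the design goal; which allocation is better?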
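One possible way to tabulate the exercise is to count local and remote record accesses for each allocation. The Python sketch below assumes a read can be served by any one copy while an update must reach every copy, and it weights each record access equally; these assumptions, the function name, and the data layout are not given in the exercise.

```python
def access_counts(allocation, applications):
    """Return (local, remote) record accesses for one allocation.

    allocation:   {fragment: set of sites holding a copy}
    applications: list of {"site": ..., "accesses": {fragment: (records, mode)}}
                  with mode "read" or "update"
    """
    local = remote = 0
    for app in applications:
        for fragment, (records, mode) in app["accesses"].items():
            copies = allocation[fragment]
            if mode == "read":
                # One copy is enough; it is local only if the issuing site holds one.
                if app["site"] in copies:
                    local += records
                else:
                    remote += records
            else:
                # Assumed write-all policy: every copy must be updated.
                for site in copies:
                    if site == app["site"]:
                        local += records
                    else:
                        remote += records
    return local, remote


ALLOC_A = {"R1": {1}, "R2": {2}, "R3": {3}}
ALLOC_B = {"R1": {1}, "R2": {1, 3}, "R3": {3}}


def workload(a3_mode):
    return [
        {"site": 1, "accesses": {"R1": (5, "read"), "R2": (5, "read")}},
        {"site": 3, "accesses": {"R3": (5, "read"), "R2": (5, "read")}},
        {"site": 2, "accesses": {"R2": (10, a3_mode)}},
    ]


for mode in ("read", "update"):
    for label, alloc in (("A", ALLOC_A), ("B", ALLOC_B)):
        print(mode, label, access_counts(alloc, workload(mode)))
```

Comparing the remote counts for the two allocations, first with A3 reading and then with A3 updating, gives a starting point for answering both questions.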