Lecture 7: Data Warehousing - Walailak University (mit.wu.ac.th/mit/images/editor/files/l7 -...)

Post on 26-Dec-2019

...Those who are steadfast in truthfulness, who do what they say, will achieve success together with the faith, trust, and praise of people on all sides.

Doing what one has said, that is, speaking truly and acting truly, is therefore an important factor in enhancing a person's honor and making it plain to see...

Excerpt from the royal address of His Majesty the King at the graduation ceremony of Chulalongkorn University

10 July 1997 (B.E. 2540)

Topics

• Data Warehousing Concepts

• Data mart

• Typical Architecture of a DW

• Kimball vs. Inmon in DW building approach

• Dimensionality modeling

Data Warehousing Concepts

• What is a data warehouse? A data warehouse is a collection of integrated databases designed to support a decision support system (DSS).

• According to Inmon’s definition (Inmon, 1992): it is a collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is non-volatile and relevant to some moment in time.

1. Subject-oriented Data

• Organized around major subjects, such as customer, product, sales.

• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

2. Integrated Data

• The data warehouse integrates corporate application-oriented data from different source systems, which often includes data that is inconsistent.

• Data formats do not match because the data come from different sources.

• The integrated data source must be made consistent to present a unified view of the data to the users.
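As a minimal sketch of what "made consistent" can mean in practice, the Python snippet below maps records from three source systems onto one agreed format. The source names, field names, code tables, and date formats are all hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical per-source conventions (assumptions, not taken from any real system).
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}
DATE_FORMATS = {"savings": "%d/%m/%Y", "loans": "%Y-%m-%d", "current": "%m-%d-%Y"}


def integrate(source: str, record: dict) -> dict:
    """Return the record in the single format agreed for the warehouse."""
    opened = datetime.strptime(record["opened"], DATE_FORMATS[source])
    return {
        "customer_id": int(record["customer_id"]),
        "gender": GENDER_CODES[record["gender"].strip().lower()],
        "opened": opened.date().isoformat(),
    }


# The same customer as seen by three application-oriented systems.
rows = [
    integrate("savings", {"customer_id": "7", "gender": "male", "opened": "26/12/2019"}),
    integrate("loans", {"customer_id": "7", "gender": "M", "opened": "2019-12-26"}),
    integrate("current", {"customer_id": "7", "gender": "1", "opened": "12-26-2019"}),
]
```

After integration, all three records carry identical values, so the warehouse can present one unified view of the customer.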

[Figure: the OLTP applications (Savings, Current Accounts, Loans) are integrated into a single Customer subject in the data warehouse.]

Data on a given subject is defined and stored once.

3. Time-variant Data

• Data in the warehouse is only accurate and valid at some point in time or over some time interval.

• Time-variance is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.

Operational database (current value data):

• Time horizon: 60-90 days
• Key may or may not have a time element
• Data can be updated

Data warehouse (snapshot data):

• Time horizon: 5-10 years
• Key contains an element of time
• Once a snapshot is made, the record cannot be updated

4. Non-volatile Data

• Data in the warehouse is not updated in real-time but is refreshed from operational systems on a regular basis.

• New data is always added as a supplement to the database, rather than a replacement.

• Typically, data in the data warehouse is not updated or deleted.

• Operational database: insert, update, delete, or read.

• Warehouse: load and read.
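The load-and-read pattern can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table name, columns, and balances are hypothetical, and a real warehouse would use its own loading tools. Each periodic load appends a new snapshot whose key includes the snapshot date, so earlier rows are never overwritten.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE balance_snapshot (
        account_id    INTEGER NOT NULL,
        snapshot_date TEXT    NOT NULL,   -- time element inside the key
        balance       REAL    NOT NULL,
        PRIMARY KEY (account_id, snapshot_date)
    )
    """
)


def load_snapshot(snapshot_date, balances):
    """Append one snapshot; existing snapshots are left untouched."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO balance_snapshot VALUES (?, ?, ?)",
            [(acc, snapshot_date, bal) for acc, bal in balances.items()],
        )


# Two periodic refreshes from a (hypothetical) operational system.
load_snapshot("2019-11-30", {1: 500.0, 2: 80.0})
load_snapshot("2019-12-31", {1: 650.0, 2: 20.0})

# Reading back gives the history of account 1, not just its current value.
history = conn.execute(
    "SELECT snapshot_date, balance FROM balance_snapshot "
    "WHERE account_id = 1 ORDER BY snapshot_date"
).fetchall()
```

Because the primary key combines the account and the snapshot date, loading the same snapshot twice is rejected rather than silently replacing data.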

Data Warehouse vs. OLTP

Property | OLTP | Data Warehouse
Response time | Sub-seconds to seconds | Seconds to hours
Operations | DML | Primarily read-only
Nature of data | 30-60 days | Snapshots over time
Data organization | Application | Subject, time
Size | Small to large | Large to very large
Data sources | Operational, internal | Operational, internal, external
Activities | Processes | Analysis

Warehouse Environment

• The warehouse environment can contain:
  • Enterprise data warehouse
  • Departmental data warehouses or business unit-specific data marts
  • Personal data marts
  • Application-specific extracts
  • Operational data stores
  • Information catalogues
  • Publish and subscribe systems
  • Metadata repositories

Presentation notes: The warehouse environment is everything inside the box (except ‘external data’).

Problems of DW

• Underestimation of resources for data loading
• Hidden problems with source systems
• Required data not captured
• Increased end-user demands
• Data homogenization
• High demand for resources
• Data ownership
• High maintenance
• Long duration projects
• Complexity of integration

Data Mart

• A subset of a data warehouse that supports the requirements of a particular department or business function.

• Characteristics include:
  • Focuses on only the requirements of one department or business function.
  • Does not normally contain detailed operational data, unlike data warehouses.
  • More easily understood and navigated.

Dependent Data Mart

[Figure: operational systems (operations data, legacy data, external data, flat files) feed the data warehouse, which in turn feeds the data marts (Sales, Finance, Marketing, HR).]

Independent Data Mart

[Figure: operational systems (operations data, legacy data, external data, flat files) feed a Sales or Marketing data mart directly.]

Reasons for Creating a Data Mart

• To give users access to the data they need to analyze most often.

• To provide data in a form that matches the collective view of the data by a group of users in a department or business function area.

• To improve end-user response time due to the reduction in the volume of data to be accessed.

• To provide appropriately structured data as dictated by the requirements of the end-user access tools.

Reasons for Creating a Data Mart (cont)

• Building a data mart is simpler compared with establishing a corporate data warehouse.

• The cost of implementing data marts is normally less than that required to establish a data warehouse.

• The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project.

Typical Architecture of a DW

1. Operational Data Sources

• Mainframe first-generation hierarchical and network databases.

• Departmental proprietary file systems (e.g. VSAM, RMS) and relational DBMSs (e.g. Informix, Oracle).

• Private workstations and servers.

• External systems such as the internet, commercially available databases, or databases associated with an organization’s suppliers or customers.

2. Operational Data Store (ODS)

• A repository of current and integrated operational data used for analysis.

• Often structured and supplied with data in the same way as the data warehouse.

• May act simply as a staging area for data to be moved into the warehouse.

• Often created when legacy operational systems are found to be incapable of achieving reporting requirements.

• Provides users with the ease-of-use of a relational database while remaining distant from the decision support functions of the data warehouse.

3. Load Manager

• Performs all the operations associated with the extraction and loading of data into the warehouse.

• Size and complexity will vary between data warehouses; the load manager may be constructed using a combination of vendor data loading tools and custom-built programs.

4. Warehouse Manager

• Performs all the operations associated with the management of the data in the warehouse.

• Constructed using vendor data management tools and custom-built programs.

• Operations performed include:
  • Analysis of data to ensure consistency.
  • Transformation and merging of source data from temporary storage into data warehouse tables.
  • Creation of indexes and views on base tables.
  • Generation of denormalizations (if necessary).
  • Generation of aggregations (if necessary).
  • Backing up and archiving data.

• In some cases, also generates query profiles to determine which indexes and aggregations are appropriate.

• A query profile can be generated for each user, group of users, or the data warehouse and is based on information that describes the characteristics of the queries such as frequency, target table(s), and size of results set.
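One way to picture a query profile is as a small record that accumulates the characteristics listed above (frequency, target tables, result-set size). The sketch below is an assumption about how such a profile might be kept per user; the class, its fields, and the sample log are hypothetical and do not describe any vendor tool.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class QueryProfile:
    """Running summary of the queries issued by one user or group."""
    frequency: int = 0
    table_hits: Counter = field(default_factory=Counter)
    total_rows: int = 0

    def record(self, tables, rows_returned):
        """Fold one executed query into the profile."""
        self.frequency += 1
        self.table_hits.update(tables)
        self.total_rows += rows_returned

    @property
    def mean_result_size(self):
        return self.total_rows / self.frequency if self.frequency else 0.0


# Hypothetical query log: (user, target tables, rows in the result set).
log = [
    ("ann", ["sales", "branch"], 120),
    ("ann", ["sales"], 80),
    ("bob", ["staff"], 10),
]

profiles = {}
for user, tables, rows_returned in log:
    profiles.setdefault(user, QueryProfile()).record(tables, rows_returned)
```

A table that dominates `table_hits` across many profiles would be a natural candidate for an extra index or a pre-built aggregation.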

5. Query Manager

• Performs all the operations associated with the management of user queries.

• Typically constructed using vendor end-user data access tools, data warehouse monitoring tools, database facilities, and custom-built programs.

• Complexity determined by the facilities provided by the end-user access tools and the database.

• In some cases, the query manager also generates query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.

6. Detailed Data

• Stores all the detailed data in the database schema.

• In most cases, the detailed data is not stored online but aggregated to the next level of detail.

• On a regular basis, detailed data is added to the warehouse to supplement the aggregated data.

7. Lightly and Highly Summarized Data

• Stores all the pre-defined lightly and highly aggregated data generated by the warehouse manager.

• The purpose of summary information is to speed up the performance of queries.

• Removes the requirement to continually perform summary operations (such as sort or group by) in answering user queries.

• The summary data is updated continuously as new data is loaded into the warehouse.
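The idea of lightly and highly summarized data can be sketched with sqlite3 as two pre-computed tables built from the detailed data. The table names, columns, and figures are hypothetical; the point is only that a query against the monthly table no longer needs to group the detail rows itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_detail (sale_date TEXT, branch TEXT, amount REAL);
    INSERT INTO sales_detail VALUES
        ('2019-12-01', 'B001', 100.0),
        ('2019-12-01', 'B001',  50.0),
        ('2019-12-02', 'B002',  70.0),
        ('2020-01-05', 'B001',  30.0);

    -- Lightly summarized: one row per branch per day.
    CREATE TABLE sales_daily AS
        SELECT sale_date, branch, SUM(amount) AS total
        FROM sales_detail
        GROUP BY sale_date, branch;

    -- Highly summarized: one row per branch per month.
    CREATE TABLE sales_monthly AS
        SELECT substr(sale_date, 1, 7) AS month, branch, SUM(total) AS total
        FROM sales_daily
        GROUP BY substr(sale_date, 1, 7), branch;
    """
)

# A user query reads the small summary table directly.
monthly = conn.execute(
    "SELECT month, branch, total FROM sales_monthly ORDER BY month, branch"
).fetchall()
```

In this sketch the summaries are built once; in a running warehouse they would be rebuilt or incrementally updated after each load.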

8. Archive / Backup Data

• Stores detailed and summarized data for the purposes of archiving and backup.

• May be necessary to backup online summary data if this data is kept beyond the retention period for detailed data.

• The data is transferred to storage archives such as magnetic tape or optical disk.

9. Metadata

• The management of metadata within the data warehouse is a very complex task that should not be underestimated.

• Used for a variety of purposes:
  • Extraction and loading processes - metadata is used to map data sources to a common view of information within the warehouse.
  • Warehouse management process - metadata is used to automate the production of summary tables.
  • Query management process - metadata is used to direct a query to the most appropriate data source.
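To make the query management use of metadata concrete, the sketch below routes a query to the smallest table that can still answer it. The catalogue entries, table names, and row counts are invented for illustration; real warehouses keep far richer metadata.

```python
# Hypothetical metadata catalogue: which columns each table can answer for,
# and roughly how many rows it holds.
CATALOGUE = [
    {"table": "sales_monthly", "columns": {"month", "branch"}, "rows": 1_000},
    {"table": "sales_daily", "columns": {"day", "month", "branch"}, "rows": 30_000},
    {
        "table": "sales_detail",
        "columns": {"day", "month", "branch", "customer"},
        "rows": 5_000_000,
    },
]


def route(required_columns):
    """Pick the smallest table whose columns cover the query's needs."""
    needed = set(required_columns)
    candidates = [entry for entry in CATALOGUE if needed <= entry["columns"]]
    if not candidates:
        raise LookupError(f"no table can answer a query on {sorted(needed)}")
    return min(candidates, key=lambda entry: entry["rows"])["table"]
```

For example, `route({"month", "branch"})` selects the monthly summary, while a query that needs `customer` falls through to the detailed data.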

General Metadata Issues

General metadata issues associated with data warehouse use:

• What tables, attributes and keys does the DW contain?
• Where did each set of data come from?
• What transformations were applied with cleansing?
• How have the metadata changed over time?
• How often do the data get reloaded?
• Are there so many data elements that you need to be careful what you ask for?

10. End-user Access Tools

• The principal purpose of data warehousing is to provide information to business users for strategic decision-making.

• These users interact with the warehouse using end-user access tools.

• The data warehouse must efficiently support ad hoc and routine analysis.

• High performance is achieved by pre-planning the requirements for joins, summations, and periodic reports by end-users (where possible).

• There are five main groups of access tools:
  • Data reporting and query tools
  • Application development tools
  • Executive information system (EIS) tools
  • Online analytical processing (OLAP) tools
  • Data mining tools

Data Warehouse Information Flows

• Inflow - Processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.

• Upflow - Processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.

• Downflow - Processes associated with archiving and backing-up/recovery of data in the warehouse.

• Outflow - Processes associated with making the data available to the end-users.

• Metaflow - Processes associated with the management of the metadata.

Typical DW and DM Architecture

DW Tools and Technologies

• Building a data warehouse is a complex task because there is no vendor that provides an ‘end-to-end’ set of tools.

• Necessitates that a data warehouse is built using multiple products from different vendors.

• Ensuring that these products work well together and are fully integrated is a major challenge.

Extraction, Cleansing, and Transformation Tools

• Tasks of capturing data from source systems, cleansing and transforming it, and loading the results into a target system can be carried out either by separate products, or by a single integrated solution.

• Integrated solutions include:
  • Code generators
  • Database data replication tools
  • Dynamic transformation engines

Data Warehouse DBMS Requirements

• Load performance
• Load processing
• Data quality management
• Query performance
• Terabyte (10^12 bytes) scalability
• Mass user scalability
• Networked data warehouse
• Warehouse administration
• Integrated dimensional analysis
• Advanced query functionality

Enterprise vs. Dimensional Data Warehouse

Bill Inmon

• Known as the “father of the data warehouse”.

• He received his Bachelor of Science degree in Mathematics from Yale University, and his Master of Science degree in Computer Science from New Mexico State University.

• Inmon's approach is often characterized as a top-down approach.

http://www.zentut.com/data-warehouse/bill-inmon-data-warehouse/

Enterprise data warehouse

• The enterprise data warehouse is the central element of Inmon’s data warehouse architecture.

• It is an integrated repository of atomic data.

• Data in the enterprise data warehouse is captured at the lowest level of detail. It is stored in a relational database and uses a third normal form (3NF) design.

Ralph Kimball

• His books on dimensional design techniques have become the all-time best sellers in data warehousing: "The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses" (Wiley, 1996), "The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses" (Wiley, 1998), and "The Data Webhouse Toolkit: Building the Web-Enabled Data Warehouse" (Wiley, 2000).

Dimensional data warehouse

• The dimensional data warehouse contains enterprise data in a highly granular format.

• While Bill Inmon’s data warehouse architecture uses ER modeling, the dimensional data warehouse is designed using dimensional modeling.

• This means the dimensional data warehouse consists of star schemas or cubes. Analytic systems and reporting tools can access data from the dimensional data warehouse directly.

Kimball vs. Inmon in DW building approach

• Bill Inmon recommends building the data warehouse with a top-down approach.

• In Inmon’s philosophy, one starts by building a big centralized enterprise data warehouse in which all available data from transaction systems are consolidated into a subject-oriented, integrated, time-variant and non-volatile collection of data that supports decision making. Data marts are then built for the analytic needs of departments.

• In contrast to Bill Inmon’s approach, Ralph Kimball recommends building the data warehouse with a bottom-up approach.

• In Kimball’s philosophy, one starts with mission-critical data marts that serve the analytic needs of departments. These data marts are then integrated for data consistency through a so-called information bus.

• Kimball makes use of the dimensional model to address the needs of departments in various areas within the enterprise.

http://www.zentut.com/data-warehouse/kimball-and-inmon-data-warehouse-architectures/

Dimensionality modeling

• A logical design technique that aims to present the data in a standard, intuitive form that allows for high-performance access

• Uses the concepts of Entity-Relationship modeling with some important restrictions.

• Every dimensional model (DM) is composed of one table with a composite primary key, called the fact table, and a set of smaller tables called dimension tables.

• Each dimension table has a simple (non-composite) primary key that corresponds exactly to one of the components of the composite key in the fact table.

• Forms ‘star-like’ structure, which is called a star schema or star join.

• All natural keys are replaced with surrogate keys. This means that every join between fact and dimension tables is based on surrogate keys, not natural keys.

• Surrogate keys allow the data in the warehouse to have some independence from the data used and produced by the OLTP systems.
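A small sketch of surrogate key assignment follows. It assumes a simple in-memory mapping from (source system, natural key) to a warehouse-owned integer; the class name, source names, and branch numbers are hypothetical.

```python
from itertools import count


class SurrogateKeyMap:
    """Hands out warehouse-owned integer keys for source-system natural keys."""

    def __init__(self):
        self._next = count(1)
        self._keys = {}

    def key_for(self, source, natural_key):
        """Return the existing surrogate key, or allocate the next one."""
        ident = (source, natural_key)
        if ident not in self._keys:
            self._keys[ident] = next(self._next)
        return self._keys[ident]


branch_keys = SurrogateKeyMap()
k1 = branch_keys.key_for("oltp_sales", "B003")  # first time seen
k2 = branch_keys.key_for("oltp_sales", "B007")
k3 = branch_keys.key_for("oltp_sales", "B003")  # same branch again
k4 = branch_keys.key_for("legacy", "B003")      # same code, different system
```

Here the same natural key always resolves to the same surrogate key, while an identical code arriving from a different source system gets its own key, which is one way the warehouse stays independent of OLTP key conventions.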

Star schema for property sales of DreamHome

• Star schema is a logical structure that has a fact table containing factual data in the center, surrounded by dimension tables containing reference data, which can be denormalized.

• Facts are generated by events that occurred in the past, and are unlikely to change, regardless of how they are analyzed.

• Bulk of data in data warehouse is in fact tables, which can be extremely large.

• Important to treat fact data as read-only reference data that will not change over time.

• The most useful fact tables contain one or more numerical measures, or ‘facts’, that occur for each record and are additive.

• Dimension tables usually contain descriptive textual information.

• Dimension attributes are used as the constraints in data warehouse queries.

• Star schemas can be used to speed up query performance by denormalizing reference information into a single dimension table.
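The following sqlite3 sketch shows a tiny star schema in the spirit of the property sales example. The table names, columns, and data are assumptions made for illustration and are not the DreamHome schema from the slides. The fact table has a composite key made of dimension keys, and the query constrains on dimension attributes while summing an additive fact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: simple surrogate primary keys, descriptive attributes.
    CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY, branch_no TEXT, city TEXT);
    CREATE TABLE dim_time   (time_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

    -- Fact table: composite key of dimension keys plus a numeric, additive measure.
    CREATE TABLE fact_property_sale (
        branch_key    INTEGER REFERENCES dim_branch(branch_key),
        time_key      INTEGER REFERENCES dim_time(time_key),
        property_key  INTEGER,
        selling_price REAL,
        PRIMARY KEY (branch_key, time_key, property_key)
    );

    INSERT INTO dim_branch VALUES (1, 'B003', 'Glasgow'), (2, 'B005', 'London');
    INSERT INTO dim_time   VALUES (1, 2019, 3), (2, 2019, 4);
    INSERT INTO fact_property_sale VALUES
        (1, 1, 101, 120000.0),
        (1, 2, 102,  95000.0),
        (1, 2, 104,  80000.0),
        (2, 2, 103, 310000.0);
    """
)

# Dimension attributes (year, quarter) act as the query constraints.
rows = conn.execute(
    """
    SELECT b.city, SUM(f.selling_price)
    FROM fact_property_sale AS f
    JOIN dim_branch AS b ON b.branch_key = f.branch_key
    JOIN dim_time   AS t ON t.time_key   = f.time_key
    WHERE t.year = 2019 AND t.quarter = 4
    GROUP BY b.city
    ORDER BY b.city
    """
).fetchall()
```

Because the city is stored directly in the branch dimension (denormalized), the query needs only one join per dimension.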

• Snowflake schema is a variant of the star schema where dimension tables do not contain denormalized data.

• Starflake schema is a hybrid structure that contains a mixture of star (denormalized) and snowflake (normalized) schemas. Allows dimensions to be present in both forms to cater for different query requirements.

Property sales with normalized version of Branch dimension table
