Data Warehousing Basics
TRANSCRIPT
Data Warehouse
-Yoga Kathirvelu
OBJECTIVES
What is a Data Warehouse?
Data Warehouse Architecture
Data Warehouse Design Considerations
Data Warehouse Terminologies
Extraction – Transformation – Loading
Mining, Business Intelligence & Reporting
Data, Data Everywhere, yet… What is Data?
I can’t find the data that I need
Data scattered all across the network
Data stored in disparate formats
I can’t understand the data that I see
How to interpret
Need someone to translate
I can’t use the data that I get
Different rules implemented across
Missing or inconsistent data
I don’t get the data when it matters
Data comes in very late
Data collection is very time consuming
What the Users Want
Data should be integrated across the enterprise
Data reporting should be uniform irrespective of how the data is stored
Data should be available when we want it
Summary data has real value to the organization
Historical data holds the key to understanding data over time
Can we clean, merge and enrich the data?
Enter Data Warehouse…..
Data Warehouse
A single, complete and consistent store of data, obtained from a variety of different sources, made available to end users in a format that they can understand and use in a business context.
Data Warehousing as a Process
A technique for assembling and managing data from various sources in order to answer business questions and make decisions that were previously not possible
Creating a decision support database maintained separately from the organization's operational database
Goals of a Data Warehouse
It must make an organization's information more accessible
It must make the organization's information consistent
It must be adaptive and resilient to change
It must be a defender of the organization's data
It must serve as a foundation for improved decision making
OLTP Systems vs Data Warehouse

OLTP Systems | Data Warehouse
Application oriented | Subject oriented
Used to run the business | Used to analyze the business
Detailed data | Summarized & refined data
Current, up-to-date data | Snapshot data
Repetitive access | Ad-hoc access
Clerical user | Knowledge user (manager)
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
OLTP Systems vs Data Warehouse

OLTP Systems | Data Warehouse
No data redundancy | Redundancy present
Database size: 100 MB – 100 GB | Database size: 100 GB – a few terabytes
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Read/update access | Mostly read (updates through batch loads)
Performance sensitive | Performance relaxed
Data Warehousing Architecture
Source Systems
OLTP Systems
Range from flat files to RDBMS
Maintain little or no history
Data Pull or Data Push
Data Extraction Window
Extraction – Transformation – Loading
Extraction
Capture of data from Source Systems
Important to decide the frequency of Extraction
Merging
Bringing data together from different operational sources
Choosing information from each functional system to populate the single occurrence of the data item in the warehouse
Conditioning
The conversion of data types from the source to the target data store (warehouse) – always a relational database
E.g. OLTP date stored as text (DDMMYY); DW format is the Oracle Date type
Scrubbing
Ensuring all data meets the input validation rules that should have been in place when the data was captured by the operational system
E.g. the Customer's Country should have been entered in the Country field but was entered in one of the address fields
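The conditioning and scrubbing steps above can be sketched as follows; the field names and the country list are hypothetical, and Python's date parsing stands in for the DW type conversion:

```python
from datetime import datetime

# Conditioning: the OLTP source stores dates as DDMMYY text; the
# warehouse wants a real date type. Python's %y maps 69-99 to 1969-1999
# and 00-68 to 2000-2068.
def condition_date(ddmmyy: str):
    return datetime.strptime(ddmmyy, "%d%m%y").date()

# Scrubbing: the country sometimes ends up in an address field; if the
# Country field is empty, move a recognized country name over.
KNOWN_COUNTRIES = {"India", "US", "UK"}  # illustrative list only

def scrub_country(record: dict) -> dict:
    if not record.get("country"):
        for field in ("address1", "address2", "address3"):
            if record.get(field) in KNOWN_COUNTRIES:
                record["country"] = record.pop(field)
                break
    return record

print(condition_date("250399"))  # 1999-03-25
```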
Enrichment
Bringing data from external sources to augment/enrich operational data
E.g. currency conversion rates brought in from external sources
Validating
The process of ensuring that the data captured is accurate and the transformation process is correct
E.g. the Date of Birth of a Customer should not be later than today's date
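A minimal sketch of enrichment and validation; the conversion rates are hard-coded for illustration (a real feed would supply them), and the field names are my own:

```python
from datetime import date

# Enrichment: attach a conversion rate obtained from an external source.
RATES_TO_USD = {"INR": 0.012, "EUR": 1.08}  # illustrative values only

def enrich(txn: dict) -> dict:
    txn["amount_usd"] = round(txn["amount"] * RATES_TO_USD[txn["currency"]], 2)
    return txn

# Validation: a customer's date of birth must not lie in the future.
def valid_dob(dob: date, today: date) -> bool:
    return dob <= today

print(enrich({"amount": 1000, "currency": "INR"})["amount_usd"])  # 12.0
print(valid_dob(date(2999, 1, 1), date(2024, 1, 1)))              # False
```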
Loading
Loading the Extracted and Transformed data into the Staging Area or the Data Warehouse
A first-time bulk load gets the historical data into the Data Warehouse
Periodic incremental loads bring in modified data
The loading window should be as small as possible
Should be coupled with a strong Error Management process to capture failures or rejections in the loading process
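The loading step with simple error management can be sketched like this; SQLite stands in for the warehouse database, and the table and column names are hypothetical:

```python
import sqlite3

# Rows failing a mandatory check go to a rejects list instead of
# aborting the whole load.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (client_key INTEGER, amount REAL)")

incoming = [(1, 100.0), (2, None), (3, 250.5)]   # hypothetical extract
rejects = []
for client_key, amount in incoming:
    if amount is None:                            # amount is mandatory
        rejects.append(((client_key, amount), "missing amount"))
        continue
    conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (client_key, amount))
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(loaded, len(rejects))  # 2 1
```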
ETL Process – Issues & Challenges
Consumes 70-80% of project time
Heterogeneous source systems
Little or no control over source systems
Scattered source systems working in different time zones with different currencies
Different measurement units
Data not captured by OLTP systems
Data Quality
Incremental Load vs Complete Refresh
Complete refresh is required when the data is being
loaded into the DW for the first time
Subsequent to that, DW should be refreshed with
incremental loads
Complete refresh or Full Load is too disruptive and not
required if updates since last load can be identified
easily
Some master data might require only a one-time load into the DW
When to Refresh?
Periodically (e.g., every night, every week) or after significant events
On every update: not warranted unless DW users require current data (up-to-the-minute stock quotes)
Refresh policy set by administrator based on user
needs and traffic
Different strategies might be required for different
sources
Staging Area
An intermediate area between the Operational Source Systems and the data presentation area
Analogous to the kitchen of a restaurant
Accessible only to skilled personnel; no user access
The structure is closer to the Operational Systems than to the DW
Data arriving at different points of time is merged and then loaded into the DW
Usually does not maintain history; only a temporary area
Data Warehouse Design
The design of the DW must directly reflect the way managers look at the business
Should capture the important measurements along with the parameters by which these measurements are viewed
It must facilitate data analysis
The methodology on which the DW is designed is called Dimensional Modeling (different from ER Modeling)
ER Modeling
The ER Model views the components as Entities & Relationships
Entities: principal data objects about which information is collected
Relationship: an association between two or more entities
Attributes: smaller pieces of information within an entity
Dimensional Modeling
Represents data in a standard framework
The framework is easily understandable by the end-users
Contains the same information as the ER Model
Facilitates data retrieval and analysis
Entities are called Facts and Dimensions
A generic representation of a dimensional model in which a fact table is joined to a number of dimensions is called a Star Schema
Star Schema
Fact Table
The primary table in a dimensional model, where the numeric performance measurements of the business are stored
The most useful facts are numeric and additive
Each measurement is taken at the intersection of all the dimensions
Tends to be deep in terms of number of rows but narrow in terms of number of columns
Has a Composite Primary Key consisting of the Foreign Keys of all referred Dimensions
Dimension Table
Contains textual descriptors of the business
Fewer rows but more columns
Linked to the Fact table through a key called the Surrogate Key
Dimension attributes serve as the primary source of query constraints, groupings and report labels
Minimize the use of codes by replacing them with verbose text
A concatenated piece of text serving as a code should be broken into its constituent pieces of information
Contains hierarchical information
Data is stored in a denormalized form
Dimension Table
Client Dimension:

CLIENT KEY | CLIENT ID | CLIENT NAME | CLIENT GROUP CODE | CLIENT GROUP NAME | CLIENT AREA
1 | 100 | ABC LTD. | 1234 | XYZ LTD. | A1
2 | 200 | DEF LTD. | 6789 | RST LTD. | A1
3 | 300 | GHI LTD. | 1234 | XYZ LTD. | A2

(CLIENT KEY is the Surrogate Key; CLIENT ID is the Natural Key)

Client Fact:

CLIENT KEY | DEBTOR KEY | TIME KEY | CURRENCY KEY | AMOUNT INVESTED | AMOUNT EARNED
1 | 5 | 1 | 100 | 10,000 | 3,000
2 | 6 | 1 | 100 | 20,000 | 7,000
3 | 5 | 1 | 100 | 15,000 | 6,000

Potential Queries:
Total Amount Earned by Client Group XYZ Ltd.
Total Amount Earned by Clients in Area A1
Total Amount Invested by Client ABC Ltd.
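As an illustration, the Client dimension and fact rows above can be loaded into SQLite to answer two of the potential queries; SQLite stands in for the warehouse database and the table and column names are my own:

```python
import sqlite3

# Build the star schema from the slide: one dimension, one fact.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client_dim (client_key INTEGER PRIMARY KEY, client_id INTEGER,
    client_name TEXT, group_code INTEGER, group_name TEXT, client_area TEXT);
CREATE TABLE client_fact (client_key INTEGER, debtor_key INTEGER,
    time_key INTEGER, currency_key INTEGER,
    amount_invested INTEGER, amount_earned INTEGER);
""")
conn.executemany("INSERT INTO client_dim VALUES (?,?,?,?,?,?)", [
    (1, 100, "ABC LTD.", 1234, "XYZ LTD.", "A1"),
    (2, 200, "DEF LTD.", 6789, "RST LTD.", "A1"),
    (3, 300, "GHI LTD.", 1234, "XYZ LTD.", "A2"),
])
conn.executemany("INSERT INTO client_fact VALUES (?,?,?,?,?,?)", [
    (1, 5, 1, 100, 10000, 3000),
    (2, 6, 1, 100, 20000, 7000),
    (3, 5, 1, 100, 15000, 6000),
])

# Total Amount Earned by Clients in Area A1: dimension attributes
# constrain, fact measures aggregate.
total_a1 = conn.execute("""
    SELECT SUM(f.amount_earned) FROM client_fact f
    JOIN client_dim d ON d.client_key = f.client_key
    WHERE d.client_area = 'A1'
""").fetchone()[0]

# Total Amount Earned by Client Group XYZ Ltd.
group_total = conn.execute("""
    SELECT SUM(f.amount_earned) FROM client_fact f
    JOIN client_dim d ON d.client_key = f.client_key
    WHERE d.group_name = 'XYZ LTD.'
""").fetchone()[0]

print(total_a1, group_total)  # 10000 9000
```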
Surrogate Key
Integers that are assigned sequentially as needed to populate a dimension
Serve to join the Dimension to the Fact table
Better to use a Surrogate Key instead of the Natural Key:
They buffer the DW environment from operational changes
Operational codes or Natural Keys might get reassigned in the Operational Systems
The granularity of the dimension might be different from that of the Natural Key
Natural Keys might not be unique across the business
Better for performance; Natural Keys might be bulky alphanumeric strings
There might not be a Natural Key available in the source system
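A minimal sketch of surrogate key assignment during a dimension load; the natural key values are made up:

```python
# Sequential integers replace the operational (natural) key the first
# time a dimension member is seen; repeats reuse the existing key.
surrogate_map = {}   # natural key -> surrogate key
next_key = 1

def get_surrogate(natural_key):
    global next_key
    if natural_key not in surrogate_map:
        surrogate_map[natural_key] = next_key
        next_key += 1
    return surrogate_map[natural_key]

print(get_surrogate("CUST-00A7"))  # 1
print(get_surrogate("CUST-00B2"))  # 2
print(get_surrogate("CUST-00A7"))  # 1  (same member, same key)
```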
Data Marts
A Data Mart is a collection of subject areas organized for decision support based on the needs of a given department
Finance will have their own Data Mart, Marketing their own, etc.
Each set of users has their own interpretation of what their Data Mart should look like
The database design of a Data Mart is built around a star-join structure that is optimal for the specific set of users
A Data Mart generally contains aggregated or summarized data, whereas the DW contains more granular data
Types of Data Marts
Dependent Data Mart
A Data Mart whose source is the Data Warehouse
All dependent Data Marts are loaded from the same source – the Data Warehouse
Independent Data Mart
A Data Mart whose source is the legacy application environment
Each independent Data Mart is fed uniquely and separately by the individual source systems
Dimensions Revisited
Until now we have assumed Dimensions to be independent of time
While dimension attributes are relatively static, they are not fixed forever
Business users might want to track the impact of each and every attribute change
We can preserve the independent dimensional structure with only relatively minor adjustments
These nearly constant dimensions are called Slowly Changing Dimensions (SCDs)
There are 3 basic techniques for maintaining SCDs
SCD – Type 1
The new information simply overwrites the original information; no history is maintained

Before Change:
Client Master Key | Client Name | Client Country
1000 | Nunn Mozhi | India

After Change:
Client Master Key | Client Name | Client Country
1000 | Nunn Mozhi | US
SCD – Type 1
Advantages
Easiest technique in terms of implementation
Disadvantages
All history will be lost
Usage
About 50% of the time
When to use
When it is not necessary for the DW to maintain history
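A minimal Type 1 sketch, using a plain dictionary in place of the dimension table (names follow the slide's example):

```python
# SCD Type 1: the new value simply overwrites the old one.
client_dim = {1000: {"name": "Nunn Mozhi", "country": "India"}}

def scd_type1(key, **changes):
    client_dim[key].update(changes)   # the prior value is lost

scd_type1(1000, country="US")
print(client_dim[1000]["country"])  # US
```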
SCD – Type 2
A new record is added to the dimension to represent the new information
The new record gets its own Primary Key

Before Change:
Client Master Key | Client Name | Client Country | Latest Record
1000 | Nunn Mozhi | India | Y

After Change:
Client Master Key | Client Name | Client Country | Latest Record
1000 | Nunn Mozhi | India | N
1001 | Nunn Mozhi | US | Y
SCD – Type 2
Advantages
Allows us to accurately store history
Disadvantages
This will cause the table size to grow fast
Storage and Performance might become a concern
Usage
About 50% of the time
When to use
When it is necessary for the DW to maintain history
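A minimal Type 2 sketch, using a list of dictionaries in place of the dimension table; the key-assignment rule is simplified for illustration:

```python
# SCD Type 2: a change inserts a new row with a fresh surrogate key;
# the old row is flagged as no longer current.
rows = [{"key": 1000, "name": "Nunn Mozhi", "country": "India", "latest": "Y"}]

def scd_type2(name, **changes):
    for row in rows:
        if row["name"] == name and row["latest"] == "Y":
            row["latest"] = "N"
            new_row = {**row, **changes,
                       "key": max(r["key"] for r in rows) + 1, "latest": "Y"}
            rows.append(new_row)
            return new_row

scd_type2("Nunn Mozhi", country="US")
print(len(rows))  # 2: full history preserved
```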
SCD – Type 3
There will be 2 columns for the attribute of interest: one holding the original value and one holding the current value

Before Change:
Client Master Key | Client Name | Original Client Country | Current Client Country | Effective Date
1000 | Nunn Mozhi | India | | 12-Jan-2004

After Change:
Client Master Key | Client Name | Original Client Country | Current Client Country | Effective Date
1000 | Nunn Mozhi | India | US | 13-Apr-2004
SCD – Type 3
Advantages
Does not increase the table size drastically
Allows us to keep some part of history
Disadvantages
Will not be able to keep all history when the value of the attribute changes more than once
Usage
Very rarely used
When to use
When the number of attribute changes is small and known in advance
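A minimal Type 3 sketch, with the original/current column pair from the example (field names are my own):

```python
from datetime import date

# SCD Type 3: two columns hold the original and current values of the
# tracked attribute, so only one prior value survives.
row = {"key": 1000, "name": "Nunn Mozhi",
       "orig_country": "India", "curr_country": None,
       "effective": date(2004, 1, 12)}

def scd_type3(row, new_country, when):
    row["curr_country"] = new_country   # a second change would overwrite this
    row["effective"] = when
    return row

scd_type3(row, "US", date(2004, 4, 13))
print(row["orig_country"], row["curr_country"])  # India US
```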
Types of Dimensions
Conformed Dimension
A single Dimension referring to more than one Fact
An exact copy of the same Dimension used in more than one Data Mart
When one Dimension is created as a subset of another existing Dimension
(Diagram: a single CLIENT DIMENSION shared by a TRANSACTION FACT and a DAILY SUMMARY FACT)
Types of Dimensions
Junk Dimension
A convenient grouping of typically low-cardinality flags and indicators
Can be used to handle infrequently populated, open-ended comment fields sometimes attached to a Fact row
(Diagram: an OUTCOME DIMENSION with values N/A, ACCEPTED, DECLINED joined to a MARKETING FACT)
Types of Dimensions
Degenerate Dimension
A Dimension Key, such as Transaction Number, that has no attributes and hence does not join to an actual dimension table
(Diagram: a TRANSACTION FACT with CLIENT MASTER KEY, TIME KEY, CURRENCY KEY, TRANSACTION ID, AMOUNT and LAST EXTRACTION DATE; TRANSACTION ID is the Degenerate Dimension)
Types of Facts
Factless Fact
A Fact table that has no facts but captures a certain many-to-many relationship between the dimension keys
(Diagram: a FACT TABLE with CLIENT KEY, DEBTOR KEY and a COUNT column that is always = 1, joining a CLIENT DIMENSION and a DEBTOR DIMENSION)
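A factless fact can be sketched with SQLite (table and key names are hypothetical); summing the always-1 COUNT column answers relationship-counting questions:

```python
import sqlite3

# The table records only which client-debtor pairs exist; the cnt
# column is always 1 so that SUM(cnt) counts relationships.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE client_debtor_fact "
             "(client_key INTEGER, debtor_key INTEGER, cnt INTEGER)")
conn.executemany("INSERT INTO client_debtor_fact VALUES (?, ?, 1)",
                 [(1, 5), (2, 6), (3, 5)])

# How many clients are linked to debtor 5?
n = conn.execute(
    "SELECT SUM(cnt) FROM client_debtor_fact WHERE debtor_key = 5"
).fetchone()[0]
print(n)  # 2
```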
Dimension Normalization – Snowflaking
Removing the redundant information from the Dimension and placing it in a separate Dimension
These two Dimensions are joined by a key called the Snowflake Key
The aim is to reduce the total amount of storage needed for a dimension
When to Snowflake:
Very large dimensions
Some attributes not common to all the records
Dimension Normalization – Snowflaking
Advantages
Reduces disk space usage
Easy to maintain
Disadvantages
Presentation layer becomes complicated
Data retrieval time increases
Might not save much disk space, considering that Dimensions take relatively little space while Facts take most of it
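A snowflaked version of the Client dimension from the earlier example might look like this (SQLite, with made-up table names): the repeated group columns move to their own table, joined back by the group code acting as the snowflake key.

```python
import sqlite3

# Snowflaking: client_group holds the formerly repeated group columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client_group (group_code INTEGER PRIMARY KEY, group_name TEXT);
CREATE TABLE client_dim (client_key INTEGER PRIMARY KEY, client_name TEXT,
    group_code INTEGER REFERENCES client_group);
""")
conn.executemany("INSERT INTO client_group VALUES (?, ?)",
                 [(1234, "XYZ LTD."), (6789, "RST LTD.")])
conn.executemany("INSERT INTO client_dim VALUES (?, ?, ?)",
                 [(1, "ABC LTD.", 1234), (2, "DEF LTD.", 6789),
                  (3, "GHI LTD.", 1234)])

# Queries now need an extra join to reach the group attributes.
name = conn.execute("""
    SELECT g.group_name FROM client_dim d
    JOIN client_group g ON g.group_code = d.group_code
    WHERE d.client_name = 'GHI LTD.'
""").fetchone()[0]
print(name)  # XYZ LTD.
```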
Data Mining
A relatively new data analysis technique
Very different from query and reporting
You do not ask a particular question of the data; instead, specific algorithms analyze the data and report what they discover
Not done by normal end users; done by specialists
Used for: Statistical Analysis, Knowledge Discovery
Business Intelligence
BI is the leveraging of the Data Warehouse to help make business decisions and recommendations
Information and data rules engines are leveraged to help make these decisions, along with statistical analysis tools and data mining tools
An expensive and very specialized set of activities
Not performed by the end users; done by specialists
Tools and Technology
ETL Tools: Informatica, DataStage, Talend, Pentaho Kettle
Reporting Tools: Business Objects, Cognos, MicroStrategy
Modeling Tools: Erwin, DB Designer
Queries?