Data Warehousing Basics
TRANSCRIPT
Data Warehouse
-Yoga Kathirvelu
OBJECTIVES
What is a Data Warehouse?
Data Warehouse Architecture
Data Warehouse Design Considerations
Data Warehouse Terminologies
Extraction – Transformation – Loading
Mining, Business Intelligence & Reporting
Data, Data Everywhere, yet… What is Data?
I can’t find the data that I need
Data scattered all across the network
Data stored in disparate formats
I can’t understand the data that I see
How to interpret
Need someone to translate
I can’t use the data that I get
Different rules implemented across
Missing or inconsistent data
I don’t get the data when it matters
Data comes in very late
Data collection is very time consuming
What the Users Want
Data should be integrated across the enterprise
Data reporting should be uniform irrespective of how the data is stored
Data should be available when we want it
Summary data has real value to the organization
Historical data holds the key to understanding data over time
Can we clean, merge and enrich the data?
Enter Data Warehouse…..
Data Warehouse
A single, complete and consistent store of data, obtained from a variety of different sources, made available to end users in a format that they can understand and use in a business context.
Data Warehousing as a Process
A technique for assembling and managing data from various sources in order to answer business questions and make decisions that were previously not possible
Creating a decision support database maintained separately from the organization's operational database
Goals of a Data Warehouse
It must make an organization's information more accessible
It must make the organization's information consistent
It must be adaptive and resilient to change
It must be a defender of the organization's data
It must serve as a foundation for improved decision making
OLTP Systems vs Data Warehouse

OLTP Systems | Data Warehouse
Application oriented | Subject oriented
Used to run the business | Used to analyze the business
Detailed data | Summarized & refined data
Current, up-to-date data | Snapshot data
Repetitive access | Ad-hoc access
Clerical user | Knowledge user (manager)
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
OLTP Systems vs Data Warehouse

OLTP Systems | Data Warehouse
No data redundancy | Redundancy present
Database size: 100 MB – 100 GB | Database size: 100 GB – a few terabytes
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Read/update access | Mostly read (updates through batch loads)
Performance sensitive | Performance relaxed
Data Warehousing Architecture
Source Systems
OLTP Systems
Range from flat files to RDBMS
Maintain little or no history
Data Pull or Data Push
Data Extraction Window
Extraction – Transformation – Loading
Extraction
Capture of data from Source Systems
Important to decide the frequency of Extraction
Merging
Bringing data together from different operational sources
Choosing information from each functional system to populate the single occurrence of the data item in the warehouse
Conditioning
The conversion of data types from the source to the target data store (warehouse) – always a relational database
E.g. OLTP date stored as text (DDMMYY); DW format is the Oracle Date type
Scrubbing
Ensuring all data meets the input validation rules that should have been in place when the data was captured by the operational system
E.g. the Customer's Country should have been entered in the Country field but was entered in one of the address fields
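The conditioning and scrubbing steps above can be sketched as follows; the field names and the country list are hypothetical, and Python's date parsing stands in for the DW type conversion:

```python
from datetime import datetime

# Conditioning: the OLTP source stores dates as DDMMYY text; the
# warehouse wants a real date type. Python's %y maps 69-99 to 1969-1999
# and 00-68 to 2000-2068.
def condition_date(ddmmyy: str):
    return datetime.strptime(ddmmyy, "%d%m%y").date()

# Scrubbing: the country sometimes ends up in an address field; if the
# Country field is empty, move a recognized country name over.
KNOWN_COUNTRIES = {"India", "US", "UK"}  # illustrative list only

def scrub_country(record: dict) -> dict:
    if not record.get("country"):
        for field in ("address1", "address2", "address3"):
            if record.get(field) in KNOWN_COUNTRIES:
                record["country"] = record.pop(field)
                break
    return record

print(condition_date("250399"))  # 1999-03-25
```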
Enrichment
Bringing data from external sources to augment/enrich operational data
E.g. currency conversion rates brought in from external sources
Validating
The process of ensuring that the data captured is accurate and the transformation process is correct
E.g. the Date of Birth of a Customer should not be later than today's date
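A minimal sketch of enrichment and validation; the conversion rates are hard-coded for illustration (a real feed would supply them), and the field names are my own:

```python
from datetime import date

# Enrichment: attach a conversion rate obtained from an external source.
RATES_TO_USD = {"INR": 0.012, "EUR": 1.08}  # illustrative values only

def enrich(txn: dict) -> dict:
    txn["amount_usd"] = round(txn["amount"] * RATES_TO_USD[txn["currency"]], 2)
    return txn

# Validation: a customer's date of birth must not lie in the future.
def valid_dob(dob: date, today: date) -> bool:
    return dob <= today

print(enrich({"amount": 1000, "currency": "INR"})["amount_usd"])  # 12.0
print(valid_dob(date(2999, 1, 1), date(2024, 1, 1)))              # False
```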
Loading
Loading the Extracted and Transformed data into the Staging Area or the Data Warehouse
A first-time bulk load gets the historical data into the Data Warehouse
Periodic incremental loads bring in modified data
The loading window should be as small as possible
Should be coupled with a strong Error Management process to capture failures or rejections in the loading process
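The loading step with simple error management can be sketched like this; SQLite stands in for the warehouse database, and the table and column names are hypothetical:

```python
import sqlite3

# Rows failing a mandatory check go to a rejects list instead of
# aborting the whole load.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (client_key INTEGER, amount REAL)")

incoming = [(1, 100.0), (2, None), (3, 250.5)]   # hypothetical extract
rejects = []
for client_key, amount in incoming:
    if amount is None:                            # amount is mandatory
        rejects.append(((client_key, amount), "missing amount"))
        continue
    conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (client_key, amount))
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(loaded, len(rejects))  # 2 1
```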
ETL Process – Issues & Challenges
Consumes 70-80% of project time
Heterogeneous source systems
Little or no control over source systems
Scattered source systems working in different time zones with different currencies
Different measurement units
Data not captured by OLTP systems
Data Quality
Incremental Load vs Complete Refresh
Complete refresh is required when the data is being
loaded into the DW for the first time
Subsequent to that, DW should be refreshed with
incremental loads
Complete refresh or Full Load is too disruptive and not
required if updates since last load can be identified
easily
Some master data might require only a one-time load into the DW
When to Refresh?
Periodically (e.g., every night, every week) or after significant events
On every update: not warranted unless DW users require current data (up-to-the-minute stock quotes)
Refresh policy set by administrator based on user
needs and traffic
Different strategies might be required for different
sources
Staging Area
An intermediate area between the Operational Source Systems and the data presentation area
Analogous to the kitchen of a restaurant
Accessible only to skilled personnel; no user access
The structure is closer to the Operational Systems than to the DW
Data arriving at different points of time is merged and then loaded into the DW
Usually does not maintain history; only a temporary area
Data Warehouse Design
The design of the DW must directly reflect the way managers look at the business
Should capture the important measurements along with the parameters by which these measurements are viewed
It must facilitate data analysis
The methodology on which the DW is designed is called Dimensional Modeling (different from ER Modeling)
ER Modeling
The ER Model views the components as Entities & Relationships
Entities: principal data objects about which information is collected
Relationship: an association between two or more entities
Attributes: smaller pieces of information within an entity
Dimensional Modeling
Represents data in a standard framework
The framework is easily understandable by the end-users
Contains the same information as the ER Model
Facilitates data retrieval and analysis
Entities are called Facts and Dimensions
A generic representation of a dimensional model in which a fact table is joined to a number of dimensions is called a Star Schema
Star Schema
Fact Table
The primary table in a dimensional model, where the numeric performance measurements of the business are stored
The most useful facts are numeric and additive
Each measurement is taken at the intersection of all the dimensions
Tends to be deep in terms of number of rows but narrow in terms of number of columns
Has a Composite Primary Key consisting of the Foreign Keys of all referred Dimensions
Dimension Table
Contains textual descriptors of the business
Fewer rows but more columns
Linked to the Fact table through a key called the Surrogate Key
Dimension attributes serve as the primary source of query constraints, groupings and report labels
Minimize the use of codes by replacing them with verbose text
A concatenated piece of text serving as a code should be broken into its constituent pieces of information
Contains hierarchical information
Data is stored in a denormalized form
Dimension Table
Client Dimension:

CLIENT KEY | CLIENT ID | CLIENT NAME | CLIENT GROUP CODE | CLIENT GROUP NAME | CLIENT AREA
1 | 100 | ABC LTD. | 1234 | XYZ LTD. | A1
2 | 200 | DEF LTD. | 6789 | RST LTD. | A1
3 | 300 | GHI LTD. | 1234 | XYZ LTD. | A2

(CLIENT KEY is the Surrogate Key; CLIENT ID is the Natural Key)

Client Fact:

CLIENT KEY | DEBTOR KEY | TIME KEY | CURRENCY KEY | AMOUNT INVESTED | AMOUNT EARNED
1 | 5 | 1 | 100 | 10,000 | 3,000
2 | 6 | 1 | 100 | 20,000 | 7,000
3 | 5 | 1 | 100 | 15,000 | 6,000

Potential Queries:
Total Amount Earned by Client Group XYZ Ltd.
Total Amount Earned by Clients in Area A1
Total Amount Invested by Client ABC Ltd.
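As an illustration, the Client dimension and fact rows above can be loaded into SQLite to answer two of the potential queries; SQLite stands in for the warehouse database and the table and column names are my own:

```python
import sqlite3

# Build the star schema from the slide: one dimension, one fact.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client_dim (client_key INTEGER PRIMARY KEY, client_id INTEGER,
    client_name TEXT, group_code INTEGER, group_name TEXT, client_area TEXT);
CREATE TABLE client_fact (client_key INTEGER, debtor_key INTEGER,
    time_key INTEGER, currency_key INTEGER,
    amount_invested INTEGER, amount_earned INTEGER);
""")
conn.executemany("INSERT INTO client_dim VALUES (?,?,?,?,?,?)", [
    (1, 100, "ABC LTD.", 1234, "XYZ LTD.", "A1"),
    (2, 200, "DEF LTD.", 6789, "RST LTD.", "A1"),
    (3, 300, "GHI LTD.", 1234, "XYZ LTD.", "A2"),
])
conn.executemany("INSERT INTO client_fact VALUES (?,?,?,?,?,?)", [
    (1, 5, 1, 100, 10000, 3000),
    (2, 6, 1, 100, 20000, 7000),
    (3, 5, 1, 100, 15000, 6000),
])

# Total Amount Earned by Clients in Area A1: dimension attributes
# constrain, fact measures aggregate.
total_a1 = conn.execute("""
    SELECT SUM(f.amount_earned) FROM client_fact f
    JOIN client_dim d ON d.client_key = f.client_key
    WHERE d.client_area = 'A1'
""").fetchone()[0]

# Total Amount Earned by Client Group XYZ Ltd.
group_total = conn.execute("""
    SELECT SUM(f.amount_earned) FROM client_fact f
    JOIN client_dim d ON d.client_key = f.client_key
    WHERE d.group_name = 'XYZ LTD.'
""").fetchone()[0]

print(total_a1, group_total)  # 10000 9000
```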
Surrogate Key
Integers that are assigned sequentially as needed to populate a dimension
Serve to join the Dimension to the Fact table
Better to use a Surrogate Key instead of the Natural Key:
They buffer the DW environment from operational changes
Operational codes or Natural Keys might get reassigned in the Operational Systems
The granularity of the dimension might be different from that of the Natural Key
Natural Keys might not be unique across the business
Better for performance; Natural Keys might be bulky alphanumeric strings
There might not be a Natural Key available in the source system
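A minimal sketch of surrogate key assignment during a dimension load; the natural key values are made up:

```python
# Sequential integers replace the operational (natural) key the first
# time a dimension member is seen; repeats reuse the existing key.
surrogate_map = {}   # natural key -> surrogate key
next_key = 1

def get_surrogate(natural_key):
    global next_key
    if natural_key not in surrogate_map:
        surrogate_map[natural_key] = next_key
        next_key += 1
    return surrogate_map[natural_key]

print(get_surrogate("CUST-00A7"))  # 1
print(get_surrogate("CUST-00B2"))  # 2
print(get_surrogate("CUST-00A7"))  # 1  (same member, same key)
```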
Data Marts
A Data Mart is a collection of subject areas organized for decision support based on the needs of a given department
Finance will have their own Data Mart, Marketing their own, etc.
Each set of users has their own interpretation of what their Data Mart should look like
The database design of a Data Mart is built around a star-join structure that is optimal for the specific set of users
A Data Mart generally contains aggregated or summarized data, whereas the DW contains more granular data
Types of Data Marts
Dependent Data Mart
A Data Mart whose source is the Data Warehouse
All dependent Data Marts are loaded from the same source – the Data Warehouse
Independent Data Mart
A Data Mart whose source is the legacy application environment
Each independent Data Mart is fed uniquely and separately by the individual source systems
Dimensions Revisited
Until now we have assumed Dimensions to be independent of time
While dimension attributes are relatively static, they are not fixed forever
Business users might want to track the impact of each and every attribute change
We can preserve the independent dimensional structure with only relatively minor adjustments
These nearly constant dimensions are called Slowly Changing Dimensions (SCDs)
There are 3 basic techniques for maintaining SCDs
SCD – Type 1
The new information simply overwrites the original information; no history is maintained

Before Change:
Client Master Key | Client Name | Client Country
1000 | Nunn Mozhi | India

After Change:
Client Master Key | Client Name | Client Country
1000 | Nunn Mozhi | US
SCD – Type 1
Advantages
Easiest technique in terms of implementation
Disadvantages
All history will be lost
Usage
About 50% of the time
When to use
When it is not necessary for the DW to maintain history
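A minimal Type 1 sketch, using a plain dictionary in place of the dimension table (names follow the slide's example):

```python
# SCD Type 1: the new value simply overwrites the old one.
client_dim = {1000: {"name": "Nunn Mozhi", "country": "India"}}

def scd_type1(key, **changes):
    client_dim[key].update(changes)   # the prior value is lost

scd_type1(1000, country="US")
print(client_dim[1000]["country"])  # US
```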
SCD – Type 2
A new record is added to the dimension to represent the new information
The new record gets its own Primary Key

Before Change:
Client Master Key | Client Name | Client Country | Latest Record
1000 | Nunn Mozhi | India | Y

After Change:
Client Master Key | Client Name | Client Country | Latest Record
1000 | Nunn Mozhi | India | N
1001 | Nunn Mozhi | US | Y
SCD – Type 2
Advantages
Allows us to accurately store history
Disadvantages
This will cause the table size to grow fast
Storage and Performance might become a concern
Usage
About 50% of the time
When to use
When it is necessary for the DW to maintain history
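A minimal Type 2 sketch, using a list of dictionaries in place of the dimension table; the key-assignment rule is simplified for illustration:

```python
# SCD Type 2: a change inserts a new row with a fresh surrogate key;
# the old row is flagged as no longer current.
rows = [{"key": 1000, "name": "Nunn Mozhi", "country": "India", "latest": "Y"}]

def scd_type2(name, **changes):
    for row in rows:
        if row["name"] == name and row["latest"] == "Y":
            row["latest"] = "N"
            new_row = {**row, **changes,
                       "key": max(r["key"] for r in rows) + 1, "latest": "Y"}
            rows.append(new_row)
            return new_row

scd_type2("Nunn Mozhi", country="US")
print(len(rows))  # 2: full history preserved
```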
SCD – Type 3
There will be 2 columns for the attribute of interest: one holding the original value and one holding the current value

Before Change:
Client Master Key | Client Name | Original Client Country | Current Client Country | Effective Date
1000 | Nunn Mozhi | India | | 12-Jan-2004

After Change:
Client Master Key | Client Name | Original Client Country | Current Client Country | Effective Date
1000 | Nunn Mozhi | India | US | 13-Apr-2004
SCD – Type 3
Advantages
Does not increase the table size drastically
Allows us to keep some part of history
Disadvantages
Will not be able to keep all history when the value of the attribute changes more than once
Usage
Very rarely used
When to use
When the number of attribute changes is small and known in advance
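A minimal Type 3 sketch, with the original/current column pair from the example (field names are my own):

```python
from datetime import date

# SCD Type 3: two columns hold the original and current values of the
# tracked attribute, so only one prior value survives.
row = {"key": 1000, "name": "Nunn Mozhi",
       "orig_country": "India", "curr_country": None,
       "effective": date(2004, 1, 12)}

def scd_type3(row, new_country, when):
    row["curr_country"] = new_country   # a second change would overwrite this
    row["effective"] = when
    return row

scd_type3(row, "US", date(2004, 4, 13))
print(row["orig_country"], row["curr_country"])  # India US
```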
Types of Dimensions
Conformed Dimension
A single Dimension referring to more than one Fact
An exact copy of the same Dimension used in more than one Data Mart
When one Dimension is created as a subset of another existing Dimension
(Diagram: a single CLIENT DIMENSION shared by a TRANSACTION FACT and a DAILY SUMMARY FACT)
Types of Dimensions
Junk Dimension
A convenient grouping of typically low-cardinality flags and indicators
Can be used to handle infrequently populated, open-ended comment fields sometimes attached to a Fact row
(Diagram: an OUTCOME DIMENSION with values N/A, ACCEPTED, DECLINED joined to a MARKETING FACT)
Types of Dimensions
Degenerate Dimension
A Dimension Key, such as Transaction Number, that has no attributes and hence does not join to an actual dimension table
(Diagram: a TRANSACTION FACT with CLIENT MASTER KEY, TIME KEY, CURRENCY KEY, TRANSACTION ID, AMOUNT and LAST EXTRACTION DATE; TRANSACTION ID is the Degenerate Dimension)
Types of Facts
Factless Fact
A Fact table that has no facts but captures a certain many-to-many relationship between the dimension keys
(Diagram: a FACT TABLE with CLIENT KEY, DEBTOR KEY and a COUNT column that is always = 1, joining a CLIENT DIMENSION and a DEBTOR DIMENSION)
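A factless fact can be sketched with SQLite (table and key names are hypothetical); summing the always-1 COUNT column answers relationship-counting questions:

```python
import sqlite3

# The table records only which client-debtor pairs exist; the cnt
# column is always 1 so that SUM(cnt) counts relationships.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE client_debtor_fact "
             "(client_key INTEGER, debtor_key INTEGER, cnt INTEGER)")
conn.executemany("INSERT INTO client_debtor_fact VALUES (?, ?, 1)",
                 [(1, 5), (2, 6), (3, 5)])

# How many clients are linked to debtor 5?
n = conn.execute(
    "SELECT SUM(cnt) FROM client_debtor_fact WHERE debtor_key = 5"
).fetchone()[0]
print(n)  # 2
```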
Dimension Normalization – Snowflaking
Removing the redundant information from the Dimension and placing it in a separate Dimension
These two Dimensions are joined by a key called the Snowflake Key
The aim is to reduce the total amount of storage needed for a dimension
When to Snowflake:
Very large dimensions
Some attributes not common to all the records
Dimension Normalization – Snowflaking
Advantages
Reduces disk space usage
Easy to maintain
Disadvantages
Presentation layer becomes complicated
Data retrieval time increases
Might not save much disk space, considering that Dimensions take relatively little space while Facts take most of it
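A snowflaked version of the Client dimension from the earlier example might look like this (SQLite, with made-up table names): the repeated group columns move to their own table, joined back by the group code acting as the snowflake key.

```python
import sqlite3

# Snowflaking: client_group holds the formerly repeated group columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client_group (group_code INTEGER PRIMARY KEY, group_name TEXT);
CREATE TABLE client_dim (client_key INTEGER PRIMARY KEY, client_name TEXT,
    group_code INTEGER REFERENCES client_group);
""")
conn.executemany("INSERT INTO client_group VALUES (?, ?)",
                 [(1234, "XYZ LTD."), (6789, "RST LTD.")])
conn.executemany("INSERT INTO client_dim VALUES (?, ?, ?)",
                 [(1, "ABC LTD.", 1234), (2, "DEF LTD.", 6789),
                  (3, "GHI LTD.", 1234)])

# Queries now need an extra join to reach the group attributes.
name = conn.execute("""
    SELECT g.group_name FROM client_dim d
    JOIN client_group g ON g.group_code = d.group_code
    WHERE d.client_name = 'GHI LTD.'
""").fetchone()[0]
print(name)  # XYZ LTD.
```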
Data Mining
A relatively new data analysis technique
Very different from query and reporting
You do not ask a particular question of the data; instead, specific algorithms analyze the data and report what they discover
Not done by normal end users; done by specialists
Used for: Statistical Analysis, Knowledge Discovery
Business Intelligence
BI is the leveraging of the Data Warehouse to help make business decisions and recommendations
Information and data rules engines are leveraged to help make these decisions, along with statistical analysis tools and data mining tools
An expensive and very specialized set of activities
Not performed by the end users; done by specialists
Tools and Technology
ETL Tools: Informatica, DataStage, Talend, Pentaho Kettle
Reporting Tools: Business Objects, Cognos, MicroStrategy
Modeling Tools: Erwin, DB Designer
Queries?