data ware hose fundamentals

Upload: anilkumar-mannem

Post on 06-Jan-2016

31 views

Category:

Documents


0 download

DESCRIPTION

Data werehouse concepts

TRANSCRIPT

10

DATA WARE HOSE FUNDAMENTALS:1. What is Data warehouse?1)It is a database which stores the historical data 2) We use this data for query & analysis purpose based on the analysis results we will make the business decisions 3) We call DWH as decision supporting system4) The data that is available in DWH is read only data2. Why do we use DWH?a)A Data Warehouse DeliversEnhanced Business IntelligenceBy providing data from various sources, managers and executives will no longer need to make business decisions based on limited data or their gut. In addition, data warehouses and related BI can be applied directly to business processes including marketing segmentation, inventory management, financial management, and sales. b) A Data Warehouse Saves TimeSince business users can quickly access critical data from a number of sourcesall in one placethey can rapidly make informed decisions on key initiatives. They wont waste precious time retrieving data from multiple sources.Not only that but the business execs can query the data themselves with little or no support from ITsaving more time and more money. That means the business users wont have to wait until IT gets around to generating the reports, and those hardworking folks in IT can do what they do bestkeep the business running.c) A Data Warehouse Enhances Data Quality and ConsistencyA data warehouseimplementation includes the conversion of data fromnumerous source systems into a common format. Since each data from the various departments is standardized, each department will produce results that are in line with all the other departments. So you can have more confidence in the accuracy of your data. And accurate data is the basis for strong business decisions.d) A Data Warehouse Provides Historical IntelligenceA data warehouse stores large amounts of historical data so you can analyze different time periods and trends in order to make future predictions. Such data typically cannot be stored in a transactional database or used to generate reports from a transactional system.e) A Data Warehouse Generates a High ROIROI retun oninvestmentFinally, the piece de resistancereturn on investment. Companies that have implemented data warehouses and complementary BI systems havegenerated more revenueand saved more money than companies that havent invested in BI systems and data warehouses.3.What is the Difference between Ralph Kimball and WH Inmon Approach?Ralph KimballWH Inmon

1)As per ralph Kimball approach first we define the data marts and we should define data warehouse.1)As per WH Inmon approach first we define data warehouse and we should define data marts

2) In this approach data marts are called independent data marts2) In this approach data marts are called dependent data marts

3) This approach we called bottom up approach3) this approach we called as top down approach

4. What are Characteristics of DWH?1)Time variant: All the data in data warehouse identified with particular time period. The data of DWH must be historicalEx: Day, Week, Month etc.2) Non-volatile: Data in DWH are never over written or deleted once committed. The data is static. It is used for read only data we cant change. This data used for future reporting.

3) Subject oriented: In this we collect the data from different subject area. The collected data must be business oriented.Ex: Sales, accounts, HR etc.

4) Integrated: The DWH contains data gathered from all organizations and merged into a coherent. 5. What is the difference between ODS and DWHODS(Operational Data Store)DWH

1) It is design to support the operational data monitoring.1) It is design to support the decision making process.

2) Data is volatile2) Data is non-volatile

3) It contains current data3) It contains historical data

4) It contains detailed data4) It contains summary data

5) It contains less volume of data5) It contains high volume of data

6. What is the difference between Data Mart and DWH?Data MartData Warehouse

1) Data mart is usually sponsored at the department level and developed with a specific issue or subject in mind, a data mart is a data warehouse with a focused objective.1) Data warehouse is a Subject-Oriented, Integrated, Time-Variant, Non-volatile collection of data in support of decision making.

2) A data mart is used on a business division/ department level.2) A data warehouse is used on an enterprise level

3) A Data Mart is a subset of data from a Data Warehouse. Data Marts are built for specific user groups.3) A Data Warehouse is simply an integrated consolidation of data from a variety of sources that is specially designed to support strategic and tactical decision making.

4) By providing decision makers with only a subset of data from the Data Warehouse, Privacy, Performance and Clarity Objectives can be attained.4) The main objective of Data Warehouse is to provide an integrated environment and coherent picture of the business at a point in time.

7. What are dimensions and Facts? Give the example?Dimensions:It is structural or characteristic attribute. A dimension is essentially an entry point for getting at the facts. Dimensions are things of interests to the business Ex: Customer, Product, Store, Country, State etc.Facts: A business performance measurement tropically numeric and additive. It is stored in the fact table. Facts are also called KPI (key performance indicator)OrA fact is a numeric value that a business wishes to count or sumEx: Store key, Customer Key, Orders, The number of products sold, the value of products sold, the products produced8. What are different types of dimensions?1) Slowly changed dimension: It will change with respect to time.Ex: Customer address etc.It is three types1) Type 1: In case of type 1 we will not maintain any history. We will maintain only current record by updating existing record. Technically type1 is insert else update.In our example, recall we originally have the following table: Customer KeyNameState

1001ChristinaIllinois

After Christina moved from Illinois to California, the new information replaces the new record, and we have the following table: Customer KeyNameState

1001ChristinaCalifornia

Advantages: - This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information. Disadvantages: All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before. Usage: About 50% of the time.When to use Type 1: Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes. 2) Type2: In case of type2 we will maintain historical data. Whenever there is a change in address will insert new address in address table and end dated the existing record. To maintain the versioning in SCD type2 we use a) start date, end date b) flag c) Address key or version numberIn our example, recall we originally have the following table: Customer KeyNameState

1001ChristinaIllinois

After Christina moved from Illinois to California, we add the new information as a new row into the table: Customer KeyNameState

1001ChristinaIllinois

1005ChristinaCalifornia

Advantages: - This allows us to accurately keep all historical information. Disadvantages: - This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern. - This necessarily complicates the ETL process. Usage: About 50% of the time. When to use Type 2: Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes. 3) Type3: In case of scd type3 we will maintain the current record and previous record. It means one customer can have maximum two records.In our example, recall we originally have the following table: Customer KeyNameState

1001ChristinaIllinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns: Customer Key Name Original State Current State Effective Date After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003): Customer KeyNameOriginal StateCurrent StateEffective Date

1001ChristinaIllinoisCalifornia15-JAN-2003

Advantages: - This does not increase the size of the table, since new information is updated. - This allows us to keep some part of history. Disadvantages: Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost. Usage: Type 3 is rarely used in actual practice. When to use Type 3: Type III slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur for a finite number of time.9. What is Confirmed dimension? Give example?A Dimension which is used two or more fact tables is called conformed dimension.Ex: Time, Product, Country etc.10. What is Degenerated Dimension? Give example?By nature it is a dimension but it will be located in fact table. Degenerated dimension contains only a key and no attributes.Ex: Bill number, Receipt number etc.11. What is Junk Dimension? Give example?A junk dimension is collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension.Ex: Yes or No, Male or Female etc.12. What is Dirty Dimension? Give example?If a record occurs more than one time in a table by the difference of a non-key attribute then such dimension is called dirty dimension.For example A "dirty dimension" is one in which data quality cannot be guaranteed. For example, in most banks, account-oriented source applications contain data about the same customer multiple times. Many banks attempt to derive a "customer" by matching names and addresses across account applications, but this process results in more than one entry for each bank customer. Similarly, different attributes must be held for each of a bank's heterogeneous products. Attributes that are meaningful for a loan, such as term, credit risk assessment, and collateral, have no meaning for savings, checking, or investment products.13. What is Role play dimension?Dimension will pay multiple roles in the fact table like orders, sales, shipment date, Invoice date etc.Ex: Time dimension etc.14. What is lately arrived Dimension?In data warehousing some-times we receive fact data first and later we receive dimension data this kind of dimension is called lately arrived dimension.15. What is rapidly changing dimension?It is dimension that changes rapidly with respect to time period.Ex: share price etc.16. What are different types of slowly changing dimensions?Refer Q8.17. What is the difference between type1, type2 and type3?Refer Q818. How do we maintain the versioning in type2 dimension?Refer Q8.19. What are different types of Facts? Give Examples?These are three types1) Additive : These are facts on which we can perform arthematic operation like addition on the fact tableEx: If we want find yearly sales we can sum quarter sales2) Semi additive : These are fact we can perform sum operation for some of the dimension not to othersEx: In bank customer account we can find out monthly balance using credits and debits happen in the particular month. We cant find monthly balances using daily transactions.3) Non-additive: These are facts we cant be summed up for any dimension present in the fact tableEx: Temperature, Ratios etc.20. What is fact less fact table? Give the example?Fact less fact table captures many to many relation between dimensions but contains with-out any measures Common examples of fact less fact tables include: Identifying product promotion events (to determine promoted products that didnt sell) Tracking student attendance or registration events Tracking insurance-related accident events

21. What are steps involved in dimensional modelling?Dimensional modelling involved in three steps:Conceptual Data ModelFeatures of conceptual data model include: Includes the important entities and the relationships among them. No attribute is specified. No primary key is specified. At this level, the data modeller attempts to identify the highest-level relationships among the different entities. Logical Data ModelFeatures of logical data model include: Includes all entities and relationships among them. All attributes for each entity are specified. The primary key for each entity specified. Foreign keys (keys identifying the relationship between different entities) are specified. Normalization occurs at this level. At this level, the data modeller attempts to describe the data in as much detail as possible, without regard to how they will be physically implemented in the database. Physical Data ModelFeatures of physical data model include: Specification all tables and columns. Foreign keys are used to identify relationships between tables. Demoralization may occur based on user requirements. komPhysical considerations may cause the physical data model to be quite different from the logical data model. this level, the data modeller will specify how the logical data model will be realized in the database schema.

22. What is the difference between logical and physical modelling?Refer Q21.23. What is difference between start schema and snowflake?Star SchemaSnow Flake Schema

The star schema is the simplest data warehouse scheme.Snowflake schema is a more complex data warehouse model than a star schema.

In star schema each of the dimensions is represented in a single table .It should not have any hierarc.lhies between dims.In snow flake schema at least one hierarchy should exists between dimension tables.

It contains a fact table surrounded by dimension tables. If the dimensions are de-normalized, we say it is a star schema design.It contains a fact table surrounded by dimension tables. If a dimension is normalized, we say it is a snow flaked design.

In star schema only one join establishes the relationship between the fact table and any one of the dimension tables. So query performance is goodIn snow flake schema since there is relationship between the dimensions tables it has to do many joins to fetch the data. So query performance is poor

A star schema optimizes the performance by keeping queries simple and providing fast response time. All the information about the each level is stored in one row.Snowflake schemas normalize dimensions to eliminated redundancy. The result is more complex queries and reduced query performance.

It is called a star schema because the diagram resembles a star.It is called a snowflake schema because the diagram resembles a snowflake.

It occupies more space compare snowflake schema because it is DE normalized Snowflake schema occupies less space when compare to star schema because it is normalized

24. Why query performance is better in Start Schema?In star schema there will be less number of dimension tables so less number of joins so query performance is good.25. Which design Start Schema or snow flake is normalized one?Snow flake design is normalized because snow flake design split one dimension table into multiple dimensions to achieve the dimension.26. What is the difference between the full load and incremental load?Full loadIncremental load

First time in the data warehouse we load all the data from source into DWH is called full loadCapture the changed data to load DWH we use incremental load

Full load we prefer load data first timeSecond time onwards we perform incremental load in DWH

By performing the incremental load we are trying to load less number of data to improve performance

27. How can we do the incremental load from source?Capture the changed data in source then we load changed data to DWH.28. What is surrogate key? Give an example where did you use it in your project?It is the system generated sequence number used for maintaining the unique ness or primary keyIn case of SCD Type2 to maintain the versioning number we use surrogate key .29. What is Staging area? Why do we use staging area?If target and0 source databases are different and target table volume is high it contains some millions of records in this scenario without staging table we need to design your Informatica using look up to find out whether the record exists or not in the target table since target has huge volumes so its costly to create cache it will hit the performance.

If we create staging tables in the target database we can simply do outer join in the source qualifier to determine insert/update this approach will give you good performance.

It will avoid full table scan to determine insert/updates on target.And also we can create index on staging tables since these tables were designed for specific application it will not impact to any other schemas/users.

While processing flat files to data warehousing we can perform cleansing. Data cleansing, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate. During data cleansing, records are checked for accuracy and consistency.

Since it is one-to-one mapping from ODS to staging we do truncate and reload. We can create indexes in the staging state, to perform our source qualifier best. If we have the staging area no need to relay on the informatics transformation to known whether the record exists or not.30. What is data cleaning?Weeding out unnecessary or unwanted things (characters and spaces etc.) from incoming data to make it more meaningful and informative31. What is the difference between business key and Surrogate key?Business KeySurrogate key

A business key or natural key is an index which identifies unique ness of a row based on column that exists naturally in a table according to business rulesIt is the system generated sequence number used for maintaining the unique ness or primary key

Business are a good way of avoiding duplicate records so in practice I typically add business key as an alternative key in a dimensionIt is alternative of business key. Surrogate key are most efficient then business key so I would always recommended joining a fact table to a dimension table using surrogate key

I want to manage the patient information using business key will add the patient_id and patient_code this is different for each hospitalIn case of SCD Type2 to maintain the versioning number we use surrogate key.

32. What is normalization? Why do we normalize the data?1) Spiting of one table into multiple table is called normalization.2) It is used to eliminate the data redundancy.3) It also used to eliminate the duplicates 33. What are aggregate tables?Aggregation is a summarization of base-level fact table records Aggregations need to account for the additive nature of the measures, created on-the-fly or by pre-aggregationCommon aggregation scenariosa. Category product aggs by store by dayb. District store aggs by product by dayc. Monthly sales aggs by product by stored. Category product aggs by store district by dayCommon aggregations = Sum, Count, Distinct Count, Max, Min, Average, etc.

34. What is granularity?The granularity is the lowest level of information stored in the fact table. The depth of data level is known as granularity. In date dimension the level could be year, month, quarter, period, week, day of granularity.

The process consists of the following two steps:- Determining the dimensions that are to be included- Determining the location to place the hierarchy of each dimension of information

35. What is the difference between view and materialized view?ViewMaterialized View

View is a logical objectIt is physical object

It will not store any dataIt will store summary or aggregated data

It is used for security purpose and projectionIt is used for reporting purpose

We can perform DML operation on simple viewsIn this we cant perform DML operations

View will fetch the data from base table it will run the base queryIt will have less number of records compare to base tables performance will be improved in the reports

There is no refresh clause in the viewsWe can refresh the materialized views with refresh clause

If we create view on single table we can call as simple view and we create view on more tables it is called as complex view

36. Explain a situation where snowflake is better than start schema?Snowflake is occupies less amount of space when compare to star schema.It is Normalized schema37. What are different architectures of data warehouse?38. What is the size of your DWH?3 or 4 or 5 TBs39. What type of indexes normally used in DWH?In data warehouse normally we use B-tree and Bit Map indexesB-Tree:1) It will create on the columns with highest cardinality columns2) It is default index created by oracle database.Bit Map:1) It will create on lowest cardinality columns.40. What is loading order of dimension and facts?My understanding says that the dimensions should be extract first and fact should be extracted. That way the foreign keys still be honoured in the staging area.41. What happens if we try to load the fact tables before loading the dimension tables?If we try to load the fact tables before loading the dimension tables data will be rejected.42. What is a Schema?Graphical Representation of the data structure is called as schema.