2 data, data everywhere yet... we can’t find the data we need data is scattered over the network...

33

Upload: haylie-hazley

Post on 15-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get
Page 2: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

22

Data, Data everywhereData, Data everywhereyet ...yet ...

We can’t find the data we needWe can’t find the data we need data is scattered over the networkdata is scattered over the network

We can’t get the data we needWe can’t get the data we need need an expert to get the data

We can’t understand the data We can’t understand the data we foundwe found available data is poorly

documented

We can’t use the data we foundWe can’t use the data we found data needs to be transformed from

one form to other

Page 3: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

What is Data Warehouse?What is Data Warehouse?

Definition by InmonDefinition by Inmon ““A data warehouse is a A data warehouse is a subject-orientedsubject-oriented,, integrated, time-variantintegrated, time-variant, ,

and and non-volatilenon-volatile collection of data in support of management’s collection of data in support of management’s decision-making process”decision-making process”

Page 4: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Data Warehouse—Subject-OrientedData Warehouse—Subject-Oriented

Organized around major subjects, such as Organized around major subjects, such as customer, customer,

product, salesproduct, sales

Page 5: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Data Warehouse—IntegratedData Warehouse—Integrated

Constructed by integrating multiple, Constructed by integrating multiple, heterogeneous data sourcesheterogeneous data sources relational databases, flat files, on-line transaction relational databases, flat files, on-line transaction

recordsrecords

Data cleaning and data integration techniques Data cleaning and data integration techniques are appliedare applied Ensure consistency in naming conventions, attribute Ensure consistency in naming conventions, attribute

measures, etc. among different data sourcesmeasures, etc. among different data sources When data is moved to the warehouse, it is When data is moved to the warehouse, it is

converted converted

Page 6: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Data Warehouse—Time VariantData Warehouse—Time Variant

The time horizon for the data warehouse is significantly The time horizon for the data warehouse is significantly

longer than that of operational systemslonger than that of operational systems Operational database: current value dataOperational database: current value data

Data warehouse data: provide information from a historical Data warehouse data: provide information from a historical

perspective (e.g., past 5-10 years)perspective (e.g., past 5-10 years)

Page 7: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Data Warehouse—Non-VolatileData Warehouse—Non-Volatile

Operational Operational update of data does not occurupdate of data does not occur in in

the data warehouse environmentthe data warehouse environment Requires only two operations in data accessing: Requires only two operations in data accessing:

initial loading of datainitial loading of data and and access of dataaccess of data

Page 8: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Data Warehouse vs. Operational DBMSData Warehouse vs. Operational DBMS

OLTP (On-Line Transaction Processing)OLTP (On-Line Transaction Processing) Major task of traditional relational DBMSMajor task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking, Day-to-day operations: purchasing, inventory, banking,

manufacturing, payroll, registration, accounting, etc.manufacturing, payroll, registration, accounting, etc.

OLAP (On-Line Analytical Processing)OLAP (On-Line Analytical Processing) Major task of data warehouse systemMajor task of data warehouse system Data analysis and decision makingData analysis and decision making

Page 9: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

From Tables and Spreadsheets to From Tables and Spreadsheets to Data CubesData Cubes

A data warehouse is based on A data warehouse is based on multidimensional data model which views data in the form of a data multidimensional data model which views data in the form of a data

cubecube

A data cube allows data to be modeled and viewed in multiple A data cube allows data to be modeled and viewed in multiple dimensions (dimensions (such as salessuch as sales))

Dimension tables, such as Dimension tables, such as item (item_name, brand, type), or time(day, item (item_name, brand, type), or time(day, week, month, quarter, year)week, month, quarter, year)

Fact table contains measures (such as Fact table contains measures (such as dollars_solddollars_sold) and keys to each ) and keys to each of the related dimension tablesof the related dimension tables

Page 10: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Conceptual Modeling of Data WarehousesConceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measuresModeling data warehouses: dimensions & measures

Star schemaStar schema A fact table in the middle connected to a set of dimension tablesA fact table in the middle connected to a set of dimension tables

Snowflake schemaSnowflake schema A refinement of star schema where some dimensional hierarchy is A refinement of star schema where some dimensional hierarchy is

normalized into a set of smaller dimension tables, forming a normalized into a set of smaller dimension tables, forming a

shape similar to snowflakeshape similar to snowflake

Fact constellationsFact constellations Multiple fact tables share dimension tables, viewed as a collection Multiple fact tables share dimension tables, viewed as a collection

of stars, therefore called of stars, therefore called galaxy schemagalaxy schema or fact constellation or fact constellation

Page 11: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Avg_sales

Euros_sold

Unit_sold

Location_keybranch_keybranch_namebranch_type

Example of Star SchemaExample of Star Schema

time_keydayday_of_the_weekmonthquarteryear

location_keystreetcityprovince_or_streetcountry

Measures

item_keyitem_namebrandtypesupplier_type

Branch_key

Branch

Time

Item

Location

Sales Fact Table

Item_key

Time_key

Page 12: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

branch_keybranch_namebranch_type

Example of Snowflake SchemaExample of Snowflake Schema

time_keydayday_of_the_weekmonthquarteryear

Measures

item_keyitem_namebrandtypesupplier_key

Branch

Time

Item

location_keystreetcity_key

Location

Sales Fact Table

Avg_sales

Euros_sold

Unit_sold

Location_key

Branch_key

Item_key

Time_key

supplier_keysupplier_type

city_keycityprovince_or_streetcountry

City

Supplier

Page 13: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

branch_keybranch_namebranch_type

Example of Fact ConstellationExample of Fact Constellation

time_keydayday_of_the_weekmonthquarteryear

Measures

Branch

Time

item_keyitem_namebrandtypesupplier_key

Item

location_keystreetcityProvince/streetcountry

Location

Sales Fact Table

Avg_sales

Euros_sold

Unit_sold

Location_key

Branch_key

Item_key

Time_key

shipper_keyshipper_namelocation_keyshipper_type

shipper

unit_shipped

Euros_sold

to_location

from_location

shipper_key

Item_key

Time_key

Shipping Fact Table

Page 14: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

A Sample Data CubeA Sample Data CubeTotal annual salesof TV in IrelandDate

Produ

ct

Cou

ntr

ysum

sum TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

Ireland

France

Germany

sum

Page 15: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Typical OLAP OperationsTypical OLAP Operations

Roll up (drill-up): Roll up (drill-up): summarize datasummarize data by climbing up hierarchy or by dimension reductionby climbing up hierarchy or by dimension reduction

Drill down (roll down): Drill down (roll down): reverse of roll-upreverse of roll-up from higher level summary to lower level summary or detailed data, or from higher level summary to lower level summary or detailed data, or

introducing new dimensionsintroducing new dimensions

Slice and diceSlice and dice project and selectproject and select

Pivot (rotate) Pivot (rotate) reorient the cube, visualization, 3D to series of 2D planes.reorient the cube, visualization, 3D to series of 2D planes.

Page 16: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

1616

Data Warehouse ArchitectureData Warehouse Architecture

Data Warehouse Engine

Optimized Loader

ExtractionCleansing

AnalyzeQuery

Metadata Repository

RelationalDatabases

LegacyData

Purchased Data

ERPSystems

Page 17: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Data Warehouse ArchitectureData Warehouse ArchitectureData Extraction - Data Extraction involves gathering the data from multiple heterogeneous sources.

Data Cleaning - Data Cleaning involves finding and correcting the errors in data.

Data Transformation - Data Transformation involves converting data from legacy format to warehouse format.

Data Loading - Data Loading involves sorting, summarizing, consolidating, checking integrity and building indices and partitions.

Refreshing - Refreshing involves updating from data sources to warehouse.

Page 18: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Data Warehouse ModelsData Warehouse Models

Enterprise warehouseEnterprise warehouse collects all of the information about subjects spanning the entire collects all of the information about subjects spanning the entire

organizationorganization

Data MartData Mart a subset of corporate-wide data that is of value to a specific groups a subset of corporate-wide data that is of value to a specific groups

of users. Its scope is confined to specific, selected groups, such as of users. Its scope is confined to specific, selected groups, such as marketing data martmarketing data mart

Page 19: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Introduction to Data Introduction to Data MiningMining

Page 20: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

What Motivated Data Mining?What Motivated Data Mining?

We are drowning in data, but starving for We are drowning in data, but starving for knowledge!knowledge!

Page 21: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

2121

What Is Data Mining?What Is Data Mining?

Data mining (knowledge discovery from data) Data mining (knowledge discovery from data) Extraction of interesting (Extraction of interesting (implicit, previously unknown and implicit, previously unknown and

potentially useful) patterns or knowledge from huge amount of potentially useful) patterns or knowledge from huge amount of datadata

Alternative namesAlternative names Knowledge discovery (mining) in databases (KDD), knowledge Knowledge discovery (mining) in databases (KDD), knowledge

extraction, data/pattern analysis, data archeology, data extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.dredging, information harvesting, business intelligence, etc.

Page 22: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

2222

Why Data Mining?—Potential Why Data Mining?—Potential ApplicationsApplications

Data analysis and decision supportData analysis and decision support Market analysis and managementMarket analysis and management

Target marketing, customer relationship management (CRM), Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentationmarket basket analysis, cross selling, market segmentation

Risk analysis and managementRisk analysis and managementForecasting, customer retention, quality control, competitive Forecasting, customer retention, quality control, competitive

analysisanalysis Fraud detection and detection of unusual patterns (outliers)Fraud detection and detection of unusual patterns (outliers)

Page 23: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

2323

Integration of Multiple Integration of Multiple TechnologiesTechnologies

MachineLearning

DatabaseManagement

ArtificialIntelligence

Statistics

DataMining

VisualizationAlgorithms

Page 24: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

2424

What Can Data Mining Do?What Can Data Mining Do?

ClusterClusterClassifyClassify Categorical, RegressionCategorical, Regression

SummarizeSummarize Summary statistics, Summary rulesSummary statistics, Summary rules

Link Analysis / Model DependenciesLink Analysis / Model Dependencies Association rulesAssociation rules

Detect DeviationsDetect Deviations

Page 25: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

2525

ClusteringClustering

Find groups of Find groups of similar data similar data itemsitems

““Group people with Group people with similar travel profiles”similar travel profiles” George, PatriciaGeorge, Patricia Jeff, Evelyn, ChrisJeff, Evelyn, Chris RobRob

Clusters

Page 26: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

2626

ClassificationClassification

Find ways to separate Find ways to separate data items into pre-data items into pre-defined groupsdefined groupsA bank loan officer wants A bank loan officer wants to analyse the data in to analyse the data in order to know which order to know which customer (loan applicant) customer (loan applicant) are risky or which are are risky or which are safe.safe.

Groups

Training Data

tool produces

classifier

Page 27: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

2727

Association RulesAssociation Rules

Identify dependencies in the Identify dependencies in the data:data:

X makes Y likelyX makes Y likely

Indicate significance of each Indicate significance of each dependencydependency

““Find groups of items Find groups of items commonly purchased commonly purchased together”together”

People who purchase X People who purchase X are likely to purchase Yare likely to purchase Y

Page 28: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

2828

Deviation DetectionDeviation Detection

Find unexpected Find unexpected values,values,

Uses:Uses:Failure analysisFailure analysisAnomaly discovery for Anomaly discovery for analysisanalysis

““Find unusual Find unusual occurrences in stock occurrences in stock prices”prices”

Page 29: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Knowledge Discovery (KDD) ProcessKnowledge Discovery (KDD) Process

Data mining—core of Data mining—core of knowledge discovery knowledge discovery processprocess

Data Cleaning

Data Integration

Databases

Data Warehouse

Task-relevant Data

Selection

Data Mining

Pattern Evaluation

Page 30: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Knowledge ProcessKnowledge Process

1.1. Data cleaningData cleaning – to remove noise and inconsistent – to remove noise and inconsistent datadata

2.2. Data integrationData integration – to combine multiple source – to combine multiple source 3.3. Data selectionData selection – to retrieve relevant data for analysis – to retrieve relevant data for analysis4.4. Data transformationData transformation – to transform data into – to transform data into

appropriate form for data miningappropriate form for data mining5.5. Data miningData mining6.6. EvaluationEvaluation7.7. Knowledge presentationKnowledge presentation

Page 31: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Knowledge ProcessKnowledge Process

Although data mining is only one step in Although data mining is only one step in the entire process, it is an essential one the entire process, it is an essential one since it uncovers hidden patterns for since it uncovers hidden patterns for evaluation evaluation

Page 32: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get

Knowledge ProcessKnowledge Process

Based on this view, the architecture of a typical Based on this view, the architecture of a typical data mining system may have the following data mining system may have the following major components:major components:

Database, data warehouse, world wide web, or other Database, data warehouse, world wide web, or other information repositoryinformation repository

Database or data warehouse serverDatabase or data warehouse server Data mining engineData mining engine Pattern evaluation modelPattern evaluation model User interface User interface

Page 33: 2 Data, Data everywhere yet... We can’t find the data we need data is scattered over the network zWe can’t get the data we need yneed an expert to get