2. Data Preparation and Preprocessing


Page 1: 2. Data Preparation and Preprocessing


2. Data Preparation and Preprocessing

Data and Its Forms
Preparation
Preprocessing and Data Reduction

Page 2: 2. Data Preparation and Preprocessing

2/4/03 CSE 575 Data Mining by H. Liu 2

Data Types and Forms

Attribute-vector data (A1, A2, …, An, C). Data types:
- numeric, categorical (see the hierarchy for their relationship)
- static, dynamic (temporal)

Other data forms:
- distributed data
- text, Web, meta data
- images, audio/video

You have seen most of them after the invited talks.

Page 3: 2. Data Preparation and Preprocessing


Data Preparation

An important & time-consuming task in KDD:
- High-dimensional data (20, 100, 1000 attributes)
- Huge size data
- Missing data
- Outliers
- Erroneous data (inconsistent, misrecorded, distorted)
- Raw data

Page 4: 2. Data Preparation and Preprocessing


Data Preparation Methods

- Data annotation, as in driving data analysis
- Data normalization; another example is image mining
- Dealing with sequential or temporal data: transform it to tabular form
- Removing outliers: different types

Page 5: 2. Data Preparation and Preprocessing


Normalization

- Decimal scaling: v′(i) = v(i)/10^k for the smallest k such that max(|v′(i)|) < 1. For the range between −991 and 99, k is 3 (divide by 10^3 = 1000): −991 → −0.991.
- Min-max normalization into a new max/min range: v′ = (v − min_A)/(max_A − min_A) × (new_max_A − new_min_A) + new_min_A. For v = 73600 in [12000, 98000], v′ ≈ 0.716 in the new range [0, 1].
- Zero-mean normalization: v′ = (v − mean_A)/std_dev_A. For (1, 2, 3), mean and std_dev are 2 and 1, giving (−1, 0, 1). If mean_Income = 54000 and std_dev_Income = 16000, then v = 73600 gives v′ = 1.225.
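The three normalization methods above can be sketched in plain Python (a sketch; the sample standard deviation is assumed, matching the (1, 2, 3) → (−1, 0, 1) example):

```python
import math

def decimal_scaling(values):
    """v'(i) = v(i) / 10^k for the smallest k with max(|v'(i)|) < 1."""
    k = 0
    while max(abs(v) for v in values) / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def zero_mean(values):
    """z-score: (v - mean) / std_dev, using the sample standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / std for v in values]
```

On the slide's examples, decimal_scaling([-991, 99]) divides by 1000 and yields −0.991 and 0.099, and min_max maps 73600 within [12000, 98000] to about 0.716.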

Page 6: 2. Data Preparation and Preprocessing


Temporal Data

The goal is to forecast t(n+1) from previous values X = {t(1), t(2), …, t(n)}.

An example with two features and window size 3. How to determine the window size?

Time  A   B
1     7   215
2     10  211
3     6   214
4     11  221
5     12  210
6     14  218

Inst  A(n-2)  A(n-1)  A(n)  B(n-2)  B(n-1)  B(n)
1     7       10      6     215     211     214
2     10      6       11    211     214     221
3     6       11      12    214     221     210
4     11      12      14    221     210     218
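The transform-to-tabular-form step above can be sketched as a sliding window over the (A, B) series (a sketch; the function name is mine):

```python
def windowize(series, w=3):
    """series: list of (A, B) rows; returns instances of
    [A(n-2), A(n-1), A(n), B(n-2), B(n-1), B(n)] for window size w=3."""
    rows = []
    for i in range(len(series) - w + 1):
        window = series[i:i + w]
        # group all A values first, then all B values, as in the slide's table
        rows.append([a for a, _ in window] + [b for _, b in window])
    return rows

# the Time/A/B series from the slide
ab = [(7, 215), (10, 211), (6, 214), (11, 221), (12, 210), (14, 218)]
```

windowize(ab) reproduces the four instances of the second table.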

Page 7: 2. Data Preparation and Preprocessing


Outlier Removal

Data points inconsistent with the majority of the data.

Different outliers:
- Valid: a CEO's salary
- Noisy: one's age = 200; widely deviated points

Removal methods:
- Clustering
- Curve-fitting
- Hypothesis-testing with a given model

Page 8: 2. Data Preparation and Preprocessing


Data Preprocessing

Data cleaning:
- missing data
- noisy data
- inconsistent data

Data reduction:
- Dimensionality reduction
- Instance selection
- Value discretization

Page 9: 2. Data Preparation and Preprocessing


Missing Data

Many types of missing data: not measured, truly missed, wrongly placed, and "?".

Some methods:
- leave as is
- ignore/remove the instance with the missing value
- manual fix (assign a value for implicit meaning)
- statistical methods (majority, most likely, mean, nearest neighbor, …)
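Two of the statistical methods listed above can be sketched in a few lines: mean for a numeric attribute and majority for a categorical one (a sketch; None marks a missing value):

```python
from collections import Counter

def fill_mean(values):
    """Replace missing numeric values with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def fill_majority(values):
    """Replace missing categorical values with the most common known value."""
    known = [v for v in values if v is not None]
    majority = Counter(known).most_common(1)[0][0]
    return [majority if v is None else v for v in values]
```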

Page 10: 2. Data Preparation and Preprocessing


Noisy Data

Random error or variance in a measured variable:
- inconsistent values for features or classes (process)
- measuring errors (source)

Noise is normally a minority in the data set. Why?

Removing noise:
- Clustering/merging
- Smoothing (rounding, averaging within a window)
- Outlier detection (deviation-based or distance-based)
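Averaging within a window, one of the smoothing techniques named above, can be sketched as a centered moving average (a sketch; the window shrinks at the ends of the series):

```python
def smooth(values, w=3):
    """Replace each value with the average of a window of width w around it."""
    half = w // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out
```

A single noisy spike such as the 10 in [1, 1, 10, 1, 1] gets spread out and damped.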

Page 11: 2. Data Preparation and Preprocessing


Inconsistent Data

Inconsistent with our models or common sense.

Examples:
- The same name occurs differently in an application
- Different names appear the same (Dennis vs. Denis)
- Inappropriate values (Male-Pregnant, negative age)
- One bank's database shows that 5% of its customers were born on 11/11/11

Page 12: 2. Data Preparation and Preprocessing


Dimensionality Reduction

Feature selection:
- select m from n features, m ≤ n
- remove irrelevant, redundant features
- the saving is in search space

Feature transformation (PCA):
- form new features (a) in a new domain from original features (f)
- many uses, but it does not reduce the original dimensionality
- often used in visualization of data

Page 13: 2. Data Preparation and Preprocessing


Feature Selection

Problem illustration: full set, empty set, enumeration.

Search:
- Exhaustive/complete (enumeration, B&B)
- Heuristic (sequential forward/backward)
- Stochastic (generate/evaluate)
- Individual features or subsets: generation/evaluation

Page 14: 2. Data Preparation and Preprocessing


Feature Selection (2)

Goodness metrics:
- Dependency: depending on classes
- Distance: separating classes
- Information: entropy
- Consistency: 1 − #inconsistencies/N. Example: for the table below, both (F1, F2, F3) and (F1, F3) have a 2/6 inconsistency rate.
- Accuracy (classifier-based): 1 − errorRate

Their comparisons: time complexity, number of features, removing redundancy.

F1  F2  F3  C
0   0   1   1
0   0   1   0
0   0   1   1
1   0   0   1
1   0   0   0
1   0   0   0
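The inconsistency rate used as the consistency metric can be sketched as follows: instances with identical values on the selected features but different classes are inconsistent, and each such group contributes all but its majority class (a sketch; names are mine):

```python
from collections import Counter, defaultdict

def inconsistency_rate(data, features):
    """data: list of (feature_tuple, class); features: indices of kept features."""
    groups = defaultdict(Counter)
    for fv, c in data:
        groups[tuple(fv[i] for i in features)][c] += 1
    # within each identical-pattern group, all but the majority class are inconsistent
    bad = sum(sum(g.values()) - max(g.values()) for g in groups.values())
    return bad / len(data)

# the F1/F2/F3/C table from the slide
table = [((0, 0, 1), 1), ((0, 0, 1), 0), ((0, 0, 1), 1),
         ((1, 0, 0), 1), ((1, 0, 0), 0), ((1, 0, 0), 0)]
```

Both (F1, F2, F3) and (F1, F3) give the 2/6 rate stated above, since F2 is constant.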

Page 15: 2. Data Preparation and Preprocessing


Feature Selection (3)

Filter vs. wrapper model, pros and cons:
- time
- generality
- performance such as accuracy

Stopping criteria:
- thresholding (number of iterations, some accuracy, …)
- anytime algorithms: providing approximate solutions; solutions improve over time

Page 16: 2. Data Preparation and Preprocessing


Feature Selection (Examples)

SFS using consistency (cRate):
- select 1 from n, then 1 from n−1, n−2, … features
- increase the number of selected features until the pre-specified cRate is reached

LVF using consistency (cRate):
1. randomly generate a subset S from the full set
2. if it satisfies the prespecified cRate, keep the S with minimum #S
3. go back to 1 until a stopping criterion is met

LVF is an anytime algorithm. Many other algorithms: SBS, B&B, ...
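The LVF loop above can be sketched directly (a sketch; inconsistency_rate is the consistency metric from the previous slide, re-stated here so the example is self-contained):

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(data, features):
    groups = defaultdict(Counter)
    for fv, c in data:
        groups[tuple(fv[i] for i in features)][c] += 1
    return sum(sum(g.values()) - max(g.values()) for g in groups.values()) / len(data)

def lvf(data, n_features, max_tries, allowed):
    """Randomly sample subsets; keep the smallest one within the allowed rate."""
    best = list(range(n_features))            # start from the full feature set
    for _ in range(max_tries):
        size = random.randint(1, len(best))   # never bother with larger subsets
        subset = sorted(random.sample(range(n_features), size))
        if len(subset) < len(best) and inconsistency_rate(data, subset) <= allowed:
            best = subset
    return best
```

Being a Las Vegas-style randomized search, it can be stopped at any time and still return the best subset found so far, which is what makes it an anytime algorithm.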

Page 17: 2. Data Preparation and Preprocessing


Transformation: PCA

D′ = DA, where D is mean-centered and of size N×n.

- Calculate and rank the eigenvalues λ of the covariance matrix
- Select the largest λ's such that r > threshold (e.g., 0.95), where

  r = (λ1 + λ2 + … + λm) / (λ1 + λ2 + … + λn)

- The corresponding eigenvectors form A (n×m)

Example on the Iris data:

   E-value  Diff     Prop     Cumu
1  2.91082  1.98960  0.72771  0.72770
2  0.92122  0.77387  0.23031  0.95801
3  0.14735  0.12675  0.03684  0.99485
4  0.02061           0.00515  1.00000

    V1         V2        V3         V4
F1  0.522372   0.372318  -0.721017  -0.261996
F2  -0.263355  0.925556  0.242033   0.124135
F3  0.581254   0.021095  0.140892   0.801154
F4  0.565611   0.065416  0.633801   -0.523546
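The procedure above can be sketched with numpy: mean-center D, rank the covariance eigenvalues, keep the largest ones until the ratio r exceeds the threshold, and project with the corresponding eigenvectors A (a sketch, not the exact code used for the Iris numbers shown):

```python
import numpy as np

def pca(D, threshold=0.95):
    D = D - D.mean(axis=0)                       # mean-center the N x n matrix
    vals, vecs = np.linalg.eigh(np.cov(D, rowvar=False))
    order = np.argsort(vals)[::-1]               # rank eigenvalues, largest first
    vals, vecs = vals[order], vecs[:, order]
    r = np.cumsum(vals) / vals.sum()             # cumulative proportion, as in Cumu
    m = int(np.searchsorted(r, threshold)) + 1   # smallest m whose r reaches threshold
    A = vecs[:, :m]                              # the n x m projection matrix
    return D @ A, vals
```

With the Iris cumulative proportions (0.72770, 0.95801, …) and threshold 0.95, this keeps m = 2 components, as on the slide.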

Page 18: 2. Data Preparation and Preprocessing


Instance Selection

Sampling methods:
- random sampling
- stratified sampling

Search-based methods:
- Representatives
- Prototypes
- Sufficient statistics (N, mean, stdDev)
- Support vectors

Page 19: 2. Data Preparation and Preprocessing


Value Discretization

Binning methods (class information is not used):
- Equal-width
- Equal-frequency

Entropy-based

ChiMerge

Chi2

Page 20: 2. Data Preparation and Preprocessing


Binning

Attribute values (for one attribute, e.g., age): 0, 4, 12, 16, 16, 18, 24, 26, 28

Equi-width binning, for a bin width of e.g. 10:
- Bin 1 [−∞, 10): 0, 4
- Bin 2 [10, 20): 12, 16, 16, 18
- Bin 3 [20, +∞): 24, 26, 28

Equi-frequency binning, for a bin density of e.g. 3:
- Bin 1 [−∞, 14): 0, 4, 12
- Bin 2 [14, 21): 16, 16, 18
- Bin 3 [21, +∞): 24, 26, 28

Any problems with the above methods?
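Both schemes can be sketched in a few lines on the age values above (a sketch; function names are mine):

```python
def equi_width(values, width):
    """Group values into bins of fixed width."""
    bins = {}
    for v in sorted(values):
        bins.setdefault(v // width, []).append(v)
    return [bins[k] for k in sorted(bins)]

def equi_freq(values, density):
    """Group sorted values into bins of a fixed number of values each."""
    vs = sorted(values)
    return [vs[i:i + density] for i in range(0, len(vs), density)]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
```

Both reproduce the three bins shown above; note that neither looks at the class labels, which is exactly the problem the question at the end hints at.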

Page 21: 2. Data Preparation and Preprocessing


Entropy-based

Given attribute-value/class pairs: (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)

Entropy-based binning via binarization: intuitively, find the best split so that the bins are as pure as possible. Formally, this is characterized by maximal information gain.

Let S denote the above 9 pairs, p = 4/9 be the fraction of P pairs, and n = 5/9 be the fraction of N pairs. Then

Entropy(S) = −p log2 p − n log2 n

Smaller entropy means the set is relatively pure; the smallest is 0. Larger entropy means the set is mixed; the largest is 1.

Page 22: 2. Data Preparation and Preprocessing


Entropy-based (2)

Let v be a possible split. Then S is divided into two sets: S1 (value ≤ v) and S2 (value > v).

Information of the split:
I(S1, S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2)

Information gain of the split:
Gain(v, S) = Entropy(S) − I(S1, S2)

Goal: the split with maximal information gain. Possible splits are the mid-points between any two consecutive values.

For v = 14: I(S1, S2) = 0 + (6/9) × Entropy(S2) = (6/9) × 0.65 = 0.433, so Gain(14, S) = Entropy(S) − 0.433.

Maximum Gain means minimum I. The best split is found after examining all possible split points.
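The whole search can be sketched directly on the nine pairs: compute the entropy, evaluate every mid-point split, and keep the one with maximal gain (a sketch; function names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    """-sum p_c log2 p_c over the class fractions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(pairs):
    """Return (split point, information gain) of the best binarization."""
    values = sorted({v for v, _ in pairs})
    base = entropy([c for _, c in pairs])
    best_v, best_gain = None, -1.0
    for a, b in zip(values, values[1:]):
        v = (a + b) / 2                      # mid-point between consecutive values
        s1 = [c for x, c in pairs if x <= v]
        s2 = [c for x, c in pairs if x > v]
        info = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(pairs)
        if base - info > best_gain:
            best_v, best_gain = v, base - info
    return best_v, best_gain

pairs = [(0, 'P'), (4, 'P'), (12, 'P'), (16, 'N'), (16, 'N'),
         (18, 'P'), (24, 'N'), (26, 'N'), (28, 'N')]
```

On these pairs the winner is v = 14, with gain Entropy(S) − 0.433 ≈ 0.558, matching the worked example above.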

Page 23: 2. Data Preparation and Preprocessing


ChiMerge and Chi2

Given attribute-value/class pairs, build a contingency table for every pair of adjacent intervals (I).

Chi-squared test (goodness-of-fit):

χ² = Σ (i = 1..2) Σ (j = 1..k) (A_ij − E_ij)² / E_ij

Parameters: df = k − 1 and a p% level of significance.

The Chi2 algorithm provides an automatic way to adjust p.

Attribute-value/class pairs:

F   C
12  P
12  N
12  P
16  N
16  N
16  P
24  N
24  N
24  N

Contingency table for a pair of intervals (R1, R2 are row totals; the bottom row holds the column totals and the grand total N):

      C1   C2
I-1   A11  A12  R1
I-2   A21  A22  R2
      C1   C2   N
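The statistic ChiMerge computes for one such contingency table can be sketched as follows, with E_ij the usual expected count row_total × column_total / N (a sketch; the function name is mine):

```python
def chi2_stat(table):
    """table: 2 rows (adjacent intervals) x k columns (class counts)."""
    rows = [sum(r) for r in table]
    cols = [sum(r[j] for r in table) for j in range(len(table[0]))]
    n = sum(rows)
    stat = 0.0
    for i, r in enumerate(table):
        for j, a in enumerate(r):
            e = rows[i] * cols[j] / n        # expected count E_ij
            if e:
                stat += (a - e) ** 2 / e     # (A_ij - E_ij)^2 / E_ij
    return stat
```

For the intervals 12 and 16 in the F/C table above, the class counts are (P=2, N=1) and (P=1, N=2), giving χ² = 2/3; a low value like this means the intervals have similar class distributions and are candidates for merging.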

Page 24: 2. Data Preparation and Preprocessing


Summary

- Data have many forms; attribute-vectors are the most common form.
- Raw data need to be prepared and preprocessed for data mining. Data miners have to work on the data provided; domain expertise is important in DPP.
- Data preparation: normalization, transformation.
- Data preprocessing: cleaning and reduction.
- DPP is a critical and time-consuming task. Why?

Page 25: 2. Data Preparation and Preprocessing


Bibliography

H. Liu & H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.

M. Kantardzic, 2003. Data Mining: Concepts, Models, Methods, and Algorithms. IEEE Press and Wiley Inter-Science.

H. Liu & H. Motoda, editors, 2001. Instance Selection and Construction for Data Mining. Kluwer.

H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. Data Mining and Knowledge Discovery 6:393-423.