2. Data Preparation and Preprocessing
Data and Its Forms
Preparation
Preprocessing and Data Reduction
Data Types and Forms
Attribute-vector data (each record has attributes A1, A2, …, An and a class C)
Data types: numeric, categorical (see the hierarchy for their relationship); static, dynamic (temporal)
Other data forms: distributed data; text, Web, metadata; images, audio/video
You have seen most of these forms after the invited talks.
Data Preparation
An important and time-consuming task in KDD, because raw data typically come with:
high dimensionality (20, 100, 1000 attributes)
huge size
missing data
outliers
erroneous data (inconsistent, misrecorded, distorted)
Data Preparation Methods
Data annotation, as in driving-data analysis
Data normalization (image mining is another example)
Dealing with sequential or temporal data: transform it to tabular form
Removing outliers, of different types
Normalization
Decimal scaling: v'(i) = v(i) / 10^k, for the smallest k such that max(|v'(i)|) < 1. For the range between -991 and 99, k = 3 (divide by 1000), so -991 becomes -0.991.
Min-max normalization into a new min/max range: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A. For v = 73600 in [12000, 98000], v' = 0.716 in the new range [0, 1].
Zero-mean normalization: v' = (v - mean_A) / std_dev_A. For (1, 2, 3), mean and std_dev are 2 and 1, giving (-1, 0, 1). If mean_Income = 54000 and std_dev_Income = 16000, then v = 73600 maps to 1.225.
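To make the three formulas concrete, here is a minimal Python sketch; it is an illustration rather than part of the original slides, and the function names and the three-value income sample are my own.

    import numpy as np
    from math import floor, log10

    def decimal_scaling(v):
        # Smallest k with max(|v|) / 10^k < 1, i.e. k = floor(log10(max|v|)) + 1.
        k = floor(log10(np.max(np.abs(v)))) + 1
        return v / 10.0 ** k

    def min_max(v, new_min=0.0, new_max=1.0):
        # Map [min_A, max_A] linearly onto [new_min, new_max].
        return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    def zero_mean(v):
        # ddof=1 matches the slide's example, where (1, 2, 3) has std_dev 1.
        return (v - v.mean()) / v.std(ddof=1)

    incomes = np.array([12000.0, 73600.0, 98000.0])   # hypothetical sample
    print(min_max(incomes)[1])                        # 0.716..., as on the slide
    print((73600 - 54000) / 16000)                    # 1.225 with the given mean/std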
Temporal Data
The goal is to forecast t(n+1) from the previous values X = {t(1), t(2), …, t(n)}
An example with two features and window size 3; how do we determine the window size? (A sketch of this transform follows the tables.)

Time  A   B
 1     7  215
 2    10  211
 3     6  214
 4    11  221
 5    12  210
 6    14  218

Inst  A(n-2)  A(n-1)  A(n)  B(n-2)  B(n-1)  B(n)
 1       7      10      6     215     211    214
 2      10       6     11     211     214    221
 3       6      11     12     214     221    210
 4      11      12     14     221     210    218
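A small sketch of the window transform that turns the first table into the second; the helper name `windows` is my own.

    def windows(seq, size):
        # One row per window position: seq[i], seq[i+1], ..., seq[i+size-1].
        return [seq[i:i + size] for i in range(len(seq) - size + 1)]

    a = [7, 10, 6, 11, 12, 14]
    b = [215, 211, 214, 221, 210, 218]
    tabular = [ra + rb for ra, rb in zip(windows(a, 3), windows(b, 3))]
    print(tabular[0])   # [7, 10, 6, 215, 211, 214], instance 1 above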
Outlier Removal
Data points inconsistent with the majority of the data
Different outliers. Valid: a CEO's salary. Noisy: one's age = 200; widely deviated points.
Removal methods: clustering, curve-fitting, hypothesis-testing with a given model (a simple deviation-based sketch follows)
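As one concrete possibility for deviation-based removal (the slide only names the method families), here is a z-score sketch; the threshold of 1.5 and the sample ages are illustrative choices, not values from the slides.

    import numpy as np

    def deviation_outliers(values, z_thresh):
        # Flag points whose z-score magnitude exceeds the threshold.
        v = np.asarray(values, dtype=float)
        z = (v - v.mean()) / v.std(ddof=1)
        return np.abs(z) > z_thresh

    ages = [23, 31, 27, 45, 38, 200]                 # age = 200 is the noisy point
    print(deviation_outliers(ages, z_thresh=1.5))    # only the 200 is flagged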
Data Preprocessing
Data cleaning: missing data, noisy data, inconsistent data
Data reduction: dimensionality reduction, instance selection, value discretization
Missing Data
Many types of missing data: not measured, truly missed, wrongly placed, and …?
Some methods: leave as is; ignore/remove the instance with the missing value; manual fix (assign a value with an implicit meaning); statistical methods (majority, most likely, mean, nearest neighbor, …); see the sketch below
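A minimal sketch of two of the statistical methods, mean and majority imputation; the helper name and its exact behavior are my own reading of the list above.

    def impute(column, kind="mean"):
        # Fill NaNs with the column mean or the majority value.
        # NaN != NaN, so the test v == v drops missing entries.
        vals = [v for v in column if v == v]
        if kind == "mean":
            fill = sum(vals) / len(vals)
        else:                                        # "majority"
            fill = max(set(vals), key=vals.count)
        return [fill if v != v else v for v in column]

    print(impute([1.0, float("nan"), 3.0]))          # [1.0, 2.0, 3.0]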
Noisy Data
Random error or variance in a measured variable: inconsistent values for features or classes (the process); measuring errors (the source)
Noise is normally a minority in the data set. Why?
Removing noise: clustering/merging; smoothing (rounding, averaging within a window); outlier detection (deviation-based or distance-based)
Inconsistent Data
Inconsistent with our models or common sense
Examples:
The same name occurs differently within an application
Different names appear the same (Dennis vs. Denis)
Inappropriate values (Male-Pregnant, negative age)
One bank's database shows that 5% of its customers were born on 11/11/11
Dimensionality Reduction
Feature selection: select m of the n features (m ≤ n); remove irrelevant and redundant features; the saving is in the search space
Feature transformation (e.g., PCA): form new features (a) in a new domain from the original features (f); many uses, but it does not by itself reduce the original dimensionality; often used in visualization of data
Feature Selection
Problem illustration: a search space of feature subsets between the full set and the empty set
Search strategies:
exhaustive/complete (enumeration, B&B)
heuristic (sequential forward/backward)
stochastic (generate and evaluate)
Generation/evaluation of individual features or of subsets
Feature Selection (2)
Goodness metrics:
dependency: depending on classes
distance: separating classes
information: entropy
consistency: 1 - #inconsistencies/N; for example, with the data below, both (F1, F2, F3) and (F1, F3) have a 2/6 inconsistency rate (see the sketch after the table)
accuracy (classifier-based): 1 - errorRate
Comparing the metrics: time complexity, number of features, removing redundancy

F1  F2  F3  C
 0   0   1  1
 0   0   1  0
 0   0   1  1
 1   0   0  1
 1   0   0  0
 1   0   0  0
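The consistency metric can be computed directly; this sketch is mine, following the "majority class per value pattern" reading of #inconsistencies, and it reproduces the 2/6 rate for both subsets.

    from collections import Counter, defaultdict

    def inconsistency_rate(rows, features, classes):
        # Group instances by their values on the selected features; each group
        # contributes (group size - majority class count) inconsistencies.
        groups = defaultdict(list)
        for row, c in zip(rows, classes):
            groups[tuple(row[f] for f in features)].append(c)
        bad = sum(len(g) - max(Counter(g).values()) for g in groups.values())
        return bad / len(rows)

    rows = [(0, 0, 1), (0, 0, 1), (0, 0, 1), (1, 0, 0), (1, 0, 0), (1, 0, 0)]
    cls = [1, 0, 1, 1, 0, 0]
    print(inconsistency_rate(rows, (0, 1, 2), cls))   # 2/6 for (F1, F2, F3)
    print(inconsistency_rate(rows, (0, 2), cls))      # 2/6 for (F1, F3)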
Feature Selection (3)
Filter vs. wrapper model: pros and cons in time, generality, and performance (such as accuracy)
Stopping criteria:
thresholding (number of iterations, some accuracy level, …)
anytime algorithms: provide approximate solutions, and the solutions improve over time
Feature Selection (Examples)
SFS using consistency (cRate): select 1 feature from n, then 1 more from the remaining n-1, n-2, … features, increasing the number of selected features until the pre-specified cRate is reached
LVF using consistency (cRate):
1. randomly generate a subset S from the full set
2. if S satisfies the prespecified cRate, keep the S with minimum |S|
3. go back to step 1 until a stopping criterion is met
LVF is an anytime algorithm (a sketch follows below)
Many other algorithms: SBS, B&B, …
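A rough sketch of LVF under my reading of the three steps, reusing inconsistency_rate and the example data from the previous sketch; the iteration count and tie handling are arbitrary choices.

    import random

    def lvf(rows, classes, n_features, max_rate, iters=1000, seed=0):
        # Randomly sample feature subsets; keep the smallest one whose
        # inconsistency rate stays within max_rate. Anytime behavior: the
        # best subset found so far only improves as iterations continue.
        rng = random.Random(seed)
        best = tuple(range(n_features))               # start from the full set
        for _ in range(iters):
            k = rng.randint(1, n_features)
            s = tuple(sorted(rng.sample(range(n_features), k)))
            if len(s) < len(best) and inconsistency_rate(rows, s, classes) <= max_rate:
                best = s
        return best

    print(lvf(rows, cls, 3, max_rate=2 / 6))          # a single feature suffices here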
Transformation: PCA
D' = DA, where D is the mean-centered data matrix (N×n)
Calculate and rank the eigenvalues λ of the covariance matrix
Select the largest λ's such that r > a threshold (e.g., 0.95), where r = (Σ_{i=1}^{m} λ_i) / (Σ_{i=1}^{n} λ_i)
The corresponding eigenvectors form A (n×m)

Example on the Iris data:

   E-value  Diff     Prop     Cumu
1  2.91082  1.98960  0.72771  0.72770
2  0.92122  0.77387  0.23031  0.95801
3  0.14735  0.12675  0.03684  0.99485
4  0.02061           0.00515  1.00000

     V1         V2        V3         V4
F1    0.522372  0.372318  -0.721017  -0.261996
F2   -0.263355  0.925556   0.242033   0.124135
F3    0.581254  0.021095   0.140892   0.801154
F4    0.565611  0.065416   0.633801  -0.523546
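The selection rule can be written down directly. A minimal NumPy sketch, assuming eigendecomposition of the sample covariance matrix; the function name and the default threshold are mine.

    import numpy as np

    def pca_reduce(D, threshold=0.95):
        # D' = D A: mean-center D (N x n), eigendecompose the covariance
        # matrix, and keep the top m eigenvectors whose cumulative eigenvalue
        # proportion r first exceeds the threshold.
        D = D - D.mean(axis=0)
        evals, evecs = np.linalg.eigh(np.cov(D, rowvar=False))
        order = np.argsort(evals)[::-1]               # largest eigenvalues first
        evals, evecs = evals[order], evecs[:, order]
        r = np.cumsum(evals) / evals.sum()
        m = int(np.searchsorted(r, threshold)) + 1
        return D @ evecs[:, :m]                       # N x m projected data

On the Iris numbers above this picks m = 2, since the cumulative proportion reaches 0.95801 > 0.95 at the second eigenvalue.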
Instance Selection
Sampling methods: random sampling, stratified sampling
Search-based methods: representatives, prototypes, sufficient statistics (N, mean, stdDev), support vectors
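As a concrete illustration of stratified sampling (my own sketch, not from the slides): draw the same fraction from every class, so that small classes are not lost as they might be under plain random sampling.

    import random
    from collections import defaultdict

    def stratified_sample(rows, classes, frac, seed=0):
        # Sample the same fraction from each class separately.
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for row, c in zip(rows, classes):
            by_class[c].append(row)
        picked = []
        for group in by_class.values():
            picked.extend(rng.sample(group, max(1, round(frac * len(group)))))
        return picked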
Value Discretization
Binning methods: equal-width, equal-frequency (class information is not used)
Entropy-based
ChiMerge
Chi2
Binning
Attribute values (for one attribute, e.g., age): 0, 4, 12, 16, 16, 18, 24, 26, 28
Equal-width binning, for a bin width of e.g. 10:
Bin 1, the [-∞, 10) bin: 0, 4
Bin 2, the [10, 20) bin: 12, 16, 16, 18
Bin 3, the [20, +∞) bin: 24, 26, 28
(we use -∞ to denote negative infinity, +∞ positive infinity)
Equal-frequency binning, for a bin density of e.g. 3:
Bin 1, the [-∞, 14) bin: 0, 4, 12
Bin 2, the [14, 21) bin: 16, 16, 18
Bin 3, the [21, +∞) bin: 24, 26, 28
Any problems with the above methods? (A sketch of both methods follows.)
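A sketch of both binning methods; the function names are mine, and the equal-width indexing assumes bins aligned at multiples of the width, which matches the [10, 20) style boundaries above.

    def equal_width(values, width):
        # Bin index i covers [i * width, (i + 1) * width).
        return [v // width for v in values]

    def equal_frequency(values, density):
        # Sort, then cut into consecutive runs of `density` values.
        s = sorted(values)
        return [s[i:i + density] for i in range(0, len(s), density)]

    ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
    print(equal_width(ages, 10))      # [0, 0, 1, 1, 1, 1, 2, 2, 2]
    print(equal_frequency(ages, 3))   # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]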
Entropy-based
Given attribute-value/class pairs: (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
Entropy-based binning via binarization: intuitively, find the best split so that the bins are as pure as possible; formally, this is characterized by maximal information gain
Let S denote the above 9 pairs, p = 4/9 the fraction of P pairs, and n = 5/9 the fraction of N pairs
Entropy(S) = -p log2(p) - n log2(n)
Smaller entropy means the set is relatively pure (the smallest value is 0); larger entropy means the set is mixed (the largest value, for two classes, is 1)
Entropy-based (2)
Let v be a possible split; S is then divided into two sets, S1 (value <= v) and S2 (value > v)
Information of the split: I(S1, S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2)
Information gain of the split: Gain(v, S) = Entropy(S) - I(S1, S2)
Goal: the split with maximal information gain; the possible splits are the midpoints between any two consecutive values
For v = 14: I(S1, S2) = 0 + (6/9) Entropy(S2) = (6/9) * 0.65 = 0.433, so Gain(14, S) = Entropy(S) - 0.433
Since Entropy(S) is fixed, maximum gain means minimum I; the best split is found after examining all possible split points (see the sketch below)
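The v = 14 computation generalizes into a small search over all midpoints; this sketch is mine and reproduces the slide's numbers.

    from math import log2

    def entropy(labels):
        # -p log2 p summed over classes; 0 for a pure set.
        n = len(labels)
        return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                    for c in set(labels))

    def best_split(pairs):
        # Minimize I(S1, S2) over midpoints of consecutive distinct values,
        # which maximizes the information gain since Entropy(S) is fixed.
        pairs = sorted(pairs)
        best_v, best_i = None, float("inf")
        for (a, _), (b, _) in zip(pairs, pairs[1:]):
            if a == b:
                continue
            v = (a + b) / 2
            s1 = [c for x, c in pairs if x <= v]
            s2 = [c for x, c in pairs if x > v]
            i = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(pairs)
            if i < best_i:
                best_v, best_i = v, i
        return best_v, best_i

    pairs = [(0, 'P'), (4, 'P'), (12, 'P'), (16, 'N'), (16, 'N'),
             (18, 'P'), (24, 'N'), (26, 'N'), (28, 'N')]
    print(best_split(pairs))          # (14.0, 0.433...), matching the slide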
ChiMerge and Chi2
Given attribute-value/class pairs, build a contingency table for every pair of adjacent intervals (I) and apply the chi-squared test (goodness-of-fit)
Parameters: df = k - 1 (for k classes) and a p% level of significance
The Chi2 algorithm provides an automatic way to adjust p

F   C
12  P
12  N
12  P
16  N
16  N
16  P
24  N
24  N
24  N

        C1   C2   total
I-1     A11  A12  R1
I-2     A21  A22  R2
total   C1   C2   N

χ² = Σ_{i=1}^{2} Σ_{j=1}^{k} (A_ij - E_ij)² / E_ij, where E_ij = R_i * C_j / N is the expected frequency
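A sketch of the statistic itself (mine; the df and significance-level lookup are omitted), applied to the adjacent intervals for values 12 and 16 from the table above.

    def chi2_stat(table):
        # Sum of (A - E)^2 / E over a 2 x k contingency table, with expected
        # counts E = row_total * column_total / N.
        rows = [sum(r) for r in table]
        cols = [sum(c) for c in zip(*table)]
        n = sum(rows)
        return sum((a - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
                   for i, r in enumerate(table) for j, a in enumerate(r)
                   if rows[i] and cols[j])

    # Adjacent intervals {12} and {16}: counts of classes P and N per interval.
    print(chi2_stat([[2, 1],
                     [1, 2]]))        # 0.667; ChiMerge merges low-chi2 pairs first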
Summary
Data have many forms; attribute-vector data is the most common form
Raw data need to be prepared and preprocessed for data mining; data miners have to work on the data provided, so domain expertise is important in DPP
Data preparation: normalization, transformation
Data preprocessing: cleaning and reduction
DPP is a critical and time-consuming task. Why?
Bibliography
H. Liu & H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.
M. Kantardzic, 2003. Data Mining - Concepts, Models, Methods, and Algorithms. IEEE and Wiley Inter-Science.
H. Liu & H. Motoda, edited, 2001. Instance Selection and Construction for Data Mining. Kluwer.
H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. DMKD 6:393-423.