1
4. Data Reduction
School of Applied Chemistry, 송상옥 (Song Sang-ok)
2
Outline
Why data reduction is needed
The role and forms of dimension reduction
Specific methods of dimension reduction
3
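Why Is It Needed?
Too much data
– exceeds the capacity of the prediction program
– lengthens the time needed to find a solution
The appropriate amount of data
– depends on the complexity of the concept contained in the data (the complexity of the model)
– cannot be known before mining
– Ex) random data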
4
The Role of Dimension Reduction
5
Forms of Dimension Reduction
Delete a column (feature)
Delete a row (case)
Reduce the number of values in a column (smooth a feature)
Transformation to a new data set (PCA)
6
Best Features Selection
Impossible!
– search space
– computational time
Approximation
– promising subsets
– simple distance measure
– using only training error
7
Mean and Variance
Cases: a sample from some distribution
Spreadsheet: mean and variance
BUT the distribution is unknown
Heuristic feature selection guidance
8
Independent Features
Classification problem
k classes classification– k pairwise comparison
Regression = pseudo-classification
$$se(A - B) = \sqrt{\frac{var(A)}{n_1} + \frac{var(B)}{n_2}}$$
$$\frac{\left| mean(A) - mean(B) \right|}{se(A - B)} > sig$$
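The significance test on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `is_significant`, the default threshold `sig=2.0`, the use of the sample variance (n - 1 denominator), and the toy data are all assumptions, not part of the slides.

```python
import math
from statistics import mean, variance


def is_significant(a, b, sig=2.0):
    """Return True if |mean(A) - mean(B)| / se(A - B) > sig for one feature.

    a, b: values of the same feature for the cases of class A and class B.
    sig:  threshold in standard errors (2.0 is an assumed default).
    """
    # se(A - B) = sqrt(var(A)/n1 + var(B)/n2), using sample variances
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Degenerate case: both classes are constant on this feature
        return mean(a) != mean(b)
    return abs(mean(a) - mean(b)) / se > sig


# Toy data (made up): a feature that separates the classes, and one that does not
useful_a = [5.1, 4.9, 5.4, 5.0, 5.2]
useful_b = [6.8, 7.1, 6.9, 7.3, 7.0]
noise_a = [1.0, 2.0, 3.0, 4.0, 5.0]
noise_b = [1.5, 2.5, 2.0, 4.5, 4.0]

keep_useful = is_significant(useful_a, useful_b)
keep_noise = is_significant(noise_a, noise_b)
```

A feature whose class means do not differ significantly would be a candidate for deletion; for k classes the same check is repeated per pairwise comparison.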
9
Distance Based Selection
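Independent analysis + correlation analysis: detect redundancy
Distance measure
$$D_M = (M_1 - M_2)\,(C_1 + C_2)^{-1}\,(M_1 - M_2)^{T}$$
– Independent features
$$D_M = \sum_i \frac{\left( m_1(i) - m_2(i) \right)^2}{var_1(i) + var_2(i)}$$
Branch-and-Bound Algorithm
$$D_M(F) \le D_M(F, i)$$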
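For independent features the distance reduces to a sum over features of (m1(i) - m2(i))^2 / (var1(i) + var2(i)). A minimal sketch of that sum follows; the function name, the optional `features` argument, and the toy data are illustrative assumptions, and sample variances are used.

```python
from statistics import mean, variance


def independent_distance(class1, class2, features=None):
    """Distance between two classes, assuming independent features.

    class1, class2: lists of cases, each case a list of feature values.
    features: optional iterable of column indices; all columns if None.
    Note: a feature that is constant in both classes would give a zero
    denominator; this sketch does not handle that case.
    """
    cols1 = list(zip(*class1))  # transpose rows (cases) into columns (features)
    cols2 = list(zip(*class2))
    if features is None:
        features = range(len(cols1))
    return sum(
        (mean(cols1[i]) - mean(cols2[i])) ** 2
        / (variance(cols1[i]) + variance(cols2[i]))
        for i in features
    )


# Toy data (made up): feature 0 separates the classes, feature 1 does not
class1 = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]]
class2 = [[4.0, 10.5], [5.0, 11.5], [6.0, 11.0]]

d_all = independent_distance(class1, class2)
d_first = independent_distance(class1, class2, features=[0])
d_second = independent_distance(class1, class2, features=[1])
```

Each term of the sum is non-negative, so adding a feature never decreases the distance. This is the property a branch-and-bound search can exploit to prune subsets.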
10
Heuristic Feature Selection
Comparison measures
– Significance test
– D_M
– F-test
11
Principal Components
Merging features
– a new set of fewer columns
First k components
First principal component
– minimum Euclidean distance
Feature with a large variance
– excellent chances for separation of classes or groups of case values
$$S' = S P$$
12
Decision Trees
Dynamic logic approach
– coordinated with searching for a solution
Advantageous in large feature spaces
Recursive partitioning
13
Reducing Values Problem
Clustering problem
14
Rounding
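$$y(i) = \operatorname{int}\left( x(i) / 10^{k} \right)$$
$$\text{if } \operatorname{mod}\left( x(i), 10^{k} \right) \ge 10^{k} / 2 \text{ then } y(i) = y(i) + 1$$
$$x(i) = y(i) \cdot 10^{k}$$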
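The rounding rule (truncate to a multiple of 10^k, round up when the remainder is at least half of 10^k, then scale back) could look like this in Python. The function name `round_to_k` and the restriction to non-negative integers are assumptions made for the sketch.

```python
def round_to_k(x, k):
    """Round a non-negative integer x to the nearest multiple of 10**k."""
    unit = 10 ** k
    y = x // unit                # y = int(x / 10^k)
    if 2 * (x % unit) >= unit:   # mod(x, 10^k) >= 10^k / 2, kept in integers
        y += 1
    return y * unit              # x = y * 10^k


# Smoothing a feature: many distinct values collapse onto a few rounded ones
values = [1234, 1250, 1999, 7]
rounded_hundreds = [round_to_k(v, 2) for v in values]
```

A larger k leaves fewer distinct values in the column, which is the point of the reduction.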
15
K-Mean Clustering
16
Class Entropy
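$$ent(k) = -\sum_i \Pr(C_i) \cdot \log \Pr(C_i)$$
$$Err = \sum_k ent(k) \cdot \frac{n(k)}{N}$$
(presumably n(k) is the number of cases with value k and N is the total number of cases)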
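A small sketch of the class-entropy measure: the entropy of the class labels within one value (or interval) of a feature, and the case-weighted sum over all values. The slide does not state the base of the logarithm, so base 2 is an assumption here, as are the function names and the toy labels.

```python
import math
from collections import Counter


def class_entropy(labels):
    """ent(k) = -sum_i Pr(C_i) * log2 Pr(C_i) over the labels in one interval."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_entropy(intervals):
    """Err = sum_k ent(k) * n(k) / N, where intervals is a list of label lists."""
    total = sum(len(labels) for labels in intervals)
    return sum(class_entropy(labels) * len(labels) / total for labels in intervals)


# Toy example (made up): one mixed interval and one pure interval
mixed = ["a", "a", "b", "b"]
pure = ["a", "a", "a", "a"]
err = weighted_entropy([mixed, pure])
```

A lower weighted entropy means the reduced set of values keeps the classes better separated.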
17
How many Cases?
An appropriate sample size
Complexity
Closely tied to the prediction method
A reasonable solution within a short time
Case reduction!!
Basic approach (random sampling)
– Incremental samples
– Average samples
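One way to picture incremental sampling: train on random subsets of increasing size and stop once the error no longer improves by a meaningful amount. The sketch below assumes that reading. The `evaluate` callback, the `sizes` schedule, the `tolerance` stopping rule, and the toy error curve are all hypothetical stand-ins, not taken from the slides.

```python
import random


def incremental_sampling(cases, evaluate, sizes, tolerance, seed=0):
    """Evaluate growing random subsets until the error stops improving.

    evaluate:  hypothetical callback that trains on a subset and returns an error.
    sizes:     increasing subset sizes to try.
    tolerance: minimum improvement required to keep growing the sample.
    Returns the list of (size, error) pairs that were evaluated.
    """
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # random sampling without replacement
    history = []
    previous = None
    for size in sizes:
        error = evaluate(shuffled[:size])
        history.append((size, error))
        if previous is not None and previous - error < tolerance:
            break  # the larger sample did not help enough
        previous = error
    return history


# Toy stand-in for "train and measure error": error shrinks like 1/sqrt(n)
def toy_error(subset):
    return 1 / len(subset) ** 0.5


history = incremental_sampling(
    range(1000), toy_error, sizes=(100, 200, 500, 1000), tolerance=0.028
)
sizes_tried = [size for size, _ in history]
```

In this toy run the full data set is never used, because the step from 200 to 500 cases already improves the error by less than the tolerance.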
18
A Single Sample
19
Incremental Samples
20
Average Samples
Can reduce variance error without adding bias
Best solution approach
21
Specialized Techniques
Sequential sampling over time
– time-dependent data
– optimize the trade-off between the sampling period and feature measuring
Strategic sampling of key events
– net change > threshold (regression)
Adjusting prevalence
– repeat cases for a low-prevalence class
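Adjusting prevalence by repeating cases can be sketched as simple replication of the rare class. The function name, the `factor` parameter, and the toy labels below are assumptions made for illustration.

```python
from collections import Counter


def replicate_rare_class(cases, labels, rare_label, factor):
    """Repeat every case of the rare class `factor` times; keep other cases once."""
    out_cases, out_labels = [], []
    for case, label in zip(cases, labels):
        copies = factor if label == rare_label else 1
        out_cases.extend([case] * copies)
        out_labels.extend([label] * copies)
    return out_cases, out_labels


# Toy data (made up): 8 negative cases and 2 positive cases
cases = list(range(10))
labels = ["neg"] * 8 + ["pos"] * 2

new_cases, new_labels = replicate_rare_class(cases, labels, "pos", factor=4)
counts = Counter(new_labels)
```

Replication adds no new information; it only changes how much weight the prediction method gives to the low-prevalence class.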