4 Data Reduction. Song Sang-ok (송상옥), School of Applied Chemistry. Outline: the need for data reduction ...

1

4 Data Reduction

Song Sang-ok (송상옥), School of Applied Chemistry

2

Outline

The need for data reduction
The role and forms of dimension reduction
Concrete methods for dimension reduction

3

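Why is it needed?

Too much data
– exceeds the capacity of the prediction program
– delays the time needed to find a solution

An appropriate amount of data
– depends on the complexity of the concepts contained in the data (the complexity of the model)
– cannot be known before mining
– Ex) random data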

4

The Role of Dimension Reduction

5

Forms of Dimension Reduction

Delete a column (feature)
Delete a row (case)
Reduce the number of values in a column (smooth a feature)
Transformation to a new data set (PCA)
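The first three forms can be sketched on a toy table. This is a minimal illustration only: the table values, the helper names, and the grid-snapping rule used for smoothing are all assumptions, not part of the original slides.

```python
# Toy table (made-up values): each row is a case, each column a feature.
rows = [
    [5.1, 120, 0],
    [4.9, 131, 1],
    [6.2, 118, 0],
    [5.8, 127, 1],
]

def delete_column(table, col):
    """Drop one feature (column) from every case."""
    return [row[:col] + row[col + 1:] for row in table]

def delete_row(table, idx):
    """Drop one case (row)."""
    return table[:idx] + table[idx + 1:]

def smooth_column(table, col, step):
    """Reduce the number of distinct values in a column by snapping to a grid."""
    return [row[:col] + [round(row[col] / step) * step] + row[col + 1:]
            for row in table]

print(delete_column(rows, 1))
print(delete_row(rows, 0))
print(smooth_column(rows, 1, 10))
```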

6

Best Features Selection

Impossible!
– Search space
– Computational time

Approximation
– promising subsets
– simple distance measure
– using only training error
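To see why an exhaustive search is out of reach, the sketch below counts the non-empty feature subsets that would have to be scored. The function names are hypothetical and the feature counts are arbitrary examples.

```python
from itertools import combinations

def count_subsets(n_features):
    """Number of non-empty feature subsets an exhaustive search must score."""
    return 2 ** n_features - 1

def enumerate_subsets(features):
    """Yield every non-empty subset; only feasible for a handful of features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

for n in (5, 10, 20, 50):
    print(n, count_subsets(n))
```

The count doubles with every added feature, which is why the slide falls back on approximations.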

7

Mean and Variance

Cases: a sample from some distribution
Spreadsheet: mean and variance
But the distribution is unknown

Heuristic feature selection guidance

8

Independent Features

Classification problem

k-class classification
– k pairwise comparisons

Regression = pseudo-classification

$$se(A-B) = \sqrt{\frac{var(A)}{n_1} + \frac{var(B)}{n_2}}$$

$$\frac{|mean(A) - mean(B)|}{se(A-B)} > sig$$
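The test can be sketched for one feature split by class: A holds the feature values of the cases in one class, B those of the other. Assumptions in this sketch: `statistics.variance` (the sample variance) stands in for var, the default `sig=2.0` is an illustrative threshold, and both groups have non-zero variance.

```python
from math import sqrt
from statistics import mean, variance

def feature_significance(a_values, b_values):
    """|mean(A) - mean(B)| / se(A - B) for one feature split by class."""
    se = sqrt(variance(a_values) / len(a_values)
              + variance(b_values) / len(b_values))
    return abs(mean(a_values) - mean(b_values)) / se

def is_significant(a_values, b_values, sig=2.0):
    """Keep the feature if the class means differ by more than sig standard errors."""
    return feature_significance(a_values, b_values) > sig

a = [1, 2, 3, 4, 5]
print(feature_significance(a, [6, 7, 8, 9, 10]), is_significant(a, [6, 7, 8, 9, 10]))
print(feature_significance(a, [2, 3, 4, 5, 6]), is_significant(a, [2, 3, 4, 5, 6]))
```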

9

Distance Based Selection

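Independent analysis + correlation analysis: detects redundancy

Distance measure

$$D_M = (M_1 - M_2)(C_1 + C_2)^{-1}(M_1 - M_2)^T$$

– Independent features

$$D_M = \sum_i \frac{(m_1(i) - m_2(i))^2}{var_1(i) + var_2(i)}$$

Branch-and-Bound Algorithm

$$D_M(F) \le D_M(F, i)$$

(adding feature i to the subset F never decreases the distance)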
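For the independent-feature case the distance reduces to a per-feature sum, which the sketch below computes from two lists of cases (one list per class). The function name and the toy data are assumptions, and the sample variance is used for var.

```python
from statistics import mean, variance

def independent_distance(class1, class2):
    """Sum over features of (m1 - m2)^2 / (var1 + var2), assuming independent features."""
    total = 0.0
    # zip(*rows) turns a list of cases into a sequence of feature columns.
    for col1, col2 in zip(zip(*class1), zip(*class2)):
        total += (mean(col1) - mean(col2)) ** 2 / (variance(col1) + variance(col2))
    return total

class1 = [[1, 10], [2, 11], [3, 12]]
class2 = [[4, 10], [5, 12], [6, 11]]
print(independent_distance(class1, class2))
```

In this toy data only the first feature separates the classes; the second contributes zero, so dropping it would leave the distance unchanged.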

10

Heuristic Feature Selection

Comparison measures
– Significance test
– D_M
– F-test

11

Principal Components

Merging features
– a new set of fewer columns: $S' = SP$

First k components
First principal component
– minimum Euclidean distance

Feature with a large variance
– excellent chances for separation of classes or groups of case values
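A minimal sketch of extracting the first principal component with power iteration on the sample covariance matrix, in plain Python. Power iteration is one possible way to obtain the component, not necessarily the method the slides had in mind; the function name, start vector, and iteration count are assumptions.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Asymmetric start vector; power iteration fails if the start is exactly
    # orthogonal to the leading direction or if the data have zero variance.
    v = [1.0] + [0.5] * (d - 1)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered case onto the component: one new column replaces d.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

direction, scores = first_principal_component([[1, 1], [2, 2], [3, 3], [4, 4]])
print(direction, scores)
```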

12

Decision Trees

Dynamic logic approach
– coordinated with searching for a solution

Advantageous in large feature spaces

Recursive partitioning

13

Reducing Values Problem

Clustering problem

14

Rounding

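$$y(i) = \mathrm{int}\left(x(i) / 10^k\right)$$

$$\text{if } \mathrm{mod}(x(i), 10^k) \ge 10^k / 2 \text{ then } y(i) = y(i) + 1$$

$$x(i) = y(i) \cdot 10^k$$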
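The rounding rule translates almost line by line into code. This sketch assumes non-negative values, since truncation and `mod` behave differently for negative numbers in Python; the function name is hypothetical.

```python
def round_values(xs, k):
    """Round each non-negative value to the nearest multiple of 10**k."""
    base = 10 ** k
    out = []
    for x in xs:
        y = int(x // base)          # y(i) = int(x(i) / 10^k)
        if x % base >= base / 2:    # remainder at least half of 10^k: round up
            y += 1
        out.append(y * base)        # x(i) = y(i) * 10^k
    return out

print(round_values([1234, 1250, 87], 2))
```

A larger k leaves fewer distinct values in the column.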

15

K-Means Clustering
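As a rough illustration of k-means used for value reduction, the sketch below clusters the values of a single feature and replaces each value with its nearest center. The deterministic initialization, the function names, and the toy values are assumptions; the sketch needs k no larger than the number of values.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster one feature's values into k groups and return the centers."""
    svals = sorted(values)
    # Deterministic start (an assumption): spread the centers over the sorted values.
    centers = [svals[(2 * i + 1) * len(svals) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Recompute each center as its cluster mean; keep empty clusters in place.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def replace_with_centers(values, centers):
    """Reduce the number of distinct values by mapping each to its nearest center."""
    return [min(centers, key=lambda c: abs(v - c)) for v in values]

values = [1, 2, 3, 10, 11, 12, 20, 21, 22]
centers = kmeans_1d(values, 3)
print(centers, replace_with_centers(values, centers))
```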

16

Class Entropy

$$ent(k) = -\sum_i \Pr(C_i) \cdot \log \Pr(C_i)$$

$$Err = \sum_k ent(k) \cdot \frac{n(k)}{N}$$
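A small sketch of the entropy measure, reading ent(k) as the class entropy of interval k, n(k) as the number of cases in that interval, and N as the total number of cases. That reading, the base-2 logarithm, and the function names are assumptions.

```python
from collections import Counter
from math import log2

def interval_entropy(labels):
    """Class entropy of one interval, given the class labels of its cases."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def entropy_error(intervals):
    """Weighted sum of interval entropies: sum_k ent(k) * n(k) / N."""
    total = sum(len(labels) for labels in intervals)
    return sum(interval_entropy(labels) * len(labels) / total
               for labels in intervals)

# Two intervals of a discretized feature: one pure, one evenly mixed.
print(entropy_error([["a", "a"], ["a", "b"]]))
```

A lower value means the intervals separate the classes more cleanly.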

17

How many Cases?

Appropriate sample size
– complexity
– closely tied to the prediction method
– an adequate solution in a short time

Case reduction!

Basic approach (random sampling)
– Incremental samples
– Average samples
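The incremental approach can be sketched as follows: evaluate on growing random subsets and stop once the result stops changing. Here `evaluate` is a hypothetical callback that trains on a sample and returns an error estimate, and the stopping rule and `tolerance` value are assumptions.

```python
import random

def incremental_samples(cases, sizes, evaluate, tolerance=0.01, seed=0):
    """Evaluate on growing random subsets; stop when the score stops changing."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    history = []
    previous = None
    for size in sizes:
        score = evaluate(shuffled[:size])   # each sample extends the previous one
        history.append((size, score))
        if previous is not None and abs(score - previous) < tolerance:
            break                           # more cases no longer help
        previous = score
    return history

# Stand-in evaluator: error shrinks as 1 / sample size.
print(incremental_samples(list(range(1000)), [10, 100, 500, 1000],
                          lambda sample: 1.0 / len(sample)))
```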

18

A Single Sample

19

Incremental Samples

20

Average Samples

Can reduce variance error without introducing additional bias

Best Solution Approach
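Averaging over samples might look like the sketch below: fit one model per random sample and average their predictions. `fit` is a hypothetical callback that returns a callable model, and the sample count and size are arbitrary assumptions.

```python
import random
from statistics import mean

def averaged_prediction(cases, fit, x, n_samples=10, sample_size=50, seed=0):
    """Average the predictions of models fitted on several random samples."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_samples):
        sample = rng.sample(cases, sample_size)   # one random sample of cases
        model = fit(sample)                       # one solution per sample
        predictions.append(model(x))
    return mean(predictions)

# Stand-in learner: always predicts the mean target of its sample.
def fit_mean(sample):
    target_mean = mean(y for _, y in sample)
    return lambda x: target_mean

cases = [(i, 2.0 * i) for i in range(100)]
print(averaged_prediction(cases, fit_mean, x=0))
```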

21

Specialized Techniques

Sequential sampling over time
– Time-dependent data
– Optimize the trade-off between the sampling period and feature measuring

Strategic sampling of key events
– Net change > threshold (regression)

Adjusting prevalence
– Repeat cases of low-prevalence classes
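Two of these techniques can be sketched briefly. Both functions are hypothetical: the first reads "net change" as the change since the last kept reading and assumes a non-empty series; the second repeats each rare-class case a fixed number of times.

```python
def sample_key_events(series, threshold):
    """Keep a reading only when it differs from the last kept one by more than threshold."""
    kept = [series[0]]
    for value in series[1:]:
        if abs(value - kept[-1]) > threshold:
            kept.append(value)
    return kept

def adjust_prevalence(cases, label_of, rare_label, copies):
    """Repeat every case of the low-prevalence class `copies` times."""
    out = []
    for case in cases:
        out.extend([case] * (copies if label_of(case) == rare_label else 1))
    return out

print(sample_key_events([10, 10.2, 10.1, 12, 12.1, 9], threshold=1))
print(adjust_prevalence([("a", 0), ("b", 1), ("c", 0)], lambda c: c[1], 1, 3))
```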
