Statistics-Based Data Analysis Using R (R을 이용한 통계기반 데이터 분석), 2017, 윤형기 ([email protected]), Version 3 (revised for lectures)

Upload: haliem

Post on 24-Apr-2018


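Schedule (Day 1 to Day 4)

Morning sessions
• Introduction: big data background and concepts; big data platforms
• Data analysis concepts and process 1: CRISP-DM; analysis strategy (goals, hypotheses, and metric framework); analysis tools
• Basic statistical theory: descriptive and inferential statistics
• Overview of data collection: Excel, SQL/NoSQL
• Analysis process 2: modeling overview; bias-variance trade-off; resampling
• Statistical modeling 3: nonlinear models; linear algebra and multivariate analysis; data cleaning and EDA (theory and practice)
• Machine learning 3: neural networks; clustering; association analysis
• Model development 3: model evaluation and performance tuning

Afternoon sessions
• Lab environment setup (R, RStudio); R basics
• R data structures; writing functions
• Statistical analysis with R: modeling 1
• Statistical modeling 2: regression; model selection and regularization; time series analysis
• Machine learning 1: KNN; decision trees
• Machine learning 2: SVM; Naïve Bayes
• Visualization
• Big data platforms: Hadoop, Spark
• Wrap-up: cloud, deep learning (DL)

Course themes: big data concepts and analysis platforms; data analysis concepts and modeling; statistical analysis; machine learning; the R language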

Day 3

Multivariate Analysis

http://www.openwith.net

Start from here!

Basics of Linear Algebra

Matrix

• Square matrix
– A matrix that has the same number of rows as columns
• Transpose
– The matrix $A^T$ obtained by turning the rows of $A$ into columns
• Matrix product
• Identity matrix
– $AI = IA = A$
• Orthogonal matrix
– A matrix $A$ is orthogonal if $AA^T = A^TA = I$
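The matrix definitions on these slides can be checked numerically. The following is a minimal base R sketch; the matrices `A` and `Q` are illustrative values chosen for this example and do not come from the slides.

```r
# Illustrative 2 x 2 matrix (values are an assumption for this sketch)
A <- matrix(c(1, 2,
              3, 4), nrow = 2, byrow = TRUE)

At <- t(A)        # transpose: rows become columns
I2 <- diag(2)     # 2 x 2 identity matrix
AI <- A %*% I2    # multiplying by the identity returns A

# A rotation matrix is a standard example of an orthogonal matrix
theta <- pi / 6
Q <- matrix(c(cos(theta), -sin(theta),
              sin(theta),  cos(theta)), nrow = 2, byrow = TRUE)
is_orthogonal <- isTRUE(all.equal(Q %*% t(Q), I2)) &&
  isTRUE(all.equal(t(Q) %*% Q, I2))
print(is_orthogonal)
```

Note that `%*%` is the matrix product in R, while `*` multiplies element by element.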

Vectors

• Concept
– A vector can be viewed as a point; the number of components is its dimension
• Length
– $\|v\| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$
• Vector operations
– Addition
– Scalar multiplication: if $v = [3, 6, 8, 4]$, then $1.5v = 1.5 \cdot [3, 6, 8, 4] = [4.5, 9, 12, 6]$
– Inner product (= dot product = scalar product)
• Orthogonality
– Two vectors are orthogonal (perpendicular) when their inner product is 0
• Normal vector
– A vector of length 1 (unit vector)
• Orthonormal vectors
– Vectors of unit length that are orthogonal to each other
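The vector operations above map directly onto base R arithmetic. This sketch reuses the slide's vector `v = [3, 6, 8, 4]`; the vectors `a` and `b` are illustrative values added here to show orthogonality.

```r
v <- c(3, 6, 8, 4)

scaled     <- 1.5 * v           # scalar multiplication: 4.5 9 12 6
vec_length <- sqrt(sum(v^2))    # length (Euclidean norm) of v
u          <- v / vec_length    # normalized (unit-length) version of v

# Two illustrative vectors whose inner product is zero
a <- c(1, 2)
b <- c(-2, 1)
inner <- sum(a * b)             # inner (dot) product

print(scaled)
print(inner)
```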

Eigenvectors and Eigenvalues

• Eigenvector
– A nonzero vector $\vec{v}$ that satisfies $A\vec{v} = \lambda\vec{v}$, where $A$ is a square matrix, $\vec{v}$ is the eigenvector, and $\lambda$ is the eigenvalue
• Finding eigenvectors and eigenvalues
• Eigendecomposition
– Diagonalization by eigenvalue decomposition (possible only for square matrices)
– Matrix product with a diagonal matrix
• SVD (singular value decomposition)
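As a rough illustration of eigenvalues, eigenvectors, and diagonalization, the sketch below uses base R's `eigen()`. The symmetric matrix `A` is an assumed example (the worked example on the slide did not survive extraction).

```r
# Illustrative symmetric matrix (an assumption for this sketch)
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

e      <- eigen(A)
lambda <- e$values     # eigenvalues, largest first
V      <- e$vectors    # unit-length eigenvectors in the columns

# Each column v of V satisfies A v = lambda v
residual <- A %*% V - V %*% diag(lambda)

# Eigendecomposition: A = V D V^{-1}, with D the diagonal matrix of eigenvalues
A_rebuilt <- V %*% diag(lambda) %*% solve(V)

print(lambda)
```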

Multivariate Statistical Analysis

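Statistics and the Vector Concept

• Raw-score, deviation-score, and standard-score vectors
– "Centered": the raw score $X$ minus the mean $\bar{X}$ (deviation score)
– "Centered & scaled": the centered score divided by the standard deviation $s$ (standard score $z$)
• Standard deviation as a vector concept
– Standard deviation of variable $X$: $s = \sqrt{\dfrac{\sum (X - \bar{X})^2}{N - 1}} = \dfrac{\sqrt{\sum (X - \bar{X})^2}}{\sqrt{N - 1}}$
– The numerator is the length of the deviation-score vector, so it reflects the variability of the variable
– Length of the deviation-score vector and the standard deviation: $\|X - \bar{X}\| = \sqrt{N - 1}\, s$
– In other words, z-standardization gives every variable vector the same length, $\sqrt{N - 1}$

Subject   Raw score (X)   Deviation score (X - mean)   Standard score (z)
1         15              0                            0
2         12              -3                           -1
3         18              3                            1
Mean      15              0                            0
s         3               3                            1

Source: 박광배, 『다변량분석』, 학지사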
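The relationship between the standard deviation and the length of the deviation-score vector can be verified with the three raw scores from the table (15, 12, 18). A minimal base R sketch:

```r
x <- c(15, 12, 18)          # raw scores from the slide's table
N <- length(x)

d <- x - mean(x)            # deviation scores: 0 -3 3
s <- sd(x)                  # sample standard deviation (divides by N - 1)
z <- d / s                  # standard scores: 0 -1 1

len_d <- sqrt(sum(d^2))     # length of the deviation-score vector
len_z <- sqrt(sum(z^2))     # length of the standard-score vector

print(c(s = s, len_d = len_d, len_z = len_z))
```

The deviation-score vector has length `sqrt(N - 1) * s`, and the standardized vector has length `sqrt(N - 1)`, as stated on the slide.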

• Correlation coefficient as a vector concept
– $\|z_A\| = \sqrt{(0)^2 + (-1)^2 + (1)^2} = 1.414$
– $\|z_B\| = \sqrt{(0.92)^2 + (-1.06)^2 + (0.13)^2} \approx 1.414$
– The correlation coefficient of the two variables is the cosine of the angle between their standard-score vectors: $r = \cos\theta$

Subject   A    B    z_A   z_B
1         15   21   0     0.92
2         12   16   -1    -1.06
3         18   19   1     0.13

• Linear combinations and data analysis
– Projection point
– The angle of variable C, that is, of the linear-combination axis C, is determined by the composite weights

Subject   Variable A   Variable B   Variable C = 2A + 0.8B
1         1            3            4.4
2         2            2            5.6

• Standardization of the weights
– $\dfrac{2}{\sqrt{2^2 + 0.8^2}} = 0.9285$ and $\dfrac{0.8}{\sqrt{2^2 + 0.8^2}} = 0.3714$; their squares sum to 1
– Also, $\cos\theta_A = 0.9285$ and $\cos\theta_B = 0.3714$
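The claim that the correlation coefficient equals the cosine of the angle between two standard-score vectors can be checked with the slide's variables A and B. The sketch below also normalizes the composite weights (2, 0.8); only base R is used.

```r
A <- c(15, 12, 18)
B <- c(21, 16, 19)

zA <- as.numeric(scale(A))   # standard scores (sd computed with N - 1)
zB <- as.numeric(scale(B))

len_zA <- sqrt(sum(zA^2))
len_zB <- sqrt(sum(zB^2))

# Cosine of the angle between the two standard-score vectors
cos_theta <- sum(zA * zB) / (len_zA * len_zB)

# Standardized composite weights for C = 2A + 0.8B
w      <- c(2, 0.8)
w_unit <- w / sqrt(sum(w^2))

print(c(cos_theta = cos_theta, pearson_r = cor(A, B)))
print(round(w_unit, 4))
```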

• SSCP matrix
– Sum of Squares and Cross-Products
– $\mathrm{SSCP} = A'A$ (also written $X'X$)

$$
A'A =
\begin{bmatrix} 1 & 2 & 3 \\ -4 & -6 & -2 \\ 3 & 9 & 6 \end{bmatrix}
\begin{bmatrix} 1 & -4 & 3 \\ 2 & -6 & 9 \\ 3 & -2 & 6 \end{bmatrix}
=
\begin{bmatrix} 14 & -22 & 39 \\ -22 & 56 & -78 \\ 39 & -78 & 126 \end{bmatrix}
$$
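The SSCP example can be reproduced in one line with `crossprod()`, which computes `t(A) %*% A`. The matrix below is the slide's `A`.

```r
A <- matrix(c(1, -4, 3,
              2, -6, 9,
              3, -2, 6), nrow = 3, byrow = TRUE)

sscp <- crossprod(A)   # same as t(A) %*% A
print(sscp)
```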

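• Variance-covariance matrix
– Variance
– Covariance
– In the example:
  • Subtract each column mean from the elements of $A$, then compute the SSCP of the centered matrix
  • SSCP / (number of rows of $A$) = variance-covariance matrix

$$
\frac{1}{3}
\begin{bmatrix} -1 & 0 & 1 \\ 0 & -2 & 2 \\ -3 & 3 & 0 \end{bmatrix}
\begin{bmatrix} -1 & 0 & -3 \\ 0 & -2 & 3 \\ 1 & 2 & 0 \end{bmatrix}
=
\begin{bmatrix} 0.667 & 0.667 & 1 \\ 0.667 & 2.667 & -2 \\ 1 & -2 & 6 \end{bmatrix}
$$

• Correlation matrix
– Standardize the elements of $A$ column by column, then compute the SSCP
– (Every element of that SSCP matrix) / (number of rows of $A$) = correlation matrix

$$
\frac{1}{3}
\begin{bmatrix} -1.225 & 0 & 1.225 \\ 0 & -1.225 & 1.225 \\ -1.225 & 1.225 & 0 \end{bmatrix}
\begin{bmatrix} -1.225 & 0 & -1.225 \\ 0 & -1.225 & 1.225 \\ 1.225 & 1.225 & 0 \end{bmatrix}
=
\begin{bmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & -0.5 \\ 0.5 & -0.5 & 1 \end{bmatrix}
$$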
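The two matrix products above can be reproduced step by step. This sketch follows the slide in dividing by the number of rows N (the population form); R's built-in `cov()` divides by N - 1 instead, so the two differ by the factor N / (N - 1), while the correlation matrix is the same either way.

```r
A <- matrix(c(1, -4, 3,
              2, -6, 9,
              3, -2, 6), nrow = 3, byrow = TRUE)
n <- nrow(A)

# Center each column, then SSCP / N gives the variance-covariance matrix
centered <- scale(A, center = TRUE, scale = FALSE)
vcov_pop <- crossprod(centered) / n

# Standardize each column (population sd), then SSCP / N gives correlations
pop_sd <- sqrt(diag(vcov_pop))
Z      <- sweep(centered, 2, pop_sd, "/")
corr   <- crossprod(Z) / n

print(round(vcov_pop, 3))
print(round(corr, 3))
```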

Extending the Concept of Distance

• Distance of a point from the mean in univariate space: $|x_i - \bar{x}|$
• Euclidean distance
• Measuring distance in multivariate data
– $\sqrt{(x_i - \bar{x})^2 + (y_i - \bar{y})^2 + \cdots + (n_i - \bar{n})^2}$
• Euclidean distance
– Definition
– Limitation: the variables usually have some degree of covariance, which Euclidean distance does not take into account

The Concept of Distance

• Distance between numeric data points
– Minkowski distance: $d(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$
– Euclidean distance: the case $p = 2$
– Manhattan distance: the case $p = 1$
– Mahalanobis distance
• Other
– Distance between categorical data points: Hamming distance, Jaccard, ...
– Distance between sequences (strings, time series)
• Other related concepts
– z-transform, Pearson correlation, ...
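The distance measures listed above can be compared on a small example. The helper `minkowski()` and the points `x` and `y` are hypothetical names and values introduced for this sketch; `dist()` and `mahalanobis()` are base R functions (note that `mahalanobis()` returns the squared distance).

```r
# Hypothetical helper: Minkowski distance of order p
minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

x <- c(1, 2)
y <- c(4, 6)

euclid    <- minkowski(x, y, p = 2)   # Euclidean distance
manhattan <- minkowski(x, y, p = 1)   # Manhattan distance

# Cross-check against the built-in dist()
d_builtin <- as.numeric(dist(rbind(x, y), method = "euclidean"))

# With an identity covariance matrix, Mahalanobis reduces to Euclidean
m_squared <- mahalanobis(x, center = y, cov = diag(2))

print(c(euclid = euclid, manhattan = manhattan, mahalanobis = sqrt(m_squared)))
```

With a non-identity covariance matrix the Mahalanobis distance rescales and decorrelates the coordinates first, which is the motivation given on the following slides.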

Variance-Covariance Matrix

Covariance and Distance

• It would be easier to calculate distance if we could rescale the coordinates so that they no longer had any covariance

Distances as Vectors

• Distances in coordinate space can be described as vectors

MVA Techniques

Technique                            Dependence type    Exploratory/Confirmatory
Factor Analysis                      Interdependence    Exploratory, Confirmatory
MDS                                  Interdependence    Exploratory
Cluster Analysis                     Interdependence    Exploratory
Canonical Correlation                Dependence         Confirmatory
SEM (Structural Equation Modeling)   Dependence         Confirmatory
ANOVA                                Dependence         Confirmatory
Discriminant Analysis                Dependence         Confirmatory
Logit Choice Model                   Dependence         Confirmatory

Source: J.M. Lattin et al., Analyzing Multivariate Data

PCA, SVD, and Discriminant Analysis


Singular Value Decomposition (SVD)

• Three points of view
– A method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items
– A method for identifying and ordering the dimensions along which the data points exhibit the most variation
– A method for data reduction
• Significance of SVD
– Dimensionality reduction: it exposes the substructure of the original data more clearly and orders it from the most variation to the least
• Typical application: NLP
– Identify the main relationships latent in the documents while ignoring variation below a given threshold, which reduces the dimensionality drastically
• Method
– A rectangular matrix $A$ can be broken down into the product of three matrices: an orthogonal matrix $U$, a diagonal matrix $S$, and the transpose of an orthogonal matrix $V$, that is, $A_{m \times n} = U_{m \times m} S_{m \times n} V^T_{n \times n}$
– where $U^TU = I$ and $V^TV = I$
– The columns of $U$ are orthonormal eigenvectors of $AA^T$
– The columns of $V$ are orthonormal eigenvectors of $A^TA$
– $S$ is a diagonal matrix containing the square roots of the nonzero eigenvalues shared by $AA^T$ and $A^TA$, in descending order

• To find $U$, start with $AA^T$
– Find the eigenvalues and corresponding eigenvectors of $AA^T$
– The eigenvectors become the column vectors of a matrix, ordered by the size of the corresponding eigenvalue: the eigenvector for $\lambda = 12$ is column 1, and the eigenvector for $\lambda = 10$ is column 2
– Convert this matrix into an orthogonal matrix by applying the Gram-Schmidt orthonormalization process to the column vectors, starting by normalizing $v_1$
• To find $V$, start with $A^TA$
– For $\lambda = 12$: ... $v_1 = [1, 2, 1]$
– For $\lambda = 10$: ... $v_2 = [2, -1, 0]$
– For $\lambda = 0$: ... $v_3 = [1, 2, -5]$
– Apply the Gram-Schmidt orthonormalization process
• For $S$
– Take the square roots of the nonzero eigenvalues and populate the diagonal with them
– Put the largest in $s_{11}$, the next largest in $s_{22}$, and so on until the smallest value ends up in $s_{mm}$
– Add a zero column vector to $S$ so that it can be multiplied between $U$ and $V$
– The diagonal entries of $S$ are the singular values of $A$, the columns of $U$ are the left singular vectors, and the columns of $V$ are the right singular vectors
• Now we have all the pieces of the puzzle
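The procedure above can be sketched with base R's `eigen()` and `svd()`. The matrix used on the slides was lost in extraction; the 2 x 3 matrix `A` below is an assumption, chosen because it reproduces the eigenvalues (12, 10, 0) and the eigenvectors [1, 2, 1], [2, -1, 0], [1, 2, -5] quoted in the text.

```r
# Assumed example matrix (consistent with the eigenvalues quoted on the slides)
A <- matrix(c( 3, 1, 1,
              -1, 3, 1), nrow = 2, byrow = TRUE)

ev_left  <- eigen(A %*% t(A))$values   # eigenvalues of A A^T
ev_right <- eigen(t(A) %*% A)$values   # eigenvalues of A^T A

s <- svd(A)                            # s$u, s$d, s$v
singular_values <- s$d                 # square roots of the nonzero eigenvalues

# Reassemble A = U S V^T from the pieces
A_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)

# First right singular vector is [1, 2, 1] / sqrt(6), up to sign
v1_alignment <- abs(sum(s$v[, 1] * c(1, 2, 1) / sqrt(6)))

print(singular_values)
```

`svd()` returns the compact form (here `u` is 2 x 2 and `v` is 3 x 2), so the zero column added to $S$ by hand on the slides is not needed.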

LDA (Linear Discriminant Analysis)

[Figure: taxonomy of classification methods]
• Frequency table: ZeroR, OneR, Naive Bayesian, Decision tree
• Covariance matrix: LDA, Logistic regression
• Similarity functions: KNN
• Others: ANN, SVM

LDA

• LDA: pick a new dimension that gives
– Maximum separation between the means of the projected classes
– Minimum variance within each projected class
• Solution: eigenvectors based on the between-class and within-class covariance matrices

PCA vs. LDA

• LDA is not guaranteed to be better for classification
– It assumes that the classes are unimodal Gaussians
– It fails when the discriminatory information is not in the mean but in the variance of the data
• Example where PCA gives a better projection: (figure)

• Models the differences between groups in order to separate two or more classes, objects, or categories
– Much like logit and probit models
• LDA seeks to reduce dimensionality while preserving as much of the (two-)class discriminatory information as possible
• Example
– Assume $D$-dimensional samples, $N_1$ of which belong to class $w_1$ and $N_2$ to class $w_2$
– Obtain a scalar $y$ by projecting the samples $x$ onto a line: $y = w^Tx$
– Of all possible lines, select the one that maximizes the separability of the scalars
• Examples
– Discriminating high school students who
  • will go to college
  • will go to trade school
  • will discontinue education
– Some pattern must be there, so we collect
  • family background
  • academic information
– Discriminating whether a person is male or female based on height

Theory

• Discrimination by comparing the means of variables
• Can have several variables
• Assumes multivariate normality; the independent variables need to be continuous
• Homogeneous variance
• Create an equation that minimizes the possibility of misclassifying cases into their respective groups or categories
– $D = a_1X_1 + a_2X_2 + \cdots + a_iX_i + b$
– where
  • $D$ = discriminant function
  • $X$ = response score for that variable
  • $a$ = discriminant coefficient (analogous to a regression coefficient)
  • $b$ = constant
  • $i$ = number of discriminant variables

• LDA is based on searching for a linear combination of variables (predictors) that best separates two classes (targets)
• To capture the notion of separability, Fisher defined a score function
• Given the score function, the problem is to estimate the linear coefficients that maximize the score, which can be solved by the following equations:
– $\beta = C^{-1}(\mu_1 - \mu_2)$ (model coefficients)
– $C = \dfrac{1}{n_1 + n_2}(n_1C_1 + n_2C_2)$ (pooled covariance matrix)
  » $\beta$: linear model coefficients
  » $C_1, C_2$: covariance matrices
  » $\mu_1, \mu_2$: mean vectors
• Mahalanobis distance between two groups
– $\Delta^2 = \beta^T(\mu_1 - \mu_2)$
– $\Delta$: Mahalanobis distance between the two groups
– A distance greater than 3 means that the two averages differ by more than 3 standard deviations
– This means that the overlap (probability of misclassification) is quite small
• Finally, a new point is classified by projecting it onto the maximally separating direction and classifying it as $c_1$ if:
– $\beta^T\left(x - \dfrac{\mu_1 + \mu_2}{2}\right) > -\log\dfrac{P(c_1)}{P(c_2)}$
  • $\beta$: coefficient vector
  • $x$: data vector
  • $\mu_1, \mu_2$: mean vectors
  • $P(c_1), P(c_2)$: class probabilities
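The two-class LDA equations can be sketched in a few lines of base R. The data in `X1` and `X2` are made-up values for illustration, and the within-class covariance matrices are computed with `cov()` (divisor N - 1), which is an assumption since the slides do not say which divisor is intended. The threshold is written as log(P(c2)/P(c1)), the form that follows from Bayes' rule; with equal class sizes it is zero.

```r
# Made-up two-class data (two predictors per observation)
X1 <- matrix(c(2, 3,  3, 3,  4, 5,  5, 5), ncol = 2, byrow = TRUE)
X2 <- matrix(c(6, 1,  7, 2,  8, 2,  9, 3), ncol = 2, byrow = TRUE)

n1 <- nrow(X1); n2 <- nrow(X2)
mu1 <- colMeans(X1); mu2 <- colMeans(X2)

# Pooled covariance matrix and model coefficients
C    <- (n1 * cov(X1) + n2 * cov(X2)) / (n1 + n2)
beta <- solve(C, mu1 - mu2)              # C^{-1} (mu1 - mu2)

# Squared Mahalanobis distance between the two group means
delta2 <- sum(beta * (mu1 - mu2))

# Class probabilities estimated from the class sizes
p1 <- n1 / (n1 + n2); p2 <- n2 / (n1 + n2)

# Hypothetical helper: assign a new point to class c1 or c2
classify <- function(x) {
  score <- sum(beta * (x - (mu1 + mu2) / 2))
  if (score > log(p2 / p1)) "c1" else "c2"
}

print(beta)
print(sqrt(delta2))
print(classify(c(3, 4)))
```

For real work, `MASS::lda()` fits the same kind of model and handles more than two classes.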

LDA Example

• A bank discriminating the default risk of its customers (small and medium-sized businesses)
– Clients who defaulted (red squares) and those who did not (blue circles), separated by delinquent days (DAYSDELQ) and number of months in business (BUSAGE)
– Discriminant model using LDA (default and non-default)
– Data (number of observations = 100)

BUSAGE   DAYSDELQ   DEFAULT   Z   Z-Z0   Prediction
87       13         N
89       27         Y
...

• We use LDA to find an optimal linear model that best separates the two classes (default and non-default)
– $Z_0 = 0.3985302$, $\log(P(N)/P(Y)) = 0.4771212547$ (this value is the base-10 logarithm of 3)
• The first step is to calculate the mean (average) vectors, covariance matrices, and class probabilities
• Then we calculate the pooled covariance matrix and finally the coefficients of the linear model

• Assume we have a point with BUSAGE = 111 and DAYSDELQ = 24
– $x = [111 \;\; 24]$
• Decision rule: $\beta^T\left(x - \dfrac{\mu_1 + \mu_2}{2}\right) > -\log\dfrac{p(c_1)}{p(c_2)}$
– $\beta^T$: coefficient vector
– $x$: data vector
– $\dfrac{\mu_1 + \mu_2}{2}$: mean vector
– $p(c_1), p(c_2)$: class probabilities
• With the example values:
– $\begin{bmatrix} -0.0095 & -0.1408 \end{bmatrix}\left(\begin{bmatrix} 111 \\ 24 \end{bmatrix} - \dfrac{\begin{bmatrix} 116.23 \\ 16.89 \end{bmatrix} + \begin{bmatrix} 115.04 \\ 55.32 \end{bmatrix}}{2}\right) \overset{?}{>} -\log\dfrac{0.75}{0.25}$
– The left-hand side is approximately 1.75, which exceeds the right-hand side, so the point is assigned to class $c_1$
• A Mahalanobis distance of 2.32 shows a small overlap between the two groups, which means a good separation between the classes by the linear model
– $\Delta^2 = \beta^T(\mu_1 - \mu_2) = 5.40$
– $\Delta = 2.32$
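The numbers in this worked example can be checked directly. The sketch below uses only the values quoted on the slides (coefficients, mean vectors, class probabilities, and the new point). The threshold is written as log(P(c2)/P(c1)), which is the form that follows from Bayes' rule; with these numbers the point falls in class c1 under either sign convention, because the left-hand side is larger than log(3) as well.

```r
beta <- c(-0.0095, -0.1408)     # coefficient vector from the slide
x    <- c(111, 24)              # new point: BUSAGE = 111, DAYSDELQ = 24
mu1  <- c(116.23, 16.89)        # mean vector of class c1
mu2  <- c(115.04, 55.32)        # mean vector of class c2
p1   <- 0.75                    # class probabilities from the slide
p2   <- 0.25

lhs <- sum(beta * (x - (mu1 + mu2) / 2))   # projected, centered score
rhs <- log(p2 / p1)                        # threshold from Bayes' rule
assigned <- if (lhs > rhs) "c1" else "c2"

delta2 <- sum(beta * (mu1 - mu2))          # squared Mahalanobis distance
delta  <- sqrt(delta2)

print(round(c(lhs = lhs, rhs = rhs, delta2 = delta2, delta = delta), 2))
print(assigned)
```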

PCA

1. Concept

• Significance
– PCA: an orthogonal projection of highly correlated variables onto principal components
  • The linear transformation is defined in such a way that the first principal component has the largest possible variance
  • Principal components (PCs): a set of values of linearly uncorrelated variables
• Applications
– High-dimensional data
– Image processing, text processing, stock data, ...
– Describing such data in a simpler way

Covariance

• For 2-dimensional data: cov(x, y)
• For 3-dimensional data: cov(x, y), cov(x, z), cov(y, z)
• For an $n$-dimensional data set there are $\dfrac{n!}{(n-2)! \cdot 2}$ different covariance values
• So the covariance matrix for a data set with $n$ dimensions is the $n \times n$ matrix whose $(i, j)$ entry is the covariance between dimension $i$ and dimension $j$
• Eigenvector?
– A nonzero vector that, after being multiplied by the matrix, remains parallel to the original vector
– To keep eigenvectors standard, we usually scale each one to length 1, so that all eigenvectors have the same length
– [Figure: a non-eigenvector and an eigenvector]

• Procedure
– Step 1: Get and prepare the data
– Step 2: Subtract the mean
– Step 3: Calculate the covariance matrix
– Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix
– Step 5: Choose components and form a feature vector
– Step 6: Derive the new data set
• Step 5: Choosing components and forming a feature vector
– The eigenvector with the highest eigenvalue is the principal component of the data set
– The remaining components can be dropped
• Step 6: Deriving the new data set
– FinalData = RowFeatureVector × RowDataAdjust
– RowFeatureVector = the matrix with the eigenvectors in the columns, transposed, so that the most significant eigenvector is at the top
– RowDataAdjust = the mean-adjusted data, transposed, i.e., the data items are in the columns and each row holds a separate dimension
– The original data are transformed in terms of the vectors we chose
– The patterns are the lines that most closely describe the relationships between the data
• Getting the old data back
• Biplot
– Shows the proportions of each variable along the two PCs
• Scree plot
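The PCA steps listed above can be traced by hand in base R and compared with the built-in `prcomp()`. The built-in `iris` measurements serve as example data here; that choice is an assumption of this sketch, not the data set used on the slides.

```r
X <- as.matrix(iris[, 1:4])                       # Step 1: get the data

Xc  <- scale(X, center = TRUE, scale = FALSE)     # Step 2: subtract the mean
C   <- cov(Xc)                                    # Step 3: covariance matrix
eig <- eigen(C)                                   # Step 4: eigenvectors/values

feature_vector <- eig$vectors[, 1:2]              # Step 5: keep two components
scores <- Xc %*% feature_vector                   # Step 6: new data set

# Share of the total variance carried by each component (scree-plot values)
explained <- eig$values / sum(eig$values)

pc <- prcomp(X)                                   # built-in equivalent
print(round(explained, 3))
```

The sign of each eigenvector is arbitrary, so the hand-computed scores can differ from `pc$x` by a factor of -1 per column. `biplot(pc)` and `screeplot(pc)` draw the two plots named on the slide.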

EDA (Exploratory Data Analysis)

• Main uses
– 1. Gain insight into the data set
– 2. Identify the critical variables that affect the data and understand their relationships
– 3. Check whether outliers are present
– 4. Test the underlying assumptions of the data set
• Data analysis: exploratory vs. confirmatory
– Confirmatory data analysis
  • tests a hypothesis
  • settles questions
  • (inferential statistics)
– Exploratory data analysis
  • finds a good description
  • raises new questions
  • (descriptive statistics)

• Exploratory Data Analysis (EDA)
– An approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
  • maximize insight into a data set;
  • uncover underlying structure;
  • extract important variables;
  • detect outliers and anomalies;
  • test underlying assumptions;
  • develop parsimonious models; and
  • determine optimal factor settings.

Three Approaches

• For classical analysis, the sequence is
– Problem => Data => Model => Analysis => Conclusions
• For EDA, the sequence is
– Problem => Data => Analysis => Model => Conclusions
• For Bayesian analysis, the sequence is
– Problem => Data => Model => Prior Distribution => Analysis => Conclusions

• Data analysis process: visualization and EDA
– [Figure: data analysis workflow. Source: Wickham and Grolemund]

• Data munging
– Transforming data
– Raw data to usable data
– Data must be cleaned first
• Main tasks
– Renaming variables
– Data type conversion
– Encoding, decoding, or recoding data
– Merging data sets
– Transforming data
– Handling missing data (imputing)
– Handling anomalous values

1. Technical Aspects

• (1) Reading data
– read.table and its relatives: read.delim, read.delim2, read.csv, read.csv2, read.fwf
– A freshly read data.frame should always be inspected with functions like head, str, and summary
• (2) Type conversion
– Coercion: as.numeric, as.logical, as.integer, as.factor, as.character, as.ordered
– Factor conversion: factor()
– Date conversion: library(lubridate)
• (3) Strings and encoding
– Sys.getlocale("LC_CTYPE")
– f <- file("myUTF16file.txt", encoding = "UTF-16")
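A short sketch of the read, inspect, and coerce workflow described above. The inline CSV text and the column names are made up for illustration; `as.Date()` from base R stands in for lubridate so that the example has no package dependencies.

```r
# Made-up CSV content, read from a string instead of a file
csv_text <- "name,age,group\nKim,34,a\nLee,NA,b\nPark,41,a"
person <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(person)       # always inspect a freshly read data.frame
summary(person)

# Type conversion (coercion)
person$group <- factor(person$group)              # character -> factor
age_chr <- c("34", "41", "n/a")
age_num <- suppressWarnings(as.numeric(age_chr))  # unparseable text -> NA

# Date conversion with base R
d <- as.Date("2017-03-15")
print(format(d, "%Y-%m"))
```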

2. Consistent Data

• (1) Missing values
– na.rm = TRUE
– (persons_complete <- na.omit(person))
• (2) Special values (example)
    is.special <- function(x) {
      if (is.numeric(x)) !is.finite(x) else is.na(x)
    }
• (3) Outliers
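The three consistency checks can be tried on small vectors. The data below are made up, and `boxplot.stats()` is one common base R way to flag outliers; the slides do not say which outlier rule they have in mind, so that choice is an assumption.

```r
x <- c(1, 2, NA, Inf, 5, NaN)

# (1) Missing values: na.rm drops NA and NaN, but Inf still propagates
mean_na_rm <- mean(x, na.rm = TRUE)

# (2) Special values: flag NA, NaN, Inf, and -Inf in numeric vectors
is.special <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}
special    <- is.special(x)
mean_clean <- mean(x[!special])

# (3) Outliers: values beyond 1.5 times the hinge spread from the box
y <- c(10, 12, 11, 13, 12, 95)
outliers <- boxplot.stats(y)$out

print(special)
print(outliers)
```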

3. Correction

• Imputation
    library(Hmisc)             # provides impute() and is.imputed()
    x <- 1:5                   # create a vector...
    x[2] <- NA                 # ...with an empty value
    x <- impute(x, mean)       # replace NA with the mean of the observed values
    x
    ##     1     2     3     4     5
    ##  1.00 3.25*  3.00  4.00  5.00
    is.imputed(x)

    # -- ratio imputation; y is not defined on the slide and is assumed to be
    # -- a fully observed covariate of the same length as x
    I <- is.na(x)
    R <- sum(x[!I]) / sum(y[!I])
    x[I] <- R * y[I]

    # -- model-based imputation with a linear regression
    data(iris)
    iris$Sepal.Length[1:10] <- NA
    model <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
    I <- is.na(iris$Sepal.Length)
    iris$Sepal.Length[I] <- predict(model, newdata = iris[I, ])

dplyr Basics

• Six main functions
– Pick observations by their values (filter())
– Reorder the rows (arrange())
– Pick variables by their names (select())
– Create new variables with functions of existing variables (mutate())
– Collapse many values down to a single summary (summarise())
– Plus group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group by group
• Usage
– The first argument is a data frame
– The subsequent arguments describe what to do with the data frame, using the variable names (without quotes)
– The result is a new data frame
• Logical operations in dplyr's filter(): (figure)

Tidy Data Set

• Rules for a tidy dataset:
– Each variable must have its own column
– Each observation must have its own row
– Each value must have its own cell
• Example: gathering the year columns of table4a (gather() comes from the tidyr package)

    > table4a
    # A tibble: 3 × 3
      country     `1999` `2000`
    * <chr>        <int>  <int>
    1 Afghanistan    745   2666
    2 Brazil       37737  80488
    3 China       212258 213766

    > table4a %>%
    +   gather(`1999`, `2000`, key = "year", value = "cases")
    # A tibble: 6 × 3
      country     year   cases
      <chr>       <chr>  <int>
    1 Afghanistan 1999     745
    2 Brazil      1999   37737
    3 China       1999  212258
    4 Afghanistan 2000    2666
    5 Brazil      2000   80488
    6 China       2000  213766

Relational Data

• A primary key
– Uniquely identifies an observation in its own table
– (ex) planes$tailnum is a primary key
• A foreign key
– Uniquely identifies an observation in another table
– (ex) flights$tailnum is a foreign key
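Primary and foreign keys can be illustrated with two tiny tables and base R's `merge()`. The data frames below are made-up stand-ins that only borrow the names `planes`, `flights`, and `tailnum` from the slide; they are not the real data sets.

```r
# Made-up stand-ins for the planes and flights tables
planes  <- data.frame(tailnum = c("N1", "N2", "N3"),
                      seats   = c(150, 180, 200))
flights <- data.frame(flight  = 1:4,
                      tailnum = c("N1", "N1", "N3", "N9"))

# A primary key must be unique within its own table
is_primary_key <- anyDuplicated(planes$tailnum) == 0

# The foreign key flights$tailnum links each flight to one plane (left join)
joined <- merge(flights, planes, by = "tailnum", all.x = TRUE)

# Foreign-key values with no match in the primary-key table
unmatched <- setdiff(flights$tailnum, planes$tailnum)

print(joined)
print(unmatched)
```

A foreign key may repeat (several flights use the same plane) and may fail to match, which shows up as `NA` after the left join.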

Visualization (v.0.9)

Overview

• Base graphics
– plot()
– hist() and boxplot()
– points(), lines(), text(), mtext(), axis(), rug(), identify()
• Specialized packages:
– http://www.computerworld.com/article/2921176/business-intelligence/great-r-packages-for-data-import-wrangling-visualization.html

Lattice Graphics

• Lattice = a flavour of trellis graphics
– For lattice, graphics formulae are mandatory
– grid = a low-level graphics system; it was used to build lattice
• Lattice vs. base graphics
– xyplot() vs. plot()
– plot() gives a graph as a side effect of the command
– xyplot() generates a graphics object
  • When this object is output to the command line, it is "printed", i.e., a graph appears

Graph type                        Description                  Formula example
barchart                          bar chart                    x~A or A~x
bwplot                            boxplot                      x~A or A~x
cloud                             3D scatterplot               z~x*y|A
contourplot                       3D contour plot              z~x*y
densityplot                       kernel density plot          ~x|A*B
dotplot                           dotplot                      ~x|A
histogram                         histogram                    ~x
levelplot                         3D level plot                z~y*x
parallelplot (formerly parallel)  parallel coordinates plot    data frame
splom                             scatterplot matrix           data frame
stripplot                         strip plots                  A~x or x~A
xyplot                            scatterplot                  y~x|A
wireframe                         3D wireframe graph           z~y*x
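The difference between base graphics (drawing as a side effect) and lattice (returning a graphics object) can be seen in a short sketch. It assumes the lattice package is installed, which is normally the case because it ships with R as a recommended package, and uses the built-in `mtcars` data as an example.

```r
library(lattice)

# xyplot() returns a "trellis" object; nothing is drawn yet
p_scatter <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
                    xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# bwplot() uses the same formula interface for boxplots
p_box <- bwplot(mpg ~ factor(cyl), data = mtcars)

# The object can be stored, modified, and printed later
p_scatter2 <- update(p_scatter, main = "mpg vs. weight by cylinders")

# print(p_scatter2) would draw the graph on the active device
print(class(p_scatter))
```

Inside functions, loops, or scripts run with `source()`, the object is not auto-printed, so an explicit `print()` call is needed to see the plot.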

ggplot2

• Characteristics of ggplot2
– (Strength) Consistent underlying grammar of graphics
– (Limitations) Things you cannot do:
• 3-dimensional graphics
• Graph-theory type graphs (nodes/edges layout)

• Anatomy of a plot:
– data
– aesthetic mapping
– geometric object
– statistical transformations
– scales
– coordinate system
– position adjustments
– faceting

• ggplot2 vs. base graphics: ggplot2
– is more verbose for simple / canned graphics
– is less verbose for complex / custom graphics
– does not have methods (data should always be in a data.frame)
– uses a different system for adding plot elements


• Geometric Objects and Aesthetics
– Aesthetic Mapping
• In ggplot2, an aesthetic means "something you can see":
– position (i.e., on the x and y axes)
– color ("outside" color)
– fill ("inside" color)
– shape (of points)
– linetype
– size
• Aesthetic mappings are set with aes()

– Geometric Objects
• = actual marks we put on a plot
– points (geom_point, for scatter plots, dot plots, etc.)
– lines (geom_line, for time series, trend lines, etc.)
– boxplot (geom_boxplot, for, well, boxplots!)
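A minimal sketch of aesthetics and geoms working together. It assumes the ggplot2 package is installed and uses the built-in `mtcars` data; the variable choices are illustrative only.

```r
library(ggplot2)

# Aesthetic mapping: position (x, y) and color are tied to data columns
p_points <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +                       # geometric object: points
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

# Same data, different geom: fill is the "inside" color of each box
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot() +
  labs(x = "Cylinders", y = "Miles per gallon", fill = "Cylinders")

# As with lattice, the plot is an object that is drawn when printed
pdf(NULL)
print(p_points)
print(p_box)
invisible(dev.off())
```

Plot elements are added with `+`, which is the "different system for adding plot elements" compared with the low-level calls of base graphics.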

Machine Learning Modeling

88


Machine Learning?

• Concept
– = subfield of Artificial Intelligence (AI)
– “construction and study of systems that can learn from data”

• Types
– http://en.wikipedia.org/wiki/Machine_learning

• Methodology
– “A computer program learns from experience (E) with some class of tasks (T) and a performance measure (P) if its performance at tasks in T as measured by P improves with E”

89


• Terminology
– Features
• = distinct traits that can be used to describe each item in a quantitative manner.

– Samples
• an item to process (e.g., classify).
• a document, a picture, a sound, a video, a row in a database or CSV file, etc.

– Feature vector
• an n-dimensional vector of numerical features that represent some object.

– Feature extraction
• Preparation of the feature vector
– transforms the data in the high-dimensional space to a space of fewer dimensions.

– Training/Evaluation set
• Set of data used to discover potentially predictive relationships.

90

Learning vs. Training

91

Procedure

92

Types

• Supervised Learning

• Unsupervised Learning

• Semi-Supervised Learning

• Reinforcement Learning
– allows the machine or software agent to learn its behavior based on feedback from the environment.
– This behavior can be learned once and for all, or keep adapting as time goes by.

93

Types of ML Algorithms

• Predictive model
– = seeks to discover or model the relationship between the target variable and the other features
– = supervised learning: the learner is given clear instruction on what to learn and how (note: it is not a person but the target values that provide a supervisory role to find …)

• Descriptive model
– No target to learn; no single feature is more important than any other
– (ex) Pattern discovery (market basket analysis, clustering)

94

Algorithm overview (table)
Column headers: Supervised L(earning), Unsupervised L(earning), Other, Applications, Remarks
Rows: NN, (Naïve) Bayes, Decision Tree, (Classification Rule Learner), Linear Regression, Model Tree, Neural Net, SVM, AR, K-means, ...
Applications listed: Marketing analysis, …

95


KNN

96


KNN

• = classify unlabeled examples by assigning them to the class of the most similar labeled example

• [Example] Assigning a tomato to a class through blind testing

97


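• Distance calculation
– Euclidean distance
– Manhattan distance

• NN
– 1NN: the single nearest neighbor is an orange, so the unlabeled example is classified as a fruit
– 3NN: vote among the 3 nearest neighbors

• Choosing an appropriate value of K

            Large K               Small K
Effect      reduces variance      reduces bias
Risk        underfitting          overfitting
Note                              with a single neighbor (K = 1), one outlier can decide the class

In practice: the choice depends on the complexity of the concept to be learned and the number of training examples.

101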
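The distance calculation and the neighbor vote can be sketched in a few lines of base R. The food items, their sweetness and crunchiness scores, and the query point are assumed illustrative values, not data taken from the slides. `knn_vote` is a hypothetical helper written for this sketch.

```r
# Hypothetical labeled examples scored on two features (1-10 scale)
train <- data.frame(
  name        = c("orange", "grape", "nuts", "green bean", "carrot", "bacon"),
  sweetness   = c(7, 8, 3, 3, 7, 1),
  crunchiness = c(3, 5, 6, 7, 10, 4),
  class       = c("fruit", "fruit", "protein", "vegetable", "vegetable", "protein"),
  stringsAsFactors = FALSE
)

# Unlabeled example to classify (assumed scores for a tomato)
query <- c(sweetness = 6, crunchiness = 4)

feats <- as.matrix(train[, c("sweetness", "crunchiness")])
diffs <- sweep(feats, 2, query)            # subtract the query from every row

euclid    <- sqrt(rowSums(diffs^2))        # Euclidean distance
manhattan <- rowSums(abs(diffs))           # Manhattan distance

# Majority vote among the k nearest labeled examples
knn_vote <- function(dist, labels, k) {
  nearest <- order(dist)[seq_len(k)]
  votes   <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

pred_1nn <- knn_vote(euclid, train$class, k = 1)
pred_3nn <- knn_vote(euclid, train$class, k = 3)
```

With these assumed scores the nearest neighbor is the orange, and two of the three nearest neighbors are fruits, so both votes come out as "fruit".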

Procedure

• Data preparation
– Preprocessing: transform features to a standard range
• Min-max normalization (shrinking/rescaling to a fixed range)
• Z-score standardization
– For nominal features, use dummy coding
• For ordinal data, however, assign numbers and then normalize

• Characteristics
– Lazy learning: no abstraction, no generalization
– Instead, instance-based learning
– Non-parametric learning
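The preparation steps can be sketched on the built-in `iris` data (an assumed stand-in; the slides do not name a data set). `normalize` is a hypothetical helper, and `knn()` comes from the class package, which ships with standard R distributions. The split sizes and k = 5 are arbitrary choices for illustration.

```r
library(class)

# Min-max normalization: rescale a numeric vector to the range [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

iris_n <- as.data.frame(lapply(iris[1:4], normalize))  # min-max version
iris_z <- as.data.frame(scale(iris[1:4]))              # z-score version

# Dummy coding of a nominal feature (hypothetical temperature labels)
temp    <- c("hot", "cold", "medium", "hot")
is_hot  <- ifelse(temp == "hot", 1, 0)
is_cold <- ifelse(temp == "cold", 1, 0)   # "medium" is the reference level

# Lazy learner: knn() stores the training rows and classifies at query time
set.seed(1)
idx  <- sample(nrow(iris), 100)
pred <- knn(train = iris_n[idx, ], test = iris_n[-idx, ],
            cl = iris$Species[idx], k = 5)
accuracy <- mean(pred == iris$Species[-idx])
```

No model is fitted in advance: the call to knn() takes the training rows and the test rows together, which is what "lazy" and "instance-based" mean here.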

102

Applications

• Voronoi Diagram
– Decision surface induced by the training examples

103

BAYESIAN METHODS AND NAÏVE BAYES

104

Basic Concepts

• Background:
– the estimated likelihood of an event should be based on the evidence at hand.

• Probability

105

• Conditional probability - Bayes’ theorem

– Prior probability
• the most reasonable guess would be the probability that any prior message was spam (~ 20 % in the example).

– Likelihood
• Probability that the word Viagra was used in previous spam messages

– Marginal likelihood
• Probability that Viagra appeared in any message at all.

• Posterior probability
– Use Bayes' theorem to compute the posterior probability that a message is spam
– IF (posterior > 50 %) THEN the message is likely to be spam and should be filtered.

106

• Applying Bayes’ theorem

• Frequency table

• P(spam ∩ Viagra) = P(Viagra|spam) * P(spam) = (4/20) * (20/100) = 0.04

• P(spam|Viagra) = P(Viagra|spam) * P(spam) / P(Viagra) = (4/20) * (20/100) / (5/100) = 0.80
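The same arithmetic in R, as a small sketch. The counts are the ones implied by the fractions above (100 messages, 20 spam, 5 containing the word, 4 of those spam); the variable names are assumptions.

```r
n_total       <- 100  # all messages
n_spam        <- 20   # spam messages
n_viagra      <- 5    # messages containing the word
n_viagra_spam <- 4    # spam messages containing the word

prior      <- n_spam / n_total          # P(spam)
likelihood <- n_viagra_spam / n_spam    # P(Viagra | spam)
marginal   <- n_viagra / n_total        # P(Viagra)

joint     <- likelihood * prior         # P(spam and Viagra)
posterior <- joint / marginal           # P(spam | Viagra)

is_spam <- posterior > 0.5              # filtering rule from the slide
```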

107


Naïve Bayes 분류

• Extending the spam mail example
– Train by constructing a likelihood table for the appearance of 4 words:
– Probability calculation (ex) Viagra=Y, Money=N, Groceries=N, Unsubscribe=Y
– Use class-conditional independence to simplify the calculation

108

– likelihood of spam:
• (4/20) * (10/20) * (20/20) * (12/20) * (20/100) = 0.012
• Probability of spam: 0.012 / (0.012 + 0.002) = 0.857

– likelihood of ham:
• (1/80) * (66/80) * (71/80) * (23/80) * (80/100) = 0.002
• Probability of ham: 0.002 / (0.012 + 0.002) = 0.143

– => expect that the message is spam with probability 85.7 % and ham with probability 14.3 %.
• That is, “this message is 6 times more likely to be spam than ham.”

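– The probability of level L for class C, given the evidence provided by features F1 ~ Fn, is:

$$P(C_L \mid F_1, \ldots, F_n) = \frac{1}{Z}\, p(C_L) \prod_{i=1}^{n} p(F_i \mid C_L)$$

where 1/Z is a scaling factor that converts the result into a probability.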
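The spam/ham calculation above can be reproduced as a sketch in R. The fractions are the ones on the slide; the variable names are assumptions. Working with unrounded likelihoods gives a spam probability of about 0.851, slightly below the 0.857 obtained from the rounded values 0.012 and 0.002.

```r
# Conditional probabilities for Viagra=Y, Money=N, Groceries=N, Unsubscribe=Y
p_feat_spam <- c(4/20, 10/20, 20/20, 12/20)
p_feat_ham  <- c(1/80, 66/80, 71/80, 23/80)

prior_spam <- 20/100
prior_ham  <- 80/100

# Class-conditional independence: multiply the per-feature probabilities
lik_spam <- prod(p_feat_spam) * prior_spam
lik_ham  <- prod(p_feat_ham)  * prior_ham

# Normalize so the two class probabilities sum to 1
p_spam <- lik_spam / (lik_spam + lik_ham)
p_ham  <- lik_ham  / (lik_spam + lik_ham)

round(c(lik_spam = lik_spam, lik_ham = lik_ham, p_spam = p_spam, p_ham = p_ham), 3)
```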

109

• Laplace estimator
– (IF) the message contains: Viagra, Groceries, Money, Unsubscribe.
• Likelihood of spam under the naive Bayes algorithm:
– (4/20) * (10/20) * (0/20) * (12/20) * (20/100) = 0
• And the likelihood of ham is:
– (1/80) * (14/80) * (8/80) * (23/80) * (80/100) = 0.00005
• probability of spam is: 0 / (0 + 0.00005) = 0, probability of ham is: 0.00005 / (0 + 0.00005) = 1

– (Solution) Laplace estimator
• Add a small number (typically 1) to each count in the frequency table to ensure that each feature has a nonzero probability of occurring with each class.

• Using numeric features with Naïve Bayes
– By discretizing/binning
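A sketch of both ideas in base R. `class_lik` is a hypothetical helper. The smoothing convention used here (add the Laplace value to each count and add it once per feature to the class total) is one common textbook choice and is an assumption, since the slide does not spell out the denominator. The message-hour feature in the binning part is invented for illustration.

```r
# Word counts per class for Viagra, Money, Groceries, Unsubscribe (from the slide)
spam_counts <- c(viagra = 4, money = 10, groceries = 0, unsubscribe = 12)
ham_counts  <- c(viagra = 1, money = 14, groceries = 8, unsubscribe = 23)

# Likelihood of one class, with optional Laplace smoothing (assumed convention)
class_lik <- function(counts, n_class, prior, laplace = 0) {
  prod((counts + laplace) / (n_class + laplace * length(counts))) * prior
}

# Without smoothing, a single zero count wipes out the spam likelihood
spam_raw <- class_lik(spam_counts, 20, 20/100)
ham_raw  <- class_lik(ham_counts,  80, 80/100)

# With a Laplace value of 1 every feature keeps a nonzero probability
spam_lap <- class_lik(spam_counts, 20, 20/100, laplace = 1)
ham_lap  <- class_lik(ham_counts,  80, 80/100, laplace = 1)
p_spam_lap <- spam_lap / (spam_lap + ham_lap)

# Numeric feature -> categorical bins, here a hypothetical hour-of-day feature
hour <- c(3, 9, 14, 23)
hour_bin <- cut(hour, breaks = c(0, 6, 12, 18, 24),
                labels = c("night", "morning", "afternoon", "evening"),
                right = FALSE)
```

Under this convention the smoothed spam probability comes out at roughly 0.8 instead of 0.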

110

Case Study – Mobile Phone Spam Filtering

• Spam filter for SMS

• Data: sms_spam.csv

• Package: uses tm
– Corpus(): creates a corpus (= a collection of text documents)
– VectorSource(): tells Corpus() to use the messages in the vector sms_train$text

• ** tm package **
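A minimal sketch of the corpus step, assuming the tm package is installed. The four messages below are made-up stand-ins for the rows of sms_spam.csv, which is not loaded here; only the `sms_train$text` name follows the slide.

```r
library(tm)

# Hypothetical training messages (the real data would come from sms_spam.csv)
sms_train <- data.frame(
  type = c("spam", "ham", "spam", "ham"),
  text = c("free entry to win a prize now",
           "are we still meeting for lunch today",
           "claim your free cash reward",
           "see you at home tonight"),
  stringsAsFactors = FALSE
)

# VectorSource() wraps the character vector; Corpus() builds the collection
sms_corpus <- Corpus(VectorSource(sms_train$text))

# Document-term matrix: one row per message, one column per term
sms_dtm <- DocumentTermMatrix(sms_corpus)
dim(sms_dtm)
```

The resulting matrix of word counts is the kind of input from which a naive Bayes frequency table can be built.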

111