
Statistics-Based Data Analysis with R

2017

윤형기 ([email protected])

Version 3 (revised for lecture use)

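Schedule

Day 1, Day 2, Day 3, Day 4

Morning

Introduction: big data background/concepts, big data platforms

Data analysis concepts and procedure 1: CRISP-DM, analysis strategy (goals and hypotheses/indicator framework), analysis tools

Statistics fundamentals: descriptive statistics / inferential statistics

Data collection overview: Excel, SQL/NoSQL

Analysis procedure 2: modeling overview, bias-variance trade-off, resampling

Statistical modeling 3: nonlinear models, linear algebra and multivariate analysis, data cleaning and EDA (theory and practice)

Machine learning 3: neural networks, clustering, association analysis

Model development 3 (model evaluation, performance tuning): model evaluation, model performance tuning

Afternoon

Lab environment setup (R, RStudio), R basics

R data structures, writing functions

Using R: statistical modeling 1

Statistical modeling 2: regression analysis, model selection and regularization, time series analysis

Machine learning 1: KNN, decision trees

Machine learning 2: SVM, Naïve Bayes

Visualization

Big data platforms: Hadoop, Spark

Wrap-up: cloud, DL

Big data concepts and analysis platforms; data analysis concepts and modeling; statistical analysis; machine learning; the R language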

Day 3

Multivariate Analysis

http://www.openwith.net

Start from here!

Fundamentals of Linear Algebra

Matrix

– Square Matrix

• a matrix that has the same number of rows as columns

– Transpose

• created by converting the rows of a matrix into columns

• Matrix multiplication

• Identity matrix – $AI = A$

• Orthogonal Matrix – a matrix $A$ is orthogonal if $AA^T = A^TA = I$.

Vectors

• Concept – vectors are points; the number of components gives the dimension

• Length

• Vector operations – Addition

– Scalar Multiplication • if $v = [3, 6, 8, 4]$, then $1.5 \cdot v = 1.5 \cdot [3, 6, 8, 4] = [4.5, 9, 12, 6]$

– Inner Product • = dot product = scalar product

• Orthogonality – two vectors are orthogonal (perpendicular) when their inner product = 0

• Normal Vector

• Orthonormal Vectors – vectors of unit length that are orthogonal to each other

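Eigenvectors and Eigenvalues

• Eigenvector – a nonzero vector $\vec{v}$ that satisfies

$A\vec{v} = \lambda\vec{v}$

where $A$ = square matrix, $\vec{v}$ = eigenvector, $\lambda$ = eigenvalue

• Finding eigenvectors and eigenvalues

• Eigendecomposition – diagonalization via eigendecomposition (possible only for square matrices)

– matrix product with a diagonal matrix

• SVD (singular value decomposition)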
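The eigenvector relation and the two decompositions named above can be checked numerically. The following is a minimal sketch using base R's eigen() and svd(); the 2×2 matrix is an arbitrary illustration chosen here, not one taken from the slides.

```r
# Symmetric 2x2 matrix chosen only for illustration (not from the slides)
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

e <- eigen(A)          # eigenvalues in decreasing order, unit-length eigenvectors
lambda <- e$values
V <- e$vectors         # columns are the eigenvectors

# Defining relation A v = lambda v, checked for the first eigenpair
lhs <- A %*% V[, 1]
rhs <- lambda[1] * V[, 1]

# Eigendecomposition (diagonalization): A = V diag(lambda) V^{-1}
A_rebuilt <- V %*% diag(lambda) %*% solve(V)

# SVD of the same matrix: A = U diag(d) V^T
s <- svd(A)
A_from_svd <- s$u %*% diag(s$d) %*% t(s$v)
```

Comparing A_rebuilt and A_from_svd with A (for example via all.equal()) confirms that both factorizations reproduce the original matrix.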

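Multivariate Statistical Analysis

Statistics and Vector Concepts

• Raw score vector, deviation score vector, standard score vector – "centered" = the raw score $X$ minus the mean $\bar{X}$

– "centered & scaled" = centered score / standard deviation ($s$) → standard score ($z$)

– Standard deviation as a vector concept

• Standard deviation of variable X: $s = \sqrt{\dfrac{\sum (X-\bar{X})^2}{N-1}} = \dfrac{\sqrt{\sum (X-\bar{X})^2}}{\sqrt{N-1}}$

– the numerator is the length of the deviation score vector → it reflects the variability of the variable

• Relation between the deviation score vector length and the standard deviation: $\lVert X - \bar{X} \rVert = \sqrt{N-1}\; s$

• That is, z-standardization sets the length of every variable vector to $\sqrt{N-1}$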

Subject         Raw score ($X$)   Deviation score ($X-\bar{X}$)   Standard score ($z$)
1               15                0                               0
2               12                -3                              -1
3               18                3                               1
Mean $\bar{X}$  15                0                               0
s               3                 3                               1

Source: 박광배, 『다변량분석』, 학지사

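– Correlation coefficient as a vector concept

– $\lVert z_A \rVert = \sqrt{(0)^2+(-1)^2+(1)^2} = 1.414$

– $\lVert z_B \rVert = \sqrt{(0.92)^2+(-1.06)^2+(0.13)^2} \approx 1.414$

» That is, the correlation coefficient of two variables is $r = \cos\theta$, where $\theta$ is the angle between their standard score vectors

• Linear combinations and data analysis

– Projection point – variable C, i.e. the angle of the linear-combination axis C → composite weights

• Standardization

• $\dfrac{2}{\sqrt{2^2+0.8^2}} = 0.9285$, $\dfrac{0.8}{\sqrt{2^2+0.8^2}} = 0.3714$ → the sum of their squares = 1

• Also, $\cos\theta_A = 0.9285$, $\cos\theta_B = 0.3714$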

Subject   A    B
1         15   21
2         12   16
3         18   19

Subject   ZA   ZB
1         0    0.92
2         -1   -1.06
3         1    0.13
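The claim that a correlation coefficient is the cosine of the angle between two standard score vectors can be verified on the A and B scores tabulated above. This is a small sketch in base R; the helper name len is introduced here for illustration only.

```r
A <- c(15, 12, 18)
B <- c(21, 16, 19)

# scale() centers each variable and divides by the sample SD (N - 1 denominator)
zA <- as.vector(scale(A))
zB <- as.vector(scale(B))

len <- function(v) sqrt(sum(v^2))   # Euclidean length of a vector (hypothetical helper)

len_zA <- len(zA)                   # both lengths equal sqrt(N - 1)
len_zB <- len(zB)

# Cosine of the angle between the two standard score vectors
cos_theta <- sum(zA * zB) / (len_zA * len_zB)

# Pearson correlation computed the usual way
r <- cor(A, B)
```

Both cos_theta and r come out at about 0.60, and both vector lengths equal sqrt(2), i.e. sqrt(N - 1) for N = 3.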

Subject   Variable A   Variable B   Variable C = 2A + 0.8B
1         1            3            4.4
2         2            2            5.6

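• SSCP matrix – Sum-of-Squares and Cross-Products

• $SSCP = A'A$ (written $X'X$ for a data matrix $X$)

• $A'A = \begin{bmatrix} 1 & 2 & 3 \\ -4 & -6 & -2 \\ 3 & 9 & 6 \end{bmatrix} \begin{bmatrix} 1 & -4 & 3 \\ 2 & -6 & 9 \\ 3 & -2 & 6 \end{bmatrix} = \begin{bmatrix} 14 & -22 & 39 \\ -22 & 56 & -78 \\ 39 & -78 & 126 \end{bmatrix}$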

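• Variance-Covariance Matrix – Variance

– Covariance

– In the example • subtract each column mean from the elements of A, then compute the SSCP of the centered matrix

• SSCP / (number of rows of A) → Variance-Covariance Matrix

• $\dfrac{1}{3} \begin{bmatrix} -1 & 0 & 1 \\ 0 & -2 & 2 \\ -3 & 3 & 0 \end{bmatrix} \begin{bmatrix} -1 & 0 & -3 \\ 0 & -2 & 3 \\ 1 & 2 & 0 \end{bmatrix} = \begin{bmatrix} 0.667 & 0.667 & 1 \\ 0.667 & 2.667 & -2 \\ 1 & -2 & 6 \end{bmatrix}$

• Correlation Matrix – standardize the elements of A column by column, then compute the SSCP

– (every element of that SSCP matrix) / (number of rows of A) → Correlation Matrix

– $\dfrac{1}{3} \begin{bmatrix} -1.225 & 0 & 1.225 \\ 0 & -1.225 & 1.225 \\ -1.225 & 1.225 & 0 \end{bmatrix} \begin{bmatrix} -1.225 & 0 & -1.225 \\ 0 & -1.225 & 1.225 \\ 1.225 & 1.225 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & -0.5 \\ 0.5 & -0.5 & 1 \end{bmatrix}$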
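The three matrices of this example (SSCP, variance-covariance, correlation) can be reproduced in base R. The sketch below follows the slide's convention of dividing by the number of rows N; note that R's own cov() divides by N − 1 instead, so it is scaled differently, while cor() is unaffected because the scaling cancels.

```r
A <- matrix(c(1, -4, 3,
              2, -6, 9,
              3, -2, 6), nrow = 3, byrow = TRUE)
n <- nrow(A)

# Raw SSCP: A'A
sscp <- t(A) %*% A

# Center each column, then SSCP / N gives the variance-covariance matrix
Ac <- scale(A, center = TRUE, scale = FALSE)
cov_n <- t(Ac) %*% Ac / n

# Standardize each column with the N-denominator SD, then SSCP / N gives correlations
sd_n <- sqrt(diag(cov_n))
Z <- sweep(Ac, 2, sd_n, "/")
cor_n <- t(Z) %*% Z / n

# Built-in equivalents (cov() uses the N - 1 denominator)
cov_r <- cov(A)
cor_r <- cor(A)
```

Here cov_r equals cov_n * n / (n - 1), and cor_r equals cor_n.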

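Extending the Concept of Distance

• Distance of a point from the mean in univariate space = $|x_i - \bar{x}|$

• Euclidean distance

• Measuring distance in multivariate data

– $\sqrt{(x_i - \bar{x})^2+(y_i - \bar{y})^2+\cdots+(n_i - \bar{n})^2}$

• Euclidean distance

• Definition

• Limitation – the variables usually have some degree of covariance, which Euclidean distance ignores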

Concepts of Distance

• Distance between numeric data points – Minkowski: $d(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$

– Euclidean distance • when p = 2

– Manhattan distance • when p = 1

– Mahalanobis distance

• Others

– Distance between categorical data points • Hamming distance, Jaccard, …

– Distance between sequences (strings, time series)

• Other related concepts – z-transform, Pearson, …
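The distance measures listed above are all available in base R. The following sketch uses two made-up points and a made-up covariance matrix S (both are assumptions for illustration); dist() covers the Minkowski family and mahalanobis() returns the squared Mahalanobis distance.

```r
x <- c(1, 2)
y <- c(4, 6)
pts <- rbind(x, y)

d_euclid    <- as.numeric(dist(pts, method = "euclidean"))         # Minkowski with p = 2
d_manhattan <- as.numeric(dist(pts, method = "manhattan"))         # Minkowski with p = 1
d_mink_p3   <- as.numeric(dist(pts, method = "minkowski", p = 3))

# Illustrative covariance matrix (an assumption, not from the slides)
S <- matrix(c(4, 2,
              2, 3), nrow = 2, byrow = TRUE)

# mahalanobis() returns the SQUARED distance (x - y)' S^{-1} (x - y)
d2_mahal    <- mahalanobis(x, center = y, cov = S)
d2_identity <- mahalanobis(x, center = y, cov = diag(2))  # reduces to squared Euclidean

# Hamming distance between two categorical vectors: number of differing positions
a <- c("red", "small", "round")
b <- c("red", "large", "round")
d_hamming <- sum(a != b)
```

With an identity covariance matrix the Mahalanobis distance collapses to the Euclidean one, which is one way to see that it is the Euclidean distance after rescaling the coordinates to remove covariance.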

Variance-Covariance Matrix

Covariance and Distance

• It would be easier to calculate distance if we could rescale the coordinates so that they had no covariance

Distances as Vectors

• Distances in coordinate space can be described as vectors

MVA Techniques

Technique                            Dependence/Interdependence   Exploratory/Confirmatory
Factor Analysis                      Interdependence              Exploratory, Confirmatory
MDS                                  Interdependence              Exploratory
Cluster Analysis                     Interdependence              Exploratory
Canonical Correlation                Dependence                   Confirmatory
SEM (Structural Equation Modeling)   Dependence                   Confirmatory
ANOVA                                Dependence                   Confirmatory
Discriminant Analysis                Dependence                   Confirmatory
Logit Choice Model                   Dependence                   Confirmatory

Source: Analyzing Multivariate Data, J.M. Lattin et al.

PCA, SVD, and Discriminant Analysis


Singular Value Decomposition (SVD)

• Three perspectives

– a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items.

– a method for identifying and ordering the dimensions along which data points exhibit the most variation.

– a method for data reduction.

• Significance of SVD

– Dimensionality reduction: exposes the substructure of the original data more clearly and orders it from the most variation to the least.

• Typical application: NLP

– identifies the main relationships latent in the documents while ignoring variation below a given threshold, in exchange for a drastic reduction in dimensionality

• Method – a rectangular matrix $A$ can be broken into the product of three matrices: an orthogonal matrix $U$, a diagonal matrix $S$, and the transpose of an orthogonal matrix $V$, i.e. $A = U S V^T$

• where $U^TU = I$, $V^TV = I$;

• the columns of $U$ are orthonormal eigenvectors of $AA^T$,

• the columns of $V$ are orthonormal eigenvectors of $A^TA$,

• $S$ is a diagonal matrix containing the square roots of the eigenvalues of $AA^T$ (equivalently, the nonzero eigenvalues of $A^TA$) in descending order.

– To find $U$, start with $AA^T$

– Find the eigenvalues and corresponding eigenvectors of $AA^T$

• The eigenvectors become column vectors in a matrix, ordered by the size of the corresponding eigenvalue: the eigenvector for λ = 12 is column 1, and the eigenvector for λ = 10 is column 2.

• Convert this matrix into an orthogonal matrix by applying the Gram-Schmidt orthonormalization process to the column vectors

– Normalize $v_1$

– To find $V$

• For λ = 12;

– …

– $v_1 = [1, 2, 1]$.

• For λ = 10;

– …

– $v_2 = [2, -1, 0]$.

• For λ = 0;

– …

– $v_3 = [1, 2, -5]$.

• Gram-Schmidt orthonormalization process

• For $S$ – we take the square roots of the non-zero eigenvalues and populate the diagonal with them

– put the largest in $s_{11}$, the next largest in $s_{22}$, and so on until the smallest value ends up in $s_{mm}$.

– then add a zero column vector to $S$ so that it can be multiplied between $U$ and $V$.

– The diagonal entries in $S$ are the singular values of $A$, the columns in $U$ are the left singular vectors, and the columns in $V$ are the right singular vectors.

• Now we have all the pieces of the puzzle
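The hand calculation above can be cross-checked with base R's svd(). The matrix A itself did not survive in the extracted slides, so the sketch below assumes A = [3 1 1; −1 3 1], a 2×3 matrix that is consistent with the eigenvalues (12, 10) and the eigenvectors [1, 2, 1] and [2, −1, 0] quoted in the text; treat that choice as an assumption.

```r
# Assumed example matrix (not shown in the extracted slides)
A <- matrix(c( 3, 1, 1,
              -1, 3, 1), nrow = 2, byrow = TRUE)

# Eigenvalues of A A^T (2x2) and A^T A (3x3)
eig_AAt <- eigen(A %*% t(A))$values
eig_AtA <- eigen(t(A) %*% A)$values

# svd() returns the "thin" decomposition: u is 2x2, d has 2 values, v is 3x2
s <- svd(A)
A_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)

# Normalized eigenvectors from the slides; singular vectors are unique only up to sign
v1 <- c(1, 2, 1) / sqrt(6)
v2 <- c(2, -1, 0) / sqrt(5)
align1 <- abs(sum(s$v[, 1] * v1))   # 1 when the directions coincide
align2 <- abs(sum(s$v[, 2] * v2))
```

The squared singular values s$d^2 equal the nonzero eigenvalues, and the thin SVD already omits the zero column that the manual procedure adds to S.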

LDA (Linear Discriminant Analysis)

[Figure: map of classification methods. Frequency table: ZeroR, OneR, Naive Bayesian, Decision tree. Covariance matrix: LDA, Logistic regression. Similarity functions: KNN. Others: ANN, SVM.]

LDA

• LDA: pick a new dimension that gives: – Maximum separation between means of projected classes

– Minimum variance within each projected class

• Solution: eigenvectors based on between-class and within-class covariance matrices


PCA vs. LDA

• LDA not guaranteed to be better for classification – Assumes classes are unimodal Gaussians

– Fails when discriminatory information is not in the mean, but in the variance of the data

• Example where PCA gives better projection:

• Modeling the difference between groups for the purpose of separating two or more classes, objects, or categories

– much like logit and probit models

• LDA seeks to reduce dimensionality while preserving as much of the (two-)class discriminatory information as possible

• (ex)

– Assume D-dimensional samples, $N_1$ of which belong to class $w_1$ and $N_2$ to class $w_2$.

– Obtain a scalar $y$ by projecting the samples $x$ onto a line: $y = w^T x$

• Of all possible lines, select the one that maximizes the separability of the scalars

• Examples – Discriminating among high school students who

• will go to college

• will go to trade school

• will discontinue education

– Some pattern must be there, so we collect

• family background

• academic information

– Discriminating whether a person is male or female based on height

Theory

• Discrimination by comparing the means of variables

• Can have several variables

• Assumes multivariate normality: the independent variables need to be continuous

• Homogeneous variance

• Create an equation that minimizes the possibility of misclassifying cases into their respective groups or categories – D = a1*X1 + a2*X2 + ... + ai*Xi + b

• where:

• D = discriminant function

• X = response score for that variable

• a = discriminant coefficient (analogous to a regression coefficient)

• b = constant

• i = number of discriminant variables

• Based on the concept of searching for a linear combination of variables (predictors) that best separates two classes (targets)

• To capture the notion of separability, Fisher defined the following score function:

• Given the score function, the problem is to estimate the linear coefficients that maximize the score, which can be solved by the following equations:

– $\beta = C^{-1}(\mu_1 - \mu_2)$ (model coefficients)

– $C = \dfrac{1}{n_1+n_2}\,(n_1 C_1 + n_2 C_2)$ (pooled covariance matrix)

» $\beta$: linear model coefficients

» $C_1, C_2$: covariance matrices

» $\mu_1, \mu_2$: mean vectors

• Mahalanobis distance between 2 groups – a distance greater than 3 means that the two averages differ by more than 3 standard deviations

– this means that the overlap (probability of misclassification) is quite small

– $\Delta^2 = \beta^T(\mu_1 - \mu_2)$

– $\Delta$: Mahalanobis distance between the two groups

• Finally, a new point is classified by projecting it onto the maximally separating direction and classifying it as $c_1$ if:

– $\beta^T\left(x - \dfrac{\mu_1+\mu_2}{2}\right) > \log\dfrac{P(c_1)}{P(c_2)}$

• $\beta$: coefficient vector

• $x$: data vector

• $\mu$: mean vector

• $P(c)$: class probability

LDA Example

• Discriminating default risk among a bank's clients (small and medium-sized businesses) – clients who defaulted (red squares) and those that did not (blue circles), separated by delinquent days (DAYSDELQ) and number of months in business (BUSAGE)

– Discriminant model using LDA (default and non-default)

– Data (number of observations = 100)

BUSAGE   DAYSDELQ   DEFAULT   Z   Z-Z0   Prediction
87       13         N
89       27         Y
...

• We use LDA to find an optimal linear model that best separates the two classes (default and non-default)

Z0 = 0.3985302, log(P(N)/P(Y)) = 0.4771212547

• The first step is to calculate the mean (average) vectors, covariance matrices, and class probabilities

• Then we calculate the pooled covariance matrix and finally the coefficients of the linear model

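• Assume we have a point with BUSAGE = 111 and DAYSDELQ = 24

– x = [111 24]

• $\beta^T\left(x - \dfrac{\mu_1+\mu_2}{2}\right) > \log\dfrac{p(c_1)}{p(c_2)}$

• $\beta^T$: coefficient vector

• $x$: data vector

• $\dfrac{\mu_1+\mu_2}{2}$: mean vector

• $p(c_1)$: class probability

• $\begin{bmatrix} -0.0095 & -0.1408 \end{bmatrix} \left( \begin{bmatrix} 111 \\ 24 \end{bmatrix} - \dfrac{1}{2}\left( \begin{bmatrix} 116.23 \\ 16.89 \end{bmatrix} + \begin{bmatrix} 115.04 \\ 55.32 \end{bmatrix} \right) \right) \overset{?}{>} \log\dfrac{0.75}{0.25}$

• A Mahalanobis distance of 2.32 shows a small overlap between the two groups, which means a good separation between the classes by the linear model – $\Delta^2 = \beta^T(\mu_1 - \mu_2) = 5.40$

– $\Delta = 2.32$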
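The arithmetic of this worked example can be replayed in a few lines of R. The coefficient and mean vectors are the ones printed on the slides; two things are assumptions made here: that class c1 is the non-default group (because the slide pairs log(P(N)/P(Y)) with 0.75/0.25), and that the logarithm is base 10 (because the slide's value 0.4771 equals log10(3)).

```r
beta <- c(-0.0095, -0.1408)   # model coefficients from the slide
mu1  <- c(116.23, 16.89)      # class means (BUSAGE, DAYSDELQ); assumed: non-default
mu2  <- c(115.04, 55.32)      # assumed: default
x    <- c(111, 24)            # new client to classify

# Left-hand side: projection of x, centered at the midpoint of the class means
lhs <- sum(beta * (x - (mu1 + mu2) / 2))

# Right-hand side: log ratio of class probabilities (base 10 assumed, see lead-in)
rhs <- log10(0.75 / 0.25)

is_c1 <- lhs > rhs            # TRUE means the point is assigned to class c1

# Mahalanobis distance between the two groups
delta2 <- sum(beta * (mu1 - mu2))
delta  <- sqrt(delta2)
```

With these inputs the left-hand side is about 1.75 against a threshold of about 0.48, so the point falls in class c1, and delta reproduces the 2.32 quoted on the slide.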

PCA

1. Concept

• Significance – PCA: an orthogonal projection of highly correlated variables onto principal components

• the linear transformation is defined in such a way that the first principal component has the largest possible variance.

• PC: a set of values of linearly uncorrelated variables

• Applications – High-dimensional data

– image processing, text processing, stock data, …

– describing them in a simpler way

Covariance

• For 2-dimensional data: cov(x,y)

• For 3-dimensional data: cov(x,y), cov(x,z), cov(y,z)

• For an n-dimensional data set: $\dfrac{n!}{(n-2)! \cdot 2}$ different covariance values

• So, the definition of the covariance matrix for a data set with n dimensions is: $C^{n \times n} = (c_{i,j})$, where $c_{i,j} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j)$

• Eigenvector? – a nonzero vector that, after being multiplied by the matrix, remains parallel to the original vector.

• To keep eigenvectors standard, we usually scale each one to a length of 1, so that all eigenvectors have the same length.

[Figure: a non-eigenvector and an eigenvector]

• Procedure – Step 1: obtain and prepare the data

– Steps 2-3: subtract the mean and compute the covariance matrix

– Step 4: compute the eigenvectors and eigenvalues of the covariance matrix

– Step 5: choose components and form the feature vector

– Step 6: derive the new data set

• Choosing components and forming the feature vector

• the eigenvector with the highest eigenvalue is the principal component of the data set.

• the remaining ones can be omitted…

• Step 6: deriving the new data set

– RowFeatureVector = the matrix with the eigenvectors in the columns, transposed, with the most significant eigenvector at the top.

– RowDataAdjust = the mean-adjusted data, transposed, i.e. the data items are in each column, with each row holding a separate dimension.

– the original data are transformed in terms of the vectors we chose

– the patterns are the lines that most closely describe the relationships between the data.

• Getting the old data back

• Biplot – shows the proportions of each variable along the 2 PCs

• Scree plot
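The procedure above can be followed step by step in base R and compared with prcomp(). This is a sketch only: the built-in USArrests data set is used as a stand-in example (an assumption, not the slides' data), and the projection is written with observations in rows, which is the transpose of the RowFeatureVector × RowDataAdjust form.

```r
X <- as.matrix(USArrests)                       # example data, rows = observations

# Steps 2-3: subtract the mean, compute the covariance matrix
Xc <- scale(X, center = TRUE, scale = FALSE)
C  <- cov(Xc)

# Step 4: eigenvectors and eigenvalues of the covariance matrix
e <- eigen(C)

# Step 5: keep the k eigenvectors with the largest eigenvalues (feature vector)
k <- 2
feature_vector <- e$vectors[, 1:k]

# Step 6: new data set = centered data projected onto the chosen eigenvectors
final_data <- Xc %*% feature_vector

# Getting the old data back (only approximate when k < number of variables)
X_back <- final_data %*% t(feature_vector) +
  matrix(colMeans(X), nrow(X), ncol(X), byrow = TRUE)

# Same analysis with the built-in function; signs of components may be flipped
p <- prcomp(X, center = TRUE, scale. = FALSE)
# biplot(p)      # variables and observations along the first two PCs
# screeplot(p)   # variance explained by each component
```

The eigenvalues e$values match p$sdev^2, and final_data matches the first two columns of p$x up to sign.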

EDA

EDA (Exploratory Data Analysis)

• Main uses

– 1. Gain insight into the data set.

– 2. Identify the factors that affect the data (understand critical impact variables) and understand their relationships

– 3. Check whether outliers are present

– 4. Test the underlying assumptions of the data set

• Data analysis: exploratory vs. confirmatory

– Confirmatory data analysis • tests a hypothesis • settles questions • (inferential statistics)

– Exploratory data analysis • finds a good description • raises new questions • (descriptive statistics)

• Exploratory Data Analysis (EDA) – an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to

• maximize insight into a data set;

• uncover underlying structure;

• extract important variables;

• detect outliers and anomalies;

• test underlying assumptions;

• develop parsimonious models; and

• determine optimal factor settings.

Three Approaches

• For classical analysis, the sequence is – Problem => Data => Model => Analysis => Conclusions

• For EDA, the sequence is – Problem => Data => Analysis => Model => Conclusions

• For Bayesian analysis, the sequence is – Problem => Data => Model => Prior Distribution => Analysis => Conclusions

• Data analysis workflow – visualization and EDA

Figure source: Wickham and Grolemund

• Data Munging – Transforming data

– Raw data to usable data

– Data must be cleaned first

• Main tasks – Renaming variables

– Data type conversion

– Encoding, decoding or recoding data

– Merging data sets

– Transforming data

– Handling missing data (imputing)

– Handling anomalous values


1. Technical Aspects

• (1) Reading data – read.table

• read.delim, read.delim2, read.csv • read.csv2, read.table, read.fwf

– A freshly read data.frame should always be inspected with functions like head, str, and summary

• (2) Type conversion – coercion

• as.numeric, as.logical, as.integer • as.factor, as.character, as.ordered

– factor conversion • factor()

– date conversion • library(lubridate)

• (3) Strings and encoding

– Sys.getlocale("LC_CTYPE")

– f <- file("myUTF16file.txt", encoding = "UTF-16")

2. Consistent Data

• (1) Missing values

– na.rm = TRUE

– (persons_complete <- na.omit(person))

• (2) Special values – (example)

    is.special <- function(x) {
      # numeric vectors: flag NA, NaN, Inf and -Inf; other types: flag NA only
      if (is.numeric(x)) !is.finite(x) else is.na(x)
    }

• (3) Outliers
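The three checks listed above (missing values, special values, outliers) can be tried on a toy data frame. In this sketch the data frame person and its values are invented for illustration; boxplot.stats() is used as one common, but not the only, way to flag outliers.

```r
# Hypothetical data: one NA, one Inf, and one implausible age
person <- data.frame(
  age    = c(21, NA, 34, 42, 29, 38, 250),
  height = c(1.75, 1.80, Inf, 1.62, 1.70, 1.68, 1.77)
)

# (1) Missing values: skip them in a summary, or drop incomplete rows
mean_age <- mean(person$age, na.rm = TRUE)
persons_complete <- na.omit(person)      # drops rows containing NA (Inf is kept)

# (2) Special values: flag NA/NaN/Inf in every column, then set them to NA
is.special <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}
special <- sapply(person, is.special)    # logical matrix, one column per variable
person[special] <- NA

# (3) Outliers: values beyond 1.5 * IQR from the hinges (boxplot rule)
outliers <- boxplot.stats(person$age)$out
```

Note that na.omit() alone leaves the Inf height in place, which is why the special-value pass is a separate step.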

3. Correction

• Imputation (filling in substitute values)

    library(Hmisc)           # provides impute() and is.imputed()
    x <- 1:5                 # create a vector...
    x[2] <- NA               # ...with an empty value
    x <- impute(x, mean)     # replace the NA with the mean of the observed values
    x
    ##     1     2     3     4     5
    ##  1.00 3.25* 3.00 4.00 5.00
    is.imputed(x)

    # -- ratio imputation
    # y is an auxiliary variable observed for every record (not defined on the slide)
    I <- is.na(x)
    R <- sum(x[!I]) / sum(y[!I])
    x[I] <- R * y[I]

    # -- regression imputation
    data(iris)
    iris$Sepal.Length[1:10] <- NA
    model <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
    I <- is.na(iris$Sepal.Length)
    iris$Sepal.Length[I] <- predict(model, newdata = iris[I, ])

dplyr Basics

• Six key functions

– Pick observations by their values (filter()).

– Reorder the rows (arrange()).

– Pick variables by their names (select()).

– Create new variables with functions of existing variables (mutate()).

– Collapse many values down to a single summary (summarise()).

– plus group_by()

• changes the scope of each function from operating on the entire dataset to operating on it group-by-group.

• Usage

– The first argument is a data frame.

– The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

– The result is a new data frame.

• Logical operations in dplyr's filter()
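The verbs listed above can be chained into a short pipeline. The sketch below assumes the dplyr package is installed and uses the built-in mtcars data as a stand-in; the derived column name kpl is made up for illustration.

```r
library(dplyr)

# filter(): comma-separated conditions are combined with AND; | is OR, ! is NOT
four_or_six <- mtcars %>%
  filter(cyl == 4 | cyl == 6, !is.na(mpg)) %>%
  arrange(desc(mpg)) %>%                 # reorder rows, highest mpg first
  select(mpg, cyl, wt) %>%               # keep three variables by name
  mutate(kpl = mpg * 0.4251)             # new variable: km per litre (approximate factor)

# group_by() + summarise(): one summary row per group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg))
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes the %>% chaining work.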

Tidy Data Sets

• Rules for a tidy data set:

– Each variable must have its own column.

– Each observation must have its own row.

– Each value must have its own cell.

> table4a
# A tibble: 3 × 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

> table4a %>%
+   gather(`1999`, `2000`, key = "year", value = "cases")
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Brazil      1999   37737
3 China       1999  212258
4 Afghanistan 2000    2666
5 Brazil      2000   80488
6 China       2000  213766

Relational data

• A primary key – uniquely identifies an observation in its own table. – (ex) planes$tailnum is a primary key

• A foreign key – uniquely identifies an observation in another table. – (ex) flights$tailnum is a foreign key
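The role of the two kinds of key can be illustrated with base R's merge(). The two small data frames below are invented stand-ins that only borrow the column name tailnum from the example; they are not the actual planes and flights tables.

```r
# Hypothetical miniature tables
planes <- data.frame(
  tailnum = c("N101", "N202", "N303"),
  year    = c(1999, 2004, 2012),
  stringsAsFactors = FALSE
)
flights <- data.frame(
  flight  = c(1, 2, 3, 4),
  tailnum = c("N101", "N202", "N101", "N999"),
  stringsAsFactors = FALSE
)

# A primary key must identify each row of its own table uniquely
pk_ok <- !any(duplicated(planes$tailnum))

# Foreign key values without a match in the referenced table
orphans <- setdiff(flights$tailnum, planes$tailnum)

# Left join: keep every flight, attach plane attributes where the key matches
joined <- merge(flights, planes, by = "tailnum", all.x = TRUE)
```

The unmatched flight keeps its row in joined but gets NA for year, which is a quick way to spot foreign key values that point nowhere.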

Visualization

(v.0.9)

Overview

• Base Graphics – plot()

• hist() and boxplot().

– points(), lines(), text(), mtext(), axis(), rug(), identify()

• Specialized packages: – http://www.computerworld.com/article/2921176/business-intelligence/great-r-packages-for-data-import-wrangling-visualization.html



















Lattice Graphics

• Lattice = a flavour of trellis graphics – For lattice, graphics formulae are mandatory.

– grid = a low-level graphics system. It was used to build lattice.

• Lattice vs. base graphics – xyplot() vs. plot() – plot() gives a graph as a side effect of the command.

– xyplot() generates a graphics object.

• As this is output to the command line, the object is “printed”, i.e., a graph appears.
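The difference between drawing as a side effect and returning a graphics object can be seen directly. A minimal sketch, assuming the lattice package (shipped with standard R distributions) and using mtcars as example data:

```r
library(lattice)

# xyplot() builds and returns a "trellis" object; nothing is drawn at this point
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            xlab = "Weight", ylab = "Miles per gallon")
p_class <- class(p)

# A one-sided formula gives a univariate display
h <- histogram(~ mpg, data = mtcars)

# The graph appears only when the object is printed:
# print(p)          # explicit, needed inside functions, loops and scripts
# p                 # at the console, auto-printing does the same
```

This is why a lattice call inside a function or a loop produces no plot unless it is wrapped in print().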

graph_type    description                 formula example
barchart      bar chart                   x~A or A~x
bwplot        boxplot                     x~A or A~x
cloud         3D scatterplot              z~x*y|A
contourplot   3D contour plot             z~x*y
densityplot   kernel density plot         ~x|A*B
dotplot       dotplot                     ~x|A
histogram     histogram                   ~x
levelplot     3D level plot               z~y*x
parallel      parallel coordinates plot   data frame
splom         scatterplot matrix          data frame
stripplot     strip plots                 A~x or x~A
xyplot        scatterplot                 y~x|A
wireframe     3D wireframe graph          z~y*x

ggplot2

• Features of ggplot2 – (strength) consistent underlying grammar of graphics – (limitation) things you cannot do:

• 3-dimensional graphics • graph-theory type graphs (nodes/edges layout)

• Anatomy of a plot:

– data

– aesthetic mapping

– geometric object

– statistical transformations

– scales

– coordinate system

– position adjustments

– faceting

• ggplot2 vs. Base Graphics – ggplot2 is more verbose for simple / canned graphics – is less verbose for complex / custom graphics – does not have methods (data should always be in a data.frame) – uses a different system for adding plot elements

• Geometric Objects and Aesthetics – Aesthetic Mapping

• in ggplot, an aesthetic is "something you can see"

– position (i.e., on the x and y axes)

– color ("outside" color)

– fill ("inside" color)

– shape (of points)

– linetype

– size

• aes()

– Geometric Objects • = the actual marks we put on a plot

– points (geom_point, for scatter plots, dot plots, etc)

– lines (geom_line, for time series, trend lines, etc)

– boxplot (geom_boxplot, for, well, boxplots!)
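The pieces named above (data, aesthetic mapping, geometric objects, faceting) combine with + into one plot object. The following sketch assumes the ggplot2 package is installed and uses mtcars as example data.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +  # data + aesthetic mapping
  geom_point(size = 2) +                                           # geometric object: points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +        # statistical transformation
  facet_wrap(~ gear) +                                             # faceting
  labs(x = "Weight", y = "Miles per gallon", color = "Cylinders")

# Boxplots use a different geom on the same grammar
b <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot()

# print(p); print(b)   # drawing happens when the object is printed
```

Mapping a variable inside aes() ties it to the data, whereas a constant such as size = 2 outside aes() applies to every point.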

Machine Learning Modeling

Machine Learning?

• Concept – a subfield of Artificial Intelligence (AI)

– "construction and study of systems that can learn from data"

• Types – http://en.wikipedia.org/wiki/Machine_learning

• Methodology – "A computer program learns from experience (E) with some class of tasks (T) and a performance measure (P) if its performance at tasks in T as measured by P improves with E"

• Terminology – Features

• = distinct traits that can be used to describe each item in a quantitative manner.

– Samples • an item to process (e.g. classify) • a document, a picture, a sound, a video, a row in a database or CSV file, …

– Feature vector • an n-dimensional vector of numerical features that represent some object.

– Feature extraction • preparation of the feature vector – transforms the data in the high-dimensional space to a space of fewer dimensions.

– Training/Evolution set • set of data used to discover potentially predictive relationships.

Learning vs. Training

Procedure

Types

• Supervised Learning

• Unsupervised Learning

• Semi-Supervised Learning

• Reinforcement Learning – allows the machine or software agent to learn its behavior based on feedback from the environment.

– This behavior can be learnt once and for all, or keep on adapting as time goes by.

Types of ML Algorithms

• Predictive model – seeks to discover or model the relationship between the target variable and the other features

– = supervised learning: the learner gets clear instruction on what to learn and how (here the supervisory role is played not by a person but by the target values, which guide the search …)

• Descriptive model – no target to learn; no single feature is more important than any other

– (ex) pattern discovery (market basket analysis, clustering)

Supervised L Unsupervised L Other Applications Remarks

NN

(Naïve) Bayes

Decision Tree

(Classif’n Rule L)

Linear Regression

Model Tree

Neural Net

SVM

AR

K-means

..

Marketing Anal.

…

KNN

• = classifies unlabeled examples by assigning them the class of the most similar labeled examples

• [Example] Assigning a class to a tomato through blind testing

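• Distance calculation – Euclidean distance

– Manhattan distance

• NN – 1NN: the nearest neighbor is an orange, so the tomato is classified as a fruit

– 3NN: vote among the 3 nearest neighbors

• Choosing an appropriate K

            Large K               Small K
Benefit     variance decreases    bias decreases
Risk        underfitting          overfitting (with K = 1, a single outlier decides the result)

In practice: depends on the complexity of the concept to be learned and the number of training examples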

Procedure

• Data preparation – preprocessing – transform features to a standard range

• Shrinking/rescaling: min-max normalization

• Z-score standardization

– For nominal features: dummy coding • for ordinal data, assign numbers and then normalize

• Characteristics – lazy learning: no abstraction, no generalization

– instead, instance-based learning

– non-parametric learning
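The preparation and classification steps described above fit in a few lines of base R. This is an illustrative sketch, not the implementation used in the course: the helper names normalize and knn_predict are made up here, the built-in iris data stands in for real data, and ties in the vote are resolved by taking the first class.

```r
# Min-max normalization: rescale a numeric vector to the range [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Classify one query point by majority vote among its k nearest training rows
knn_predict <- function(train, labels, query, k = 3) {
  d <- sqrt(rowSums(sweep(train, 2, query)^2))   # Euclidean distance to each row
  nearest <- order(d)[seq_len(k)]                # indices of the k closest rows
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]                 # first class with the most votes
}

X  <- as.matrix(iris[, 1:4])
Xn <- apply(X, 2, normalize)                     # normalize every feature

set.seed(1)
test_idx <- sample(nrow(Xn), 30)                 # hold out 30 rows for testing

pred <- vapply(
  test_idx,
  function(i) knn_predict(Xn[-test_idx, ], iris$Species[-test_idx], Xn[i, ], k = 3),
  character(1)
)
accuracy <- mean(pred == iris$Species[test_idx])
```

There is no training step at all: the "model" is the stored training data, which is what lazy, instance-based learning means. Without the normalization step, features with large ranges would dominate the distance.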

Applications

• Voronoi Diagram – decision surface induced by the training examples

BAYESIAN METHODS AND NAÏVE BAYES

Basic Concepts

• Background: – the estimated likelihood of an event should be based on the evidence at hand.

• Probability

• Conditional probability - Bayes' theorem

– Prior probability

• the most reasonable guess would be the probability that any prior message was spam (about 20% in the example).

– Likelihood

• probability that the word Viagra was used in previous spam messages

– Marginal likelihood

• probability that Viagra appeared in any message at all.

• Posterior probability – use Bayes' theorem to compute the posterior probability that the message is spam

– IF (> 50%) THEN the message is likely to be spam → it should be filtered.

• Applying Bayes' theorem

• Frequency table

• P(spam ∩ Viagra) = P(Viagra|spam) * P(spam) = (4/20) * (20/100) = 0.04

• P(spam|Viagra) = P(Viagra|spam) * P(spam) / P(Viagra) = (4/20) * (20/100) / (5/100) = 0.80
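The two results above follow directly from four counts. A small sketch in R, with the counts read off the fractions in the text (100 messages, 20 spam, 5 containing the word, 4 of those spam); the variable names are chosen here for illustration.

```r
n_total       <- 100   # all messages
n_spam        <- 20    # spam messages
n_viagra      <- 5     # messages containing the word
n_viagra_spam <- 4     # spam messages containing the word

prior      <- n_spam / n_total          # P(spam)
likelihood <- n_viagra_spam / n_spam    # P(Viagra | spam)
marginal   <- n_viagra / n_total        # P(Viagra)

joint     <- likelihood * prior         # P(spam and Viagra)
posterior <- joint / marginal           # P(spam | Viagra), Bayes' theorem

filter_it <- posterior > 0.5            # decision rule: filter if more likely spam than not
```

The posterior (0.80) is simply the share of word-containing messages that are spam, 4 out of 5, which is a useful sanity check on the formula.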

Naïve Bayes Classification

• Extending the spam mail example – train by constructing a likelihood table for the appearance of 4 words:

– Probability calculation (ex) Viagra=Y, Money=N, Groceries=N, Unsubscribe=Y

– Simplify the calculation using class-conditional independence

– likelihood of spam: • (4/20) * (10/20) * (20/20) * (12/20) * (20/100) = 0.012

• Probability of spam: 0.012 / (0.012 + 0.002) = 0.857

– likelihood of ham: • (1/80) * (66/80) * (71/80) * (23/80) * (80/100) = 0.002

• Probability of ham: 0.002 / (0.012 + 0.002) = 0.143

– => expect that the message is spam with probability 85.7% and ham with probability 14.3%. • That is, "this message is 6 times more likely to be spam than ham."

– The probability of level L for class C, given the evidence provided by features F1 ~ Fn, is: $P(C_L \mid F_1, \dots, F_n) = \dfrac{1}{Z}\, p(C_L) \prod_{i=1}^{n} p(F_i \mid C_L)$, where $Z$ is a scaling factor that converts the result into a probability

• Laplace estimator – (IF) the message contains: Viagra, Groceries, Money, Unsubscribe.

• likelihood of spam under the naive Bayes algorithm:

– (4/20) * (10/20) * (0/20) * (12/20) * (20/100) = 0

• and the likelihood of ham is:

– (1/80) * (14/80) * (8/80) * (23/80) * (80/100) = 0.00005

• probability of spam is: 0 / (0 + 0.00005) = 0, probability of ham is: 0.00005 / (0 + 0.00005) = 1

– (Solution) Laplace estimator

• add a small number (1) to each count in the frequency table → ensures that each feature has a nonzero probability of occurring with each class.

• Using numeric features with Naïve Bayes – by discretizing/binning
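The zero-count problem and its Laplace fix can be reproduced with the word counts implied by the fractions above. In this sketch the function name nb_posterior is made up, and the convention of enlarging each class total by (Laplace value × number of features) is an assumption about how the correction is applied.

```r
# Number of messages in each class that contain each word
counts <- rbind(
  spam = c(Viagra = 4, Money = 10, Groceries = 0, Unsubscribe = 12),
  ham  = c(Viagra = 1, Money = 14, Groceries = 8, Unsubscribe = 23)
)
class_n <- c(spam = 20, ham = 80)
prior   <- class_n / sum(class_n)

# Posterior for a message containing all four words
nb_posterior <- function(laplace = 0) {
  # P(word | class); each class total grows by laplace * number of features (assumed convention)
  p_word <- (counts + laplace) / (class_n + laplace * ncol(counts))
  lik <- apply(p_word, 1, prod) * prior      # class-conditional independence
  lik / sum(lik)                             # normalize so the two classes sum to 1
}

p_raw     <- nb_posterior(0)   # the single zero count wipes out the spam likelihood
p_laplace <- nb_posterior(1)   # every word now has a nonzero probability in each class
```

Without the correction the message is classified as ham with certainty; with a Laplace value of 1 the spam probability comes out at roughly 0.8 under this convention.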

Case Study: Mobile Phone Spam Filtering

• Spam filter for SMS

• Data: sms_spam.csv

• Package: uses tm – Corpus() creates a corpus (= a collection of text documents)

– VectorSource() tells Corpus() to use the messages in the vector sms_train$text

• ** tm package **