Python Data Analysis Supplementary Material - 윤형기 ([email protected])


Page 1

Python Data Analysis Supplementary Material

윤형기 ([email protected])

Page 2

REGRESSION

Simple and Multiple Regression

Logistic Regression

Page 3

Regression

• Overview – the relationship between a single numeric dependent variable (the value to be predicted) and one or more numeric independent variables (the predictors).

– "regression" = the process of fitting lines to data (Galton)

– also used for hypothesis testing: determining whether the data indicate that a presupposition is more likely to be true or false.

• Applies to a variety of models – SLR

– MLR

– GLM

– Link functions

– Logistic regression, Poisson regression, …


Page 4

Simple Regression Analysis

• Simple regression analysis – dependent variable = the variable to be predicted (y).

– independent variable = explanatory variable = the predictor (x).

– scope: only a straight-line relationship between 2 variables

• Determining the regression line • the deterministic regression model is y = β0 + β1x

• the probabilistic regression model is y = β0 + β1x + ε


Page 5

• OLS (Ordinary Least Squares)
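The OLS estimates follow from the textbook formulas b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1·x̄. A minimal NumPy sketch, with made-up data:

```python
import numpy as np

# Toy data (hypothetical): x = predictor, y = response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# OLS estimates for the line y = b0 + b1*x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
```

For these numbers the fit is b1 ≈ 1.99 and b0 ≈ 0.05, i.e. a line close to y = 2x.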


Page 6

– Residual analysis


Page 7

– Standard error of the estimate • for error analysis, instead of computing the residuals (the estimation errors at individual points), use the standard error of the estimate.

– SSE is in part a function of the number of pairs of data used to compute the sum, which lessens the value of SSE as a measurement of error.

– a better measure = the standard error of the estimate (se), the standard deviation of the errors of the regression model.

– (normal-distribution empirical rule: about 68% of values fall within μ ± 1σ, and 95% within μ ± 2σ; regression likewise assumes that, for a given x, the error terms are normally distributed)

– since the error terms are normally distributed with mean 0 and standard deviation se,

» about 68% of the error values (residuals) should be within 0 ± 1se

» about 95% of the error values (residuals) should be within 0 ± 2se.

– se provides a single measure of the magnitude of the errors in the model.

– it is also used to identify outliers (e.g., points outside ±2se or ±3se)
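These rules can be sketched directly: compute se from SSE, then flag residuals beyond ±2se (the data are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit the line, then examine the residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

n = len(x)
sse = np.sum(residuals ** 2)          # sum of squared errors
se = np.sqrt(sse / (n - 2))           # standard error of the estimate
flagged = np.abs(residuals) > 2 * se  # candidate outliers beyond ±2*se
```

With an intercept in the model the residuals sum to zero, and for this small, well-behaved sample no point falls outside ±2se.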


Page 8

– Coefficient of determination

• R² = how much of the variability of the D.V. (y) the I.V. (x) explains

» ranges from r² = 0 to r² = 1

– the D.V. (y) has a variation, measured by the sum of squares of y (SSyy):

» SSyy = SSR + SSE

» dividing each term by SSyy gives 1 = SSR/SSyy + SSE/SSyy

– r² is the proportion of y's variability explained by the regression model: r² = SSR/SSyy = 1 − SSE/SSyy

• Relationship between r and r²

– r² = (r)²

» the coefficient of determination is the square of the coefficient of correlation

– Hypothesis test for the slope of the regression model & testing the overall model

• slope: test H0: β1 = 0 with a t test
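A quick numerical check of SSyy = SSR + SSE and r² = (r)², on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_yy = np.sum((y - y.mean()) ** 2)   # total variation of y
sse = np.sum((y - y_hat) ** 2)        # unexplained variation
ssr = ss_yy - sse                     # explained variation
r2 = ssr / ss_yy                      # coefficient of determination

r = np.corrcoef(x, y)[0, 1]           # correlation coefficient
# in simple linear regression, r**2 equals r2
```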


Page 9

Coefficient estimation

– OLS chooses β0 and β1 to minimize the RSS; some calculus gives the closed-form estimates

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², β̂0 = ȳ − β̂1·x̄

Page 10

Accuracy of the Estimated Coefficients

• (Q) How accurate is the estimate of μ?

• (A) Compute SE(μ̂), the standard error of μ̂

– likewise, compute standard errors for β̂0 and β̂1

• residual standard error: RSE = √(RSS / (n − 2))

Page 11

Accuracy of the Linear Model

• Residual Standard Error

• R2 Statistic

– r = Cor(x, y)

– R2 = r2

Page 12

Multiple Regression Analysis

• SLR vs. MLR • simple regression model: y = β0 + β1x + ε

• multiple regression model: y = β0 + β1x1 + β2x2 + … + βkxk + ε

• MR model with two independent variables (first order) – y = β0 + β1x1 + β2x2 + ε

– the constant and coefficients are estimated from the sample: ŷ = b0 + b1x1 + b2x2, which defines a response surface / response plane

• Significance tests for the regression model and its coefficients – <analyzing the adequacy of the regression model>

– testing the overall model • simple regression: a t test of the slope of the regression line to see if it ≠ 0 (i.e., whether the I.V. contributes significantly to predicting the D.V.)

• multiple regression: an analogous test makes use of the F statistic.


Page 13

Page 14

– Significance tests for the regression coefficients

• individual significance tests for each regression coefficient, using a t test.

– H0: β1 = 0, H0: β2 = 0, …, H0: βk = 0

– Ha: β1 ≠ 0, Ha: β2 ≠ 0, …, Ha: βk ≠ 0

– the d.f. for each of the individual tests of regression coefficients are n − k − 1.
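A sketch of these per-coefficient t tests on simulated data (the data-generating coefficients are invented for the example; the second predictor has a true coefficient of zero):

```python
import numpy as np

# Hypothetical sample: two predictors, one response
rng = np.random.default_rng(0)
n, k = 50, 2
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

resid = y - Xd @ beta
dof = n - k - 1                       # degrees of freedom: n - k - 1
s2 = resid @ resid / dof              # estimated error variance
se_beta = np.sqrt(s2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
t_stats = beta / se_beta              # t statistic for H0: beta_j = 0
```

The t statistic for the active predictor should be large in magnitude, while the one for the zero-coefficient predictor should be small.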

– Residuals, standard error of the estimate, and R²

• Residuals

– = the errors of the regression model

– uses: detecting outliers; checking the assumptions of the regression analysis

• SSE and the standard error of the estimate

– the standard error of the estimate measures the dispersion of the points around the best-fit line,

– i.e., how widely the actual y values are scattered around ŷ (the regression line)

– SSE = Σ(y − ŷ)²

– combining the regression assumption (error terms ~ N(0, σ)) with the empirical rule (roughly 68% of residuals within ±1se, 95% within ±2se), the standard error of the estimate is a useful measure of how well the regression model fits the data.


Page 15

Coefficient estimation (MLR): in matrix form the OLS estimate is β̂ = (XᵀX)⁻¹Xᵀy

Page 16

Key Questions

• (1) Is there a relationship between the response and the predictors? – hypothesis test:

• H0: β1 = β2 = ··· = βp = 0

• Ha: at least one βj is non-zero.

– compute the F-statistic: F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]

• where TSS = Σ(yi − ȳ)² and RSS = Σ(yi − ŷi)².

– IF H0 is true (= no relationship between response and predictors) THEN the F value is close to 1

– IF Ha is true, THEN E{(TSS − RSS)/p} > σ², so we expect F > 1.
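The F statistic above can be computed directly; the simulated data below carry a real signal, so F should come out well above 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 3
X = rng.normal(size=(n, p))
y = 0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.8, size=n)

Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ beta

tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
F = ((tss - rss) / p) / (rss / (n - p - 1))
# F >> 1 suggests at least one beta_j is non-zero
```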

Page 17

• (2) Deciding which variables are important – variable selection criteria:

• Mallow's Cp

• Akaike information criterion (AIC)

• Bayesian information criterion (BIC)

• adjusted R²

– but there are 2^p candidate models, so exhaustive search quickly becomes infeasible; instead:

• Forward selection

• Backward selection

• Mixed selection

Page 18

• (3) Model Fit – in SLR, R² is the square of the correlation between the predictor and the response

– in MLR, it equals Cor(Y, Ŷ)²

– a property of the fitted linear model: it maximizes this correlation among all possible linear models.

– the improvement in R² from adding variables can be quantified with a p-value

– definition of RSE: RSE = √(RSS / (n − p − 1))

• thus, models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.

Page 19

• (4) Predictions

– even if we knew the true values of β0, β1, …, βp, perfect prediction would be impossible because of the random error ε (the irreducible error).

– confidence interval (for the average response)

– prediction interval (for an individual response)

Page 20

Other Key Issues

• Interaction terms

• Non-linear effects

• Multicollinearity

• Model Selection


Page 21

• Non-linear models via mathematical transformation – <first-order model>

• one independent variable: y = β0 + β1x1 + ε

• two independent variables: y = β0 + β1x1 + β2x2 + ε

– <polynomial regression model> • contains squared, cubed, or higher powers of the predictor variable(s), and its response surfaces are curvilinear. Yet these are still special cases of the general linear model given by the formula:

• y = β0 + β1x1 + β2x2 + … + βkxk + ε

– <second-order model with one independent variable> • y = β0 + β1x1 + β2x1² + ε

– <quadratic model> a polynomial equation of degree 2 • = a special case of the general linear model: curvilinear regression is handled by recoding the data before the multiple regression analysis is attempted.

– Quadratic form (2차 형식) / quadratic curve (2차 곡선)

• xᵀAx = [x1 x2] [[a b] [c d]] [x1 x2]ᵀ = a·x1² + b·x1x2 + c·x1x2 + d·x2²

• Specialized non-linear models (covered later)
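The second-order model above is still linear in its coefficients, so it can be fit with ordinary OLS by adding a squared column to the design matrix; a NumPy sketch on noise-free made-up data:

```python
import numpy as np

# Second-order model y = b0 + b1*x + b2*x^2 fit as a *linear* model
# by recoding the data: append a squared column to the design matrix.
x = np.linspace(-3, 3, 30)
y = 1.0 + 0.5 * x + 2.0 * x**2        # exact quadratic, no noise

X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta recovers [1.0, 0.5, 2.0] since the data are exactly quadratic
```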


Page 22

Model Transformation

• Concept – exponential model: linearize by taking logs, fit, then map predictions back with the antilog

– inverse model

• Tukey's Ladder of Transformations
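The exponential model's log/antilog recipe can be sketched as follows (made-up, noise-free data, so the transformed fit is exact):

```python
import numpy as np

# Exponential model y = c * exp(b*x): taking logs gives the linear model
# log(y) = log(c) + b*x, which OLS can fit; the antilog recovers c.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * np.exp(0.7 * x)

ly = np.log(y)
b = np.sum((x - x.mean()) * (ly - ly.mean())) / np.sum((x - x.mean()) ** 2)
a = ly.mean() - b * x.mean()
c = np.exp(a)                          # antilog of the fitted intercept
```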


Page 23

• Examples of non-linear relationships – polynomial regression

Page 24

• Interaction in regression analysis – treat the interaction term as a separate independent variable

• an interaction predictor variable can be created by multiplying the data values of one variable by the values of another,

• y = β0 + β1x1 + β2x2 + β3x1x2 + ε

– (the x1x2 term = the interaction term).

• even though no single variable in this model has a power above 1, it is considered a second-order equation because of the x1x2 term.
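Constructing the interaction term by multiplying the two predictor columns, as described above (simulated, noise-free data with invented coefficients):

```python
import numpy as np

# Interaction model y = b0 + b1*x1 + b2*x2 + b3*x1*x2,
# fit by adding the product column x1*x2 to the design matrix.
rng = np.random.default_rng(2)
n = 80
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 3.0 * x1 * x2

X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# with no noise, beta recovers [1.0, 2.0, -1.0, 3.0]
```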


Page 25

• Model building: search procedures – goals when developing a regression model:

• (i) maximize the explained proportion of the variation of the y values.

• (ii) be as parsimonious as possible.

– search procedures • All Possible Regressions – if a data set contains k independent variables, all possible regressions will examine 2^k − 1 different models.

• Stepwise Regression – starts from a single predictor variable, then adds and deletes predictors one step at a time, examining the fit of the model at each step, until no more significant predictors remain outside the model.

– STEP 1/2/3: …

• Forward Selection – = the same as stepwise regression, except that once a variable is entered into the process, it is never dropped.

• Backward Elimination – …


Page 26

• Multicollinearity (다중공선성)

– = two or more independent variables are highly correlated (two: collinearity; several: multicollinearity)

– 1. it is difficult to interpret the estimates of the regression coefficients.

– 2. inordinately small t values for the regression coefficients may result.

– 3. the standard deviations of the regression coefficients are overestimated.

– 4. the algebraic sign of an estimated regression coefficient may be the opposite of what would be expected for a particular predictor.

– the multicollinearity problem also affects the t values used to evaluate the regression coefficients:

• multicollinearity can result in an overestimation of the standard deviations of the regression coefficients, so t values tend to be underrepresentative when multicollinearity is present.

– (approaches)

• examine the correlation matrix to detect possible intercorrelations among predictors.

• use stepwise regression to limit the problem of multicollinearity.
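The correlation-matrix check can be complemented by the variance inflation factor (VIF), a standard diagnostic not named on the slide; a NumPy sketch on simulated near-collinear data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing column j on the others."""
    others = np.delete(X, j, axis=1)
    Xd = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(Xd, X[:, j], rcond=None)
    resid = X[:, j] - Xd @ beta
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)
```

Here the VIFs for x1 and x2 come out large (they are nearly copies of each other), while the VIF for the independent x3 stays near 1.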


Page 27

Interaction

• Concept – when the effect on Y of increasing X1 depends on another variable X2.

• Example – advertising:

• TV and radio advertising both increase sales.


Sales = b0 + b1·TV + b2·Radio + b3·TV·Radio

Parameter Estimates

Term       Estimate    Std Error   t Ratio   Prob>|t|
Intercept  6.7502202   0.247871    27.23     <.0001 *
TV         0.0191011   0.001504    12.70     <.0001 *
Radio      0.0288603   0.008905    3.24      0.0014 *
TV*Radio   0.0010865   5.242e-5    20.73     <.0001 *

Page 28

• Dummy coding – example: "men" and "women" (category listings)

• code as indicator variables (dummy variables); Male = 0, Female = 1.

• suppose we want to include income and gender.

– β2 = the average extra balance each month that females have for a given income level. Males are the "baseline".
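A sketch of this dummy-coded model with invented income/balance numbers (Male = 0 is the baseline; the data are exactly linear so the fit recovers the coefficients):

```python
import numpy as np

# Dummy coding a two-level category: gender = 0 (male, baseline) / 1 (female)
income = np.array([20.0, 35.0, 50.0, 65.0, 80.0, 30.0, 45.0, 60.0])
gender = np.array([0, 0, 0, 0, 1, 1, 1, 1])
balance = 200.0 + 6.0 * income + 25.0 * gender   # hypothetical relationship

X = np.column_stack([np.ones(len(income)), income, gender])
beta, *_ = np.linalg.lstsq(X, balance, rcond=None)
# beta[2] is the average extra balance for females at a given income level
```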


Page 29

Regression equation

females: salary = 112.77 + 1.86 + 6.05·position

males: salary = 112.77 − 1.86 + 6.05·position

(figure: two parallel fitted lines, salary vs. position, one for women and one for men: different intercepts, same slopes)

Regression coefficients

Term           Coefficient   Std Err   t-value   p-value
Constant       233.7663      39.5322   5.9133    0.0000
Income         0.0061        0.0006    10.4372   0.0000
Gender_Female  24.3108       40.8470   0.5952    0.5521

Page 30

Model Assessment and Variable Selection

• Concept – when there are many independent variables, reduce their number to simplify the model

– i.e., minimize MSE through alternative fitting procedures to the OLS fit

– why it matters:

• Prediction Accuracy

• Model Interpretability

• Subset Selection – Stepwise Selection

– Choosing the Optimal Model

• Shrinkage Methods – Ridge Regression

– The Lasso

Page 31

• 1. Prediction Accuracy – when the relationship between X and Y is linear and n >> p, OLS has relatively low bias and low variance (where n = # of observations, p = # of predictors)

– however,

• when n is not much larger than p, the OLS fit can have high variance and may result in overfitting and poor estimates on unseen observations,

• when n < p, the variability of the least squares fit increases dramatically, and the variance of these estimates is infinite

• 2. Model Interpretability – when there are many predictors X, many of them typically have little effect on Y

• leaving these variables in the model makes it harder to see the "big picture", i.e., the effect of the "important variables"

• the model would be easier to interpret by removing (i.e., setting the coefficients to zero) the unimportant variables

Page 32

Solutions

• Subset Selection – identify a subset of the p predictors X, then fit a model using only that subset

– e.g., best subset selection, stepwise selection

• Shrinkage – shrinking the estimated coefficients towards zero reduces variance

– some of the coefficients may shrink to exactly zero, hence shrinkage methods can also perform variable selection

– e.g., ridge regression, the Lasso

• Dimension Reduction – project all p predictors into an M-dimensional space where M < p, then fit a linear regression model

– e.g., principal components regression


Page 33

Best Subset Selection

• One simple approach – take the subset with the smallest RSS or the largest R².

– but R² increases (equivalently, RSS decreases) as more variables are added to the model

– example (figure)


Page 34

• Measures of Comparison – add a penalty to RSS for the number of variables (complexity)

– options:

• Adjusted R2

• AIC (Akaike information criterion)

• BIC (Bayesian information criterion)

• Cp (equivalent to AIC for linear regression)


Page 35

• Stepwise Selection – background

• best subset selection is computationally intensive, especially when there is a large number of predictors (large p)

– more attractive methods:

• Forward Stepwise Selection:

– begins with the model containing no predictors, then adds the one predictor at a time that improves the model the most, until no further improvement is possible

• Backward Stepwise Selection:

– begins with the model containing all predictors, then deletes the one predictor at a time whose removal improves the model the most, until no further improvement is possible
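The forward procedure can be sketched as a greedy RSS search; this simplified version adds a fixed number of predictors rather than using a formal stopping rule (data and coefficients are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

def rss_of(cols):
    """RSS of the OLS fit using the predictor columns in `cols`."""
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r

selected = []
for _ in range(2):                    # greedily add two predictors
    remaining = [j for j in range(p) if j not in selected]
    best = min(remaining, key=lambda j: rss_of(selected + [j]))
    selected.append(best)
```

With this strong simulated signal, the greedy search picks out the two truly active predictors (columns 1 and 3).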


Page 36

Shrinkage Methods

• Ridge Regression – Ordinary Least Squares (OLS) estimates β by minimizing RSS = Σi (yi − β0 − Σj βj·xij)²

– ridge regression minimizes a slightly different quantity: RSS + λ·Σj βj²

– tuning parameter λ • is a positive value.

• has the effect of "shrinking" large values of β towards zero.

• it turns out that such a constraint should improve the fit, because shrinking the coefficients can significantly reduce their variance

• notice that when λ = 0, we get OLS!
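A sketch of ridge's closed form on simulated data; the predictors and response are centered so the intercept is left unpenalized (a common convention):

```python
import numpy as np

# Ridge regression closed form: beta = (X^T X + lam*I)^-1 X^T y
rng = np.random.default_rng(5)
n, p = 40, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.3, size=n)

Xc = X - X.mean(axis=0)     # center so the intercept is not penalized
yc = y - y.mean()

def ridge(lam):
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

beta_ols = ridge(0.0)       # lam = 0 recovers OLS
beta_r = ridge(100.0)       # larger lam shrinks coefficients toward 0
```

Increasing λ shrinks the coefficient vector: the norm of `beta_r` is smaller than the norm of `beta_ols`.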


Page 37

– As λ increases, the standardized coefficients shrink towards 0.

– Effects • OLS estimates generally have low bias but can be highly variable; in particular, when n and p are of similar size or when n < p, the OLS estimates will be extremely variable

• the penalty term makes the ridge regression estimates biased, but it can also substantially reduce their variance


Page 38

– Effects • in general,

– ridge regression estimates will be more biased than OLS but have lower variance

– ridge regression will work best in situations where the OLS estimates have high variance

• if p is large, the best subset selection approach requires searching through an enormous number of possible models

• for ridge regression, for any given λ we only need to fit one model, and the computations turn out to be very simple

– ridge regression can be used even when p > n!


Page 39

Lasso

• Concept – (background) ridge regression isn't perfect

• the penalty term will never force any of the coefficients to be exactly zero; thus the final model will include all variables, which makes it harder to interpret

• the LASSO is similar, but uses a different penalty term

• Penalty term – ridge regression minimizes RSS + λ·Σj βj²

– the LASSO estimates β by minimizing RSS + λ·Σj |βj|


Page 40

• Choosing the tuning parameter λ – select a grid of candidate values, use cross-validation to estimate the error rate on test data for each value of λ, and select the value that gives the smallest error rate

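A sketch of this grid search with K-fold cross-validation on simulated data, using the closed-form ridge fit (the grid values and data are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.5, 1.0, 0, 0, 0, 0, 0]) + rng.normal(scale=1.0, size=n)

def ridge_fit(Xtr, ytr, lam):
    return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(Xtr.shape[1]), Xtr.T @ ytr)

def cv_error(lam, k=5):
    """Mean held-out MSE over k folds for a given lambda."""
    idx = np.arange(n)
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        tr = np.setdiff1d(idx, f)
        beta = ridge_fit(X[tr], y[tr], lam)
        errs.append(np.mean((y[f] - X[f] @ beta) ** 2))
    return np.mean(errs)

grid = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
best_lam = min(grid, key=cv_error)    # lambda with the smallest CV error
```

With a real signal in the data, extreme shrinkage hurts, so the selected λ beats the largest grid value on estimated test error.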