
Python Data Analysis Supplementary Material

윤형기 (hky@openwith.net)

REGRESSION ANALYSIS

Simple / Multiple Regression Analysis

Logistic Regression

Regression

• Overview – an equation relating a single numeric D.V. (the value to be predicted) to one or more numeric I.V. (predictors).

– "regression" = process of fitting lines to data (Galton)

– also used for hypothesis testing, determining whether data indicate that a presupposition is more likely to be true or false.

• Applies to a variety of models – SLR

– MLR

– GLM

– Link functions

– Logistic regression, Poisson regression, …


Simple Regression Analysis

• Simple regression analysis – Dependent variable = the variable to be predicted (y).

– Independent variable = explanatory variable = the predictor (x).

– Scope: only a straight-line relationship between 2 variables

• Determining the regression line • deterministic regression model: y = β0 + β1x

• probabilistic regression model: y = β0 + β1x + ε


• OLS (Ordinary Least Squares)
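A minimal sketch of OLS for simple regression in Python (not part of the original slides; the x and y values below are made up for illustration). It uses the standard closed-form least-squares formulas for the slope and intercept.

import numpy as np

# Hypothetical example data (x = predictor, y = response)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Closed-form OLS estimates: b1 = S_xy / S_xx, b0 = y_bar - b1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
print(f"fitted line: y_hat = {b0:.3f} + {b1:.3f} * x")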


– Residual analysis


– Standard error of the estimate • For error analysis, instead of inspecting the residuals (= the estimation errors at individual points) one by one, use the standard error of the estimate.

– SSE is in part a function of the number of pairs of data being used to compute the sum, which lessens the value of SSE as a measurement of error.

– A better measure = the standard error of the estimate (se), which is the standard deviation of the errors of the regression model.

– (Empirical rule for the normal distribution: 68% lie within μ ± 1σ, 95% within μ ± 2σ. The regression assumption is likewise that, for a given x, the error terms ~ ND().)

– Now, since the error terms ~ ND(), se is the s.d. of the errors, and the average error = 0,

» 68% of the error values (residuals) should be within 0 ± 1se

» 95% of the error values (residuals) should be within 0 ± 2se.

– se provides a single measure of the magnitude of the errors in the model.

– Also used to identify outliers (e.g., residuals outside ±2se or ±3se); a sketch follows below.
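A small sketch (made-up data) that fits the line, computes the residuals and the standard error of the estimate se = sqrt(SSE / (n − 2)), and flags points outside ±2·se as potential outliers.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 8.9])   # hypothetical data; last point is unusual

# Fit by OLS and compute the residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# Standard error of the estimate: se = sqrt(SSE / (n - 2))
sse = np.sum(resid ** 2)
se = np.sqrt(sse / (len(x) - 2))

# Flag residuals outside +/- 2 se as potential outliers
outliers = np.abs(resid) > 2 * se
print("se =", round(se, 3), "| potential outliers at x =", x[outliers])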


– Coefficient of determination

• R² = how much of the variability of the D.V. (y) is explained by the I.V. (x)

» 0 ≤ r² ≤ 1

– The D.V. (y) has variation, measured by the SS of y (SSyy):

» SSyy = SSR + SSE

» If each term is divided by SSyy, the resulting equation is 1 = SSR/SSyy + SSE/SSyy

– r² is the proportion of y variability explained by the regression model:

» r² = SSR/SSyy = 1 − SSE/SSyy

• Relationship Between r and r²

– r² = (r)²

» (coefficient of correlation & coefficient of determination)
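A short sketch (again with made-up data) verifying the decomposition SSyy = SSR + SSE and the identity r² = SSR/SSyy = (correlation of x and y)².

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])   # hypothetical data

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_yy = np.sum((y - y.mean()) ** 2)       # total variation in y
sse   = np.sum((y - y_hat) ** 2)          # unexplained variation
ssr   = np.sum((y_hat - y.mean()) ** 2)   # explained variation

r2 = ssr / ss_yy                          # coefficient of determination
r  = np.corrcoef(x, y)[0, 1]              # coefficient of correlation
print(round(ss_yy, 3), "=", round(ssr + sse, 3))   # SSyy = SSR + SSE
print(round(r2, 4), "=", round(r ** 2, 4))         # r^2 = (r)^2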

– Hypothesis test on the slope of the regression model & testing the model overall

• Slope

– H0: β1 = 0 vs. Ha: β1 ≠ 0 (a t test of the slope)


Coefficient Estimation

– OLS chooses β0 and β1 to minimize the RSS, using some calculus.

Accuracy of the Estimated Coefficients

• (Q) How accurate is the estimate of μ?

• (A) Compute SE(μ̂) (= the standard error of μ̂)

– That is, compute the standard errors of β̂0 and β̂1

• residual standard error: RSE = √(RSS / (n − 2))
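A sketch using statsmodels (assumed available; data simulated with known coefficients) showing how a fitted model reports SE(β̂0), SE(β̂1) and the residual standard error.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 1.5 + 2.0 * x + rng.normal(0, 1, 50)   # simulated data with known betas

X = sm.add_constant(x)            # adds the intercept column
fit = sm.OLS(y, X).fit()

print(fit.params)                 # b0, b1
print(fit.bse)                    # standard errors SE(b0), SE(b1)
rse = np.sqrt(fit.ssr / fit.df_resid)   # RSE = sqrt(RSS / (n - 2)) for simple regression
print(rse)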

Accuracy of the Linear Model

• Residual Standard Error

• R2 Statistic

– r = Cor(x, y)

– R2 = r2

Multiple Regression Analysis

• SLR and MLR • Simple regression model: y = β0 + β1x + ε

• Multiple regression model: y = β0 + β1x1 + β2x2 + … + βkxk + ε

• MR model with independent variables (first order) – y = β0 + β1x1 + β2x2 + ε

– The constant & coefficients are estimated from the sample: ŷ = b0 + b1x1 + b2x2 (a response surface / response plane)

• Significance tests for the regression model and its coefficients – <Analyzing the adequacy of the regression model>

– Testing the model overall • Simple regression: a t test of the slope of the regression line to see if it ≠ 0 (i.e., whether the I.V. contributes significantly to predicting the D.V.)

• Multiple regression: an analogous test makes use of the F statistic.


– Significance tests for the regression coefficients

• Individual significance tests for each regression coefficient, using a t test (see the sketch below).

– H0: β1 = 0   H0: β2 = 0   …   H0: βk = 0

– Ha: β1 ≠ 0   Ha: β2 ≠ 0   …   Ha: βk ≠ 0

– d.f. for each of the individual tests of the regression coefficients = n − k − 1.
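A sketch of these tests with statsmodels' formula interface (column names y, x1, x2 are hypothetical; data simulated): the fit reports the overall F statistic and an individual t test for each coefficient, with n − k − 1 residual degrees of freedom.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)  # simulated response

fit = smf.ols("y ~ x1 + x2", data=df).fit()
print(fit.fvalue, fit.f_pvalue)      # overall F test: H0: beta1 = beta2 = 0
print(fit.tvalues)                   # individual t statistics
print(fit.pvalues)                   # their p-values
print(fit.df_resid)                  # n - k - 1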

– Residuals, standard error of the estimate, and R²

• Residuals

– = the errors of the regression model

– Uses: detecting outliers, checking the assumptions of the regression analysis

• SSE and the Standard Error of the Estimate

– = the standard error of the estimated values = the standard error of estimate = the standard error of the differences

– = the dispersion of the points in the scatter plot around the best-fit line

– = indicates how widely the actual y values are spread around ŷ (the regression line)

– SSE = Σ(y − ŷ)²

– Given the regression assumption (error terms ~ ND(0)) and the empirical rule (roughly 68% of residuals within ±1se, 95% within ±2se), the standard error of the estimate is useful for measuring how well the regression model fits the data.


Coefficient Estimation (MLR)

Key Issues

• (1) Is there any relationship between the response and the predictors? – Hypothesis test

• H0: β1 = β2 = ··· = βp = 0

• Ha: at least one βj is non-zero.

– Compute the F-statistic: F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]

• where TSS = Σ(yi − ȳ)² and RSS = Σ(yi − ŷi)².

– IF H0 is true (= no relationship between response and predictors) THEN the F value will be close to 1

– IF Ha is true, THEN E{(TSS − RSS)/p} > σ², so we expect F > 1.
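A sketch computing the F statistic directly from TSS and RSS on simulated data (column names are hypothetical; scipy and statsmodels assumed available) and checking it against the value reported by the fitted model.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(2)
n, p = 100, 2
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x1"] + rng.normal(size=n)   # x2 has no real effect

fit = smf.ols("y ~ x1 + x2", data=df).fit()
tss = np.sum((df["y"] - df["y"].mean()) ** 2)   # TSS = sum (y_i - y_bar)^2
rss = np.sum(fit.resid ** 2)                    # RSS = sum (y_i - y_hat_i)^2

F = ((tss - rss) / p) / (rss / (n - p - 1))     # F-statistic
p_value = stats.f.sf(F, p, n - p - 1)           # upper-tail probability
print(round(F, 2), round(fit.fvalue, 2))        # matches statsmodels' F
print(p_value)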

• (2) Determining the importance of each variable – Variable Selection

• Mallows' Cp,

• Akaike information criterion (AIC),

• Bayesian information criterion (BIC),

• adjusted R2

– However, there are 2^p possible models, so trying them all is infeasible; classical approaches:

• Forward selection

• Backward selection

• Mixed selection

• (3) Model Fit – In SLR, R² = the square of the correlation between the explanatory variable and the response

– In MLR, it equals Cor(Y, Ŷ)²

– A property of the fitted linear model: it maximizes this correlation among all possible linear models.

– Quantify the degree of improvement in R² via the p-value

– Definition of RSE: RSE = √(RSS / (n − p − 1))

• Thus, models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.

• (4) Predictions

– Even if we knew the true values of β0, β1, ..., βp, perfect prediction would be impossible because of the random error ε (i.e., the irreducible error).

– confidence interval

– prediction interval

Other Key Issues

• Interaction terms

• Non-linear effects

• Multicollinearity

• Model Selection


• Non-linear models via mathematical transformation – <first-order model>

• one independent variable: y = β0 + β1x1 + ε  • two independent variables: y = β0 + β1x1 + β2x2 + ε

– <polynomial regression model> • contains squared, cubed, or higher powers of the predictor variable(s), and its response surfaces are curvilinear. Yet these are still special cases of the general linear model given by the formula:

• y = β0 + β1x1 + β2x2 + … + βkxk + ε

– <second-order model with one independent variable> • y = β0 + β1x1 + β2x1² + ε

– <Quadratic model> of degree 2 (= a polynomial equation of degree 2) • = a special case of the general linear model – curvilinear regression is performed by recoding the data before the multiple regression analysis is attempted.

– Quadratic form – quadratic curve

• xᵀAx = [x1 x2] [a b; c d] [x1; x2]

• = a·x1² + b·x1·x2 + c·x1·x2 + d·x2² = a·x1² + (b + c)·x1·x2 + d·x2²
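A sketch of the quadratic (second-order) model y = β0 + β1x1 + β2x1² + ε, following the recoding idea above: add a squared column to the data, then run ordinary multiple regression (data simulated; statsmodels assumed available).

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 80)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 1, 80)   # simulated curvilinear data

df = pd.DataFrame({"x1": x, "y": y})
df["x1_sq"] = df["x1"] ** 2                  # recode: add the squared predictor

fit = smf.ols("y ~ x1 + x1_sq", data=df).fit()   # still linear in the coefficients
print(fit.params)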

• Specialized non-linear models (covered later)


Model Transformation

• Concept – exponential model: linearize with a log transformation, then back-transform with the antilog (see the sketch below)

– inverse model

• Tukey's Ladder of Transformations
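A sketch of the exponential-model transformation mentioned above: take logs of y, fit a straight line, then back-transform with the antilog (data simulated; only numpy is used).

import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 40)
y = 2.0 * np.exp(0.3 * x) * rng.lognormal(0, 0.05, 40)   # simulated exponential growth

# log-transform: log(y) = log(b0) + b1 * x is linear in x
log_y = np.log(y)
b1, log_b0 = np.polyfit(x, log_y, 1)       # slope, intercept of the straight line
b0 = np.exp(log_b0)                        # antilog recovers the multiplicative constant
print(f"y_hat = {b0:.2f} * exp({b1:.3f} * x)")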


• Examples of non-linear relationships – Polynomial regression

• Interaction in regression analysis – consider the interaction term as a separate independent variable

• an interaction predictor variable can be created by multiplying the data values of one variable by the values of another variable,

• y = β0 + β1 x1 + β2 x2 + β3 x1 x2 +ε

– (x1x2 term = interaction term).

• Even though this model has 1 as the highest power of any one variable, it is considered to be a second-order equation because of the x1x2 term.
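A sketch of adding the x1·x2 interaction as its own predictor (data simulated). In statsmodels' formula syntax, "x1:x2" denotes the product term and "x1 * x2" expands to x1 + x2 + x1:x2.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({"x1": rng.uniform(0, 10, n), "x2": rng.uniform(0, 10, n)})
df["y"] = 2 + 0.5 * df["x1"] + 0.3 * df["x2"] + 0.2 * df["x1"] * df["x2"] + rng.normal(size=n)

# 'x1 * x2' expands to x1 + x2 + x1:x2, so the interaction enters as its own term
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)          # b0, b1, b2 and the x1:x2 interaction coefficient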


• Model building: search procedures – developing a regression model:

• (i) Maximize the explained proportion of the variation of the y values.

• (ii) Be as parsimonious as possible.

– Search procedures • All Possible Regressions (regressions on all possible combinations of predictors)

– If a data set contains k independent variables, all possible regressions will determine 2^k − 1 different models.

• Stepwise Regression – starts with a single predictor variable and adds and deletes predictors one step at a time, examining the fit of the model at each step, until no more significant predictors remain outside the model.

– STEP 1/2/3: …

• Forward Selection – the same as stepwise regression, except that once a variable is entered into the process, it is never dropped.

• Backward Elimination – …


• Multicollinearity

– = two or more independent variables are highly correlated (two variables: collinearity; several: multicollinearity)

– 1. It is difficult to interpret the estimates of the regression coeff’ts.

– 2. Inordinately small t values for regression coefficients may result.

– 3. S.D. of regression coefficients are overestimated.

– 4. The algebraic sign of estimated regression coefficients may be the opposite of what would be expected for a particular predictor value.

– The multicollinearity problem also affects the t values used to assess the regression coefficients.

• Multicollinearity can result in an overestimation of the s.d. of the regression coefficients, so t values tend to be understated when multicollinearity is present.

– (Approaches)

• Examine the correlation matrix to search for possible intercorrelations (see the sketch below).

• Use stepwise regression to prevent the multicollinearity problem.
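A sketch of the first approach listed above: inspect the correlation matrix of the predictors for strong intercorrelations. The data are simulated so that x2 is nearly a copy of x1.

import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # x2 almost identical to x1 -> collinearity
x3 = rng.normal(size=n)

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
print(X.corr().round(2))    # off-diagonal values near +/-1 signal multicollinearity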


Interaction

• Concept – when the effect on Y of increasing X1 depends on another variable X2.

• Example: – Advertising example:

• TV and radio advertising both increase sales.


Sales = b0 + b1×TV + b2×Radio + b3×TV×Radio

Parameter Estimates

Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   6.7502202   0.247871    27.23     <.0001 *
TV          0.0191011   0.001504    12.70     <.0001 *
Radio       0.0288603   0.008905     3.24     0.0014 *
TV*Radio    0.0010865   5.242e-5    20.73     <.0001 *

• Dummy coding – Example: "men" and "women" (category listings)

• Code as indicator variables (dummy variables); Male=0, Female=1.

• Suppose we want to include income and gender.

– β2 = average extra balance each month that females have for given income level. Males are the “baseline”.
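A sketch (not from the original slides) of the income-and-gender idea: the gender category is recoded as an indicator variable (Male = 0, Female = 1) and entered into the regression. The variable names (balance, income, gender) and the data are made up.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame({
    "income": rng.uniform(20, 100, n),
    "gender": rng.choice(["Male", "Female"], size=n),
})
df["female"] = (df["gender"] == "Female").astype(int)    # dummy: Male = 0, Female = 1
df["balance"] = 200 + 6 * df["income"] + 20 * df["female"] + rng.normal(0, 30, n)

fit = smf.ols("balance ~ income + female", data=df).fit()
print(fit.params)    # the 'female' coefficient = average extra balance for females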


Regression equation

females: salary = 112.77 + 1.86 + 6.05 × Position

males:   salary = 112.77 − 1.86 + 6.05 × Position

Different intercepts, same slopes
[Figure: salary (about 120–170) plotted against Position (0–10), with one line for women and one line for men]

Regression coefficients

                Coefficient   Std Err    t-value   p-value
Constant        233.7663      39.5322    5.9133    0.0000
Income            0.0061       0.0006   10.4372    0.0000
Gender_Female    24.3108      40.8470    0.5952    0.5521

Model Evaluation and Variable Selection

• Concept – when there are many independent variables, reduce their number to simplify the model

– i.e., minimize MSE through fitting procedures that are alternatives to OLS fitting

– Why it is needed

• Prediction Accuracy

• Model Interpretability

• Subset Selection – Stepwise Selection

– Choosing the Optimal Model

• Shrinkage Methods – Ridge Regression

– The Lasso

• 1. Prediction Accuracy – when the relationship between X and Y is linear and n >> p, OLS has relatively low bias and low variance (where n = # of observations, p = # of predictors)

– However • when n is not much larger than p, the OLS fit can have high variance and may result in overfitting and poor estimates on unseen observations,

• when n < p, the variability of the least squares fit increases dramatically, and the variance of these estimates is infinite

• 2. Model Interpretability – when there are many independent variables X, their individual effects on Y diminish

• Leaving these variables in the model makes it harder to see the “big picture”, i.e., the effect of the “important variables”

• The model would be easier to interpret by removing (i.e. setting the coefficients to zero) the unimportant variables


Solution

• Subset Selection – identify a subset of the full set of p explanatory variables X, and then fit the model using only that subset

– Examples: best subset selection, stepwise selection

• Shrinkage – shrinking the estimated coefficients towards zero reduces variance

– Some of the coefficients may shrink to exactly zero, and hence shrinkage methods can also perform variable selection

– Examples: Ridge regression, Lasso

• Dimension Reduction – involves projecting all p predictors into an M-dimensional space where M < p, and then fitting a linear regression model

– Example: Principal Components Regression


Best Subset Selection

• One simple approach – = take the subset with the smallest RSS or the largest R².

– However, R² increases (equivalently, RSS decreases) as more variables are added to the model

– Example (a sketch follows below)
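A brute-force sketch of best subset selection: fit all 2^p − 1 non-empty subsets of the candidate predictors and keep, for each size, the subset with the smallest RSS (equivalently the largest R²). Column names and data are made up.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from itertools import combinations

rng = np.random.default_rng(8)
n = 100
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
y = 1 + 2 * X["x1"] - 1.5 * X["x3"] + rng.normal(size=n)   # only x1, x3 matter

best = {}
cols = list(X.columns)
for k in range(1, len(cols) + 1):
    for subset in combinations(cols, k):            # all subsets of size k
        fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
        if k not in best or fit.ssr < best[k][1]:   # keep the smallest RSS of size k
            best[k] = (subset, fit.ssr)

for k, (subset, rss) in best.items():
    print(k, subset, round(rss, 1))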


• Measures of Comparison – add a penalty to RSS for the number of variables (complexity)

– Types

• Adjusted R2

• AIC (Akaike information criterion)

• BIC (Bayesian information criterion)

• Cp (equivalent to AIC for linear regression)


• Stepwise Selection – Background

• Best Subset Selection is computationally intensive, especially when we have a large number of predictors (large p)

– More attractive methods:

• Forward Stepwise Selection:

– Begins with the model containing no predictors, then adds, one at a time, the predictor that improves the model the most, until no further improvement is possible (see the sketch below)

• Backward Stepwise Selection:

– Begins with the model containing all predictors, then deletes, one at a time, the predictor whose removal improves the model the most, until no further improvement is possible
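A sketch of forward stepwise selection. As the stopping rule it uses AIC (one of the comparison measures listed earlier); this is a choice made for the example, not prescribed by the slides. Column names and data are made up.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 150
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x{i}" for i in range(1, 6)])
y = 2 * X["x1"] - 1 * X["x4"] + rng.normal(size=n)     # only x1 and x4 matter

selected, remaining = [], list(X.columns)
current_aic = sm.OLS(y, np.ones(n)).fit().aic          # intercept-only model

while remaining:
    # try adding each remaining predictor and keep the one with the lowest AIC
    trials = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().aic for c in remaining}
    best_col = min(trials, key=trials.get)
    if trials[best_col] >= current_aic:                # no further improvement
        break
    selected.append(best_col)
    remaining.remove(best_col)
    current_aic = trials[best_col]

print("selected predictors:", selected)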


Shrinkage Methods

• Ridge Regression – Ordinary Least Squares (OLS) estimates β by minimizing RSS = Σi (yi − β0 − Σj βj·xij)²

– Ridge Regression uses a slightly different criterion: it minimizes RSS + λ·Σj βj²

– Tuning parameter λ • is a positive value.

• has the effect of “shrinking” large values of β towards zero.

• It turns out that such a constraint should improve the fit, because shrinking the coefficients can significantly reduce their variance

• Notice that when λ = 0, we get the OLS!


– As λ increases, the standardized coefficients shrink towards 0.

– Effect • OLS estimates generally have low bias but can be highly variable. In particular, when n and p are of similar size or when n < p, the OLS estimates will be extremely variable

• The penalty term makes the ridge regression estimates biased but can also substantially reduce variance


– Effect • In general,

– RR estimates will be more biased than OLS but have lower variance

– Ridge regression will work best in situations where the OLS estimates have high variance

• If p is large, using best subset selection approach requires searching through enormous numbers of possible models

• For Ridge Regression, for any given λ, we only need to fit one model and the computations turn out to be very simple

– Ridge Regression can even be used when p > n!
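A sketch of ridge regression with scikit-learn (assumed available): the alpha argument plays the role of λ, and alpha = 0 corresponds to OLS. The data are simulated with p close to n, the situation where ridge helps most.

import numpy as np
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
n, p = 60, 50                                   # n and p of similar size
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                                  # only a few true non-zero coefficients
y = X @ beta + rng.normal(size=n)

Xs = StandardScaler().fit_transform(X)          # standardize before shrinking

ols = LinearRegression().fit(Xs, y)
ridge = Ridge(alpha=10.0).fit(Xs, y)            # alpha is the tuning parameter lambda

print("largest |coef|, OLS  :", np.abs(ols.coef_).max())
print("largest |coef|, ridge:", np.abs(ridge.coef_).max())   # shrunk towards 0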


Lasso

• Concept – (Background) Ridge Regression isn't perfect

• the penalty term will never force any of the coefficients to be exactly zero. Thus, the final model will include all variables, which makes it harder to interpret

• The LASSO is similar, but its penalty term is different

• Penalty term – Ridge Regression minimizes RSS + λ·Σj βj²

– The LASSO estimates β by minimizing RSS + λ·Σj |βj|


• Choosing the tuning parameter λ – select a grid of candidate values, use cross-validation to estimate the error rate on test data for each value of λ, and select the value that gives the lowest error rate (see the sketch below)
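A sketch of choosing λ by cross-validation with scikit-learn's LassoCV (assumed available): a grid of candidate alpha values is evaluated by k-fold CV and the value with the lowest estimated test error is kept. Note that some coefficients end up exactly zero, which is how the LASSO also performs variable selection.

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]                              # sparse true model
y = X @ beta + rng.normal(size=n)

Xs = StandardScaler().fit_transform(X)

# 5-fold cross-validation over a grid of 100 candidate alpha (lambda) values
lasso = LassoCV(n_alphas=100, cv=5, random_state=0).fit(Xs, y)

print("chosen alpha:", round(lasso.alpha_, 4))
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))   # LASSO sets some to exactly 0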
