Topic 16 - Multicollinearity: diagnostics and remedies
STAT/MATH 571A, Professor Lingling An
(Slide transcript; uploaded by maggi14, 18-Nov-2014)

Page 1: Multicollinearity Diagnostics Remedies

Topic 16 - Multicollinearity: diagnostics and remedies

STAT/MATH 571A
Professor Lingling An

Page 2: Multicollinearity Diagnostics Remedies

Outline

• Measures of multicollinearity

• Remedies


Page 3: Multicollinearity Diagnostics Remedies

Informal diagnostics of multicollinearity

• regression coefficients change greatly when predictors are included in / excluded from the model

• significant F-test but no significant t-tests for the β's (ignoring the intercept)

• regression coefficients that don't "make sense", i.e., don't match the scatterplot and/or intuition

• Type I and Type II SS very different

• predictors that have large pairwise correlations

Page 4: Multicollinearity Diagnostics Remedies

Measures of multicollinearity


Page 5: Multicollinearity Diagnostics Remedies

Variance Inflation Factor

• The VIF is related to the variance of the estimated regression coefficients

• \sigma^2\{b\} = \sigma^2 (X'X)^{-1}, where b = (b_0, b_1, \ldots, b_{p-1})

• for the standardized coefficients b_k^* = (s_k / s_Y)\, b_k, k = 1, \ldots, p-1, it can be shown that \sigma^2\{b^*\} = (\sigma^*)^2\, r_{XX}^{-1}

• where r_{XX} is the correlation matrix of the p-1 predictor variables:

r_{XX} =
\begin{pmatrix}
1 & r_{12} & \cdots & r_{1,p-1} \\
r_{21} & 1 & \cdots & r_{2,p-1} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p-1,1} & r_{p-1,2} & \cdots & 1
\end{pmatrix}

Page 6: Multicollinearity Diagnostics Remedies

Variance Inflation Factor

• VIF_k is the kth diagonal element of r_{XX}^{-1}

• One can show VIF_k = (1 - R_k^2)^{-1}

• where R_k^2 is the coefficient of multiple determination when X_k is regressed on the p-2 other X variables in the model

• special case: if there are only two X variables, R_1^2 = R_2^2 = r_{12}^2

• so VIF_k = 1 when R_k^2 = 0, i.e., when X_k is not linearly related to the other X variables

• when R_k^2 ≠ 0, then VIF_k > 1, indicating an inflated variance for b_k^* as a result of multicollinearity among the X variables
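The two equivalent VIF formulas above can be checked numerically. A minimal sketch in Python/numpy (not the course's SAS; the data below are made up): the diagonal of the inverted correlation matrix matches 1/(1 − R_k²) obtained by regressing one standardized predictor on the others.

```python
import numpy as np

# Made-up predictors: x1 and x2 nearly collinear, x3 unrelated.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # unrelated to the others
X = np.column_stack([x1, x2, x3])

r_xx = np.corrcoef(X, rowvar=False)  # correlation matrix of the predictors
vif = np.diag(np.linalg.inv(r_xx))   # VIF_k = kth diagonal of r_XX^{-1}

# Equivalent formula: VIF_k = 1 / (1 - R_k^2), with R_k^2 from regressing
# the (standardized) X_k on the other predictors.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
others = Z[:, [1, 2]]
bhat, *_ = np.linalg.lstsq(others, Z[:, 0], rcond=None)
resid = Z[:, 0] - others @ bhat
R2_1 = 1 - (resid @ resid) / (Z[:, 0] @ Z[:, 0])

print(vif)            # VIFs for x1 and x2 are large, x3's is near 1
print(1 / (1 - R2_1)) # identical to vif[0]
```

The mean VIF from the next slide is then simply `vif.mean()`.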

Page 7: Multicollinearity Diagnostics Remedies

• when X_k has a perfect linear association with the other X variables in the model, so that R_k^2 = 1, then VIF_k and \sigma^2\{b_k^*\} are unbounded.

• Rule of thumb: a VIF_k of 10 or more suggests strong multicollinearity

• the mean of the VIF values also provides information about the severity of the multicollinearity

• that is, compare the mean VIF to 1:

\overline{VIF} = \frac{\sum VIF_k}{p-1}

• mean VIF values considerably larger than 1 are indicative of serious multicollinearity problems

Page 8: Multicollinearity Diagnostics Remedies

Tolerance

• Equivalent to VIF

• Tolerance(TOL) = 1/VIF

• Trouble if TOL < .1

• SAS gives VIF and TOL results for each

predictor

• Use either VIF or TOL


Page 9: Multicollinearity Diagnostics Remedies

Example Page 256

• Twenty healthy female subjects

• Y is body fat via underwater weighing

• Underwater weighing expensive/difficult

• X1 is triceps skinfold thickness

• X2 is thigh circumference

• X3 is midarm circumference


Page 10: Multicollinearity Diagnostics Remedies

Output

Analysis of Variance

                          Sum of       Mean
Source           DF      Squares      Square   F Value   Pr > F
Model             3    396.98461   132.32820     21.52   <.0001
Error            16     98.40489     6.15031
Corrected Total  19    495.38950

Root MSE          2.47998    R-Square   0.8014
Dependent Mean   20.19500    Adj R-Sq   0.7641
Coeff Var        12.28017

Parameter Estimates

                  Parameter    Standard
Variable    DF     Estimate       Error   t Value   Pr > |t|
Intercept    1    117.08469    99.78240      1.17     0.2578
skinfold     1      4.33409     3.01551      1.44     0.1699
thigh        1     -2.85685     2.58202     -1.11     0.2849
midarm       1     -2.18606     1.59550     -1.37     0.1896

Page 11: Multicollinearity Diagnostics Remedies

SAS code for VIF and TOL

proc reg data=a1;

model fat=skinfold thigh midarm/vif tol;

run;

Parameter Estimates

                  Parameter    Standard
Variable    DF     Estimate       Error   t Value   Pr > |t|   Tolerance
Intercept    1    117.08469    99.78240      1.17     0.2578     .
skinfold     1      4.33409     3.01551      1.44     0.1699     0.00141
thigh        1     -2.85685     2.58202     -1.11     0.2849     0.00177
midarm       1     -2.18606     1.59550     -1.37     0.1896     0.00956

Parameter Estimates

                   Variance
Variable    DF    Inflation
Intercept    1            0
skinfold     1    708.84291
thigh        1    564.34339
midarm       1    104.60601

Page 12: Multicollinearity Diagnostics Remedies

Some remedial measures

• If the model is a polynomial regression model, use centered variables.

• Drop one or more predictor variables (i.e., variable selection).

  - Standard errors of the parameter estimates decrease.

  - However, how can we tell whether the dropped variable(s) give us any useful information?

  - If a dropped variable is important, the remaining parameter estimates become biased.

• Sometimes, observations can be designed to break the multicollinearity.

Page 13: Multicollinearity Diagnostics Remedies

• Get coefficient estimates from additional data from other contexts.

For instance, if the model is Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i and you have an estimator b_1 (for \beta_1) based on another data set, you can estimate \beta_2 by regressing the adjusted variable Y_i' = Y_i - b_1 X_{i1} on X_{i2}. (Common example: in economics, using cross-sectional data to estimate parameters for a time-dependent model.)

• Use the first few principal components (or factor loadings) of the predictor variables. (Limitation: may lose interpretability.)

Note: all these methods use the ordinary least squares method; an alternative is to modify the method of least squares itself: biased regression or coefficient shrinkage (example: ridge regression).
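The adjusted-variable idea above can be sketched in a few lines of Python/numpy (the data and the outside estimate b1 = 2.0 are made up for illustration):

```python
import numpy as np

# Made-up model: Y = 5 + 2*X1 + 3*X2 + noise, with X1, X2 collinear.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=n)   # collinear with x1
y = 5 + 2 * x1 + 3 * x2 + rng.normal(size=n)

b1 = 2.0                # estimator of beta1 obtained from another data set
y_adj = y - b1 * x1     # adjusted variable Y' = Y - b1*X1

# Simple linear regression of Y' on X2 (with intercept) estimates beta2.
A = np.column_stack([np.ones(n), x2])
b0_hat, b2_hat = np.linalg.lstsq(A, y_adj, rcond=None)[0]
print(b2_hat)           # close to the true beta2 = 3
```

Note that this only works as well as the outside estimate b1: any bias in b1 carries over into the adjusted response.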

Page 14: Multicollinearity Diagnostics Remedies

Ridge Regression

• Modification of least squares that overcomes the multicollinearity problem

• Recall that least squares suffers because (X'X) is almost singular, resulting in highly unstable parameter estimates

• Ridge regression results in biased but more stable estimates

• Consider the correlation transformation, so the normal equations are given by r_{XX} b = r_{YX}. Since r_{XX} is difficult to invert, we add a bias constant c:

b^R = (r_{XX} + cI)^{-1} r_{YX}
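A minimal numpy sketch of this estimator (Python rather than the course's SAS; the collinear data are made up). The correlation transformation makes X'X equal to r_XX and X'Y equal to r_YX, after which b^R is a single linear solve:

```python
import numpy as np

# Made-up nearly collinear data.
rng = np.random.default_rng(2)
n = 40
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n)])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

# Correlation transformation: center and scale so Zx'Zx = r_XX, Zx'Zy = r_YX.
Zx = (X - X.mean(axis=0)) / (X.std(axis=0) * np.sqrt(n))
Zy = (y - y.mean()) / (y.std() * np.sqrt(n))
r_xx = Zx.T @ Zx
r_yx = Zx.T @ Zy

c = 0.02
b_ridge = np.linalg.solve(r_xx + c * np.eye(2), r_yx)  # b_R = (r_XX + cI)^{-1} r_YX
b_ols = np.linalg.solve(r_xx, r_yx)                    # c = 0 recovers least squares
print(b_ols, b_ridge)
```

Adding c to the diagonal always shrinks the estimate: the norm of b_ridge is smaller than that of the unstable least squares solution.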

Page 15: Multicollinearity Diagnostics Remedies

Choice of c

• The key to this approach is the choice of c

• It is common to use the ridge trace and the VIFs:

  – Ridge trace: a simultaneous plot of the p-1 parameter estimates for different values of c ≥ 0. The curves may fluctuate widely when c is close to zero but eventually stabilize and slowly converge to 0.

  – The VIFs tend to fall quickly as c moves away from zero and then change only moderately after that.

• Choose c where things tend to "stabilize"

• SAS has the option ridge=c;

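The trace can be sketched by looping over a grid of c values (a Python/numpy sketch with made-up collinear data; the ridge VIFs are computed as the diagonal of (r_XX + cI)^{-1} r_XX (r_XX + cI)^{-1}, one common definition):

```python
import numpy as np

# Made-up nearly collinear data, correlation-transformed as before.
rng = np.random.default_rng(3)
n = 40
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n)])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

Zx = (X - X.mean(axis=0)) / (X.std(axis=0) * np.sqrt(n))
Zy = (y - y.mean()) / (y.std() * np.sqrt(n))
r_xx, r_yx = Zx.T @ Zx, Zx.T @ Zy
I = np.eye(2)

# Ridge trace: coefficients and VIFs for an increasing grid of c values.
for c in [0.0, 0.002, 0.01, 0.02, 0.1]:
    inv = np.linalg.inv(r_xx + c * I)
    b_r = inv @ r_yx                      # ridge estimates at this c
    vif = np.diag(inv @ r_xx @ inv)       # ridge VIFs at this c
    print(f"c={c:.3f}  b={b_r}  VIF={vif}")
```

As on the slides, the VIFs drop sharply for small c and then level off, which is what guides the choice of c.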

Page 16: Multicollinearity Diagnostics Remedies

SAS Commands

options nocenter ls=72;
goptions colors=('none');
data a1;
  infile 'C:\datasets\Ch07ta01.txt';
  input skinfold thigh midarm fat;
proc print data=a1;
run;

proc reg data=a1 outest=bfout;
  model fat=skinfold thigh midarm / ridge=0 to .1 by .003 outvif;
run;

symbol1 v='' i=sm5 l=1;
symbol2 v='' i=sm5 l=2;
symbol3 v='' i=sm5 l=3;
proc gplot;
  plot (skinfold thigh midarm)*_ridge_ / overlay vref=0;
run;
quit;

Page 17: Multicollinearity Diagnostics Remedies

Ridge Trace

[Figure: ridge trace of the parameter estimates versus c]

Each value of c gives different values of the parameter estimates. (Note the instability of the estimate values for small c.)

Page 18: Multicollinearity Diagnostics Remedies

[Figure: VIFs plotted versus the ridge constant c]

Page 19: Multicollinearity Diagnostics Remedies

Obs   _RIDGE_   skinfold     thigh    midarm

  2     0.000    708.843   564.343   104.606
  4     0.002     50.559    40.448     8.280
  6     0.004     16.982    13.725     3.363
  8     0.006      8.503     6.976     2.119
 10     0.008      5.147     4.305     1.624
 12     0.010      3.486     2.981     1.377
 14     0.012      2.543     2.231     1.236
 16     0.014      1.958     1.764     1.146
 18     0.016      1.570     1.454     1.086
 20     0.018      1.299     1.238     1.043
 22     0.020      1.103     1.081     1.011
 24     0.022      0.956     0.963     0.986
 26     0.024      0.843     0.872     0.966
 28     0.026      0.754     0.801     0.949
 30     0.028      0.683     0.744     0.935
 32     0.030      0.626     0.697     0.923

Note that at RIDGE = 0.020, the VIF’s are close to 1.


Page 20: Multicollinearity Diagnostics Remedies

Obs   _RIDGE_   _RMSE_    Intercept   skinfold      thigh     midarm

  3     0.000   2.47998    117.085    4.33409   -2.85685   -2.18606
  5     0.002   2.54921     22.277    1.46445   -0.40119   -0.67381
  7     0.004   2.57173      7.725    1.02294   -0.02423   -0.44083
  9     0.006   2.58174      1.842    0.84372    0.12820   -0.34604
 11     0.008   2.58739     -1.331    0.74645    0.21047   -0.29443
 13     0.010   2.59104     -3.312    0.68530    0.26183   -0.26185
 15     0.012   2.59360     -4.661    0.64324    0.29685   -0.23934
 17     0.014   2.59551     -5.637    0.61249    0.32218   -0.22278
 19     0.016   2.59701     -6.373    0.58899    0.34131   -0.21004
 21     0.018   2.59822     -6.946    0.57042    0.35623   -0.19991
 23     0.020   2.59924     -7.403    0.55535    0.36814   -0.19163
 25     0.022   2.60011     -7.776    0.54287    0.37786   -0.18470
 27     0.024   2.60087     -8.083    0.53233    0.38590   -0.17881
 29     0.026   2.60156     -8.341    0.52331    0.39265   -0.17372
 31     0.028   2.60218     -8.559    0.51549    0.39837   -0.16926
 33     0.030   2.60276     -8.746    0.50864    0.40327   -0.16531

Note that at RIDGE = 0.020, the RMSE is increased by only about 5% (so the SSE increases by about 10%), and the parameter estimates are closer to making sense.


Page 21: Multicollinearity Diagnostics Remedies

Conclusion

So the solution at c = 0.02, with parameter estimates (-7.4, 0.56, 0.37, -0.19), seems to make the most sense.

Notes:

• The book makes a big deal about standardizing the variables; SAS does this for you in the ridge option.

• Why ridge regression? The estimates tend to be more stable, particularly outside the region of the predictor variables, and are less affected by small changes in the data. (Ordinary LS estimates can be highly unstable when there is a lot of multicollinearity.)


Page 22: Multicollinearity Diagnostics Remedies

• Major drawback: ordinary inference procedures don't work so well.

• Other procedures use different penalties, e.g., the "lasso" penalty.

Page 23: Multicollinearity Diagnostics Remedies

Background Reading

• KNN sections 10.5 and 11.2

• ridge.sas
