
MATH2831/2931

Linear Models/ Higher Linear Models.

August 28, 2013


Week 5 Lecture 2 - Last lecture:

◮ Hypothesis testing in the general linear model

◮ Testing on a subvector of β

◮ Sequential tests

◮ Multicollinearity


Week 5 Lecture 2 - This lecture:

◮ Model selection.

◮ Complexity/Goodness of fit trade off

◮ PRESS residuals and the PRESS statistic

◮ The hat matrix and computation of PRESS residuals.

◮ Examples: cheddar cheese tastings.


Week 5 Lecture 2 - Model selection

◮ There are many reasons for building a statistical model; different reasons relate to different decisions to be made.

◮ REMEMBER: We need to manage a trade off between complexity and goodness of fit in providing a reduction of the data useful for decision making.

◮ General linear model: the problem of model selection appears through the decision about which predictor variables to include.

◮ Why do model selection in any case?

◮ Often data are collected for a large number of predictors, some of which might be unrelated to the response.

◮ Reasons include summarization, deciding which predictors are important, and prediction.


Week 5 Lecture 2 - Predictive criteria

◮ Prediction is perhaps the most common reason for building a statistical model.

◮ QUESTION: Is it harmful for prediction if we fit a statistical model which is more complicated than we really need?

◮ Fitting an unnecessarily complicated model can be harmful: we consider the most elementary case, the simple linear regression model.

◮ Responses y1, ..., yn with corresponding predictor values x1, ..., xn.

◮ Suppose the simple regression model holds,

yi = β0 + β1xi + ǫi

where ǫi, i = 1, ..., n, are a collection of uncorrelated zero-mean errors with common variance σ², say.


Week 5 Lecture 2 - Predictive criteria

◮ Write M0 for the model

yi = β0 + ǫi

in which the predictor xi is excluded

◮ Write M1 for the full model

yi = β0 + β1xi + ǫi .

◮ Fit these two models to the data.

◮ Evaluation of predictive performance: look at expected squared prediction error for a new observation y∗ with corresponding predictor value x∗.

◮ Which model is better?


Week 5 Lecture 2 - Predictive criteria

◮ Consider first model M0, which involves just an intercept.

◮ To estimate β0, minimize

∑_{i=1}^{n} (yi − β0)².

Differentiating gives

−2 ∑_{i=1}^{n} (yi − β0).

The least squares estimator b0 of β0 satisfies

∑_{i=1}^{n} (yi − b0) = 0,

from which we have

∑_{i=1}^{n} yi = n b0, or b0 = ȳ.


Week 5 Lecture 2 - Predictive criteria

◮ For M0, our prediction of a new observation y∗ from the fitted model is ȳ.

◮ Notation: write ŷ0(x∗) for the predicted value of y∗ for model M0 when the predictor is x∗.

◮ For M1, write ŷ1(x∗) for the predicted value of y∗ when the predictor is x∗,

ŷ1(x∗) = b0 + b1x∗,

where b0 and b1 are the least squares estimators of β0 and β1.


Week 5 Lecture 2 - Predictive criteria

◮ Consider

E((y∗ − ŷ0(x∗))²)

and

E((y∗ − ŷ1(x∗))²).

◮ For a random variable Z, Var(Z) = E(Z²) − E(Z)², so that

E(Z²) = Var(Z) + E(Z)².

◮ Apply this to y∗ − ŷ0(x∗). EXERCISE!


Week 5 Lecture 2 - Predictive criteria

E((y∗ − ŷ0(x∗))²) = Var(y∗ − ŷ0(x∗)) + [E(y∗ − ŷ0(x∗))]²

                 = Var(y∗) + Var(ŷ0(x∗)) + [E(y∗ − ŷ0(x∗))]²

                 = σ² + Var(ŷ0(x∗)) + [E(y∗ − ŷ0(x∗))]².

Now derive

Var(ŷ0(x∗)) = Var(ȳ) = Var((1/n) ∑_{i=1}^{n} yi)

            = (1/n²) ∑_{i=1}^{n} Var(yi)   (assuming the yi are independent)

            = σ²/n,

so

E((y∗ − ŷ0(x∗))²) = σ² + σ²/n + [E(y∗ − ŷ0(x∗))]².


Week 5 Lecture 2 - Predictive criteria

Also

E((y∗ − ŷ1(x∗))²) = σ² + Var(ŷ1(x∗)) + [E(y∗ − ŷ1(x∗))]².

Now recall

Var(ŷ1(x∗)) = σ² (1/n + (x∗ − x̄)²/Sxx),

so

E((y∗ − ŷ1(x∗))²) = σ² + σ² (1/n + (x∗ − x̄)²/Sxx) + [E(y∗ − ŷ1(x∗))]².


Week 5 Lecture 2 -Predictive criteria

◮ The expressions are easily interpreted: error variance + variance of prediction + squared bias of prediction.

◮ We expect the variance of prediction to be larger for the more complex model.

◮ The variance of prediction reflects uncertainty due to estimation of parameters, which is larger for the model with more parameters.

◮ But - if there is bias of prediction due to omission of the predictor in M0, the squared bias term may be large for the smaller model.

◮ Bias/variance trade off: the goodness of fit/complexity trade off.


Week 5 Lecture 2 - Predictive criteria

◮ Variance of prediction (regardless of whether model M0 or M1 holds): from previous lectures,

Var(ŷ1(x∗)) = σ² (1/n + (x∗ − x̄)²/Sxx).

Also,

Var(ŷ0(x∗)) = σ²/n.

Prediction variance is larger for the more complex model.

◮ However, prediction bias may be less for the more complex model if the predictor is important.

◮ Model selection involves managing a trade off between bias and variance contributions to expected squared prediction error.
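These expressions can be checked by simulation. The following is a minimal sketch, not from the lecture: the true values β0 = 2, β1 = 0.5, σ = 1, the fixed design and the prediction point x∗ = 3 are assumed purely for illustration, and the Monte Carlo mean squared prediction errors of M0 and M1 are compared with the theoretical expressions above.

```python
# Sketch: compare expected squared prediction error of M0 (intercept only)
# and M1 (simple regression) when the true model is y = beta0 + beta1*x + eps.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0          # assumed true parameter values
n, x_star, n_rep = 20, 3.0, 20000
x = np.linspace(-2, 2, n)                    # assumed fixed design

sq_err_m0, sq_err_m1 = [], []
for _ in range(n_rep):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    y_star = beta0 + beta1 * x_star + rng.normal(0, sigma)   # new observation at x*

    pred0 = y.mean()                                         # M0: predict with ybar
    Sxx = np.sum((x - x.mean()) ** 2)                        # M1: least squares fit
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    pred1 = b0 + b1 * x_star

    sq_err_m0.append((y_star - pred0) ** 2)
    sq_err_m1.append((y_star - pred1) ** 2)

print("M0 Monte Carlo:", np.mean(sq_err_m0))
print("M1 Monte Carlo:", np.mean(sq_err_m1))
# Theoretical values from the slides: sigma^2 + prediction variance + squared bias.
Sxx = np.sum((x - x.mean()) ** 2)
print("M0 theory:", sigma**2 + sigma**2 / n + (beta1 * (x_star - x.mean())) ** 2)
print("M1 theory:", sigma**2 * (1 + 1 / n + (x_star - x.mean()) ** 2 / Sxx))
```

With these assumed values the omitted-predictor bias dominates, so M0 predicts worse; shrinking β1 towards zero reverses the comparison, which is exactly the trade off described above.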


Week 5 Lecture 2 - Out of sample prediction

◮ There is a problem with using the same data both for model fitting and model evaluation.

◮ DEFINE: Out-of-sample prediction errors are the prediction errors for new observations which were not used in fitting the model.

◮ RECALL: Within-sample prediction errors are given by the residuals!

◮ Within-sample prediction errors (residuals) will typically be smaller than so-called out-of-sample prediction errors, since we have chosen the parameters in the fitted model by minimizing a measure of within-sample prediction error (see the sketch below).

◮ This motivates a new kind of residual, and a statistic based on these residuals, which can be used for model selection.
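A minimal sketch of this point, with a simulated data set assumed purely for illustration: fit a multiple regression to one sample and compare the average squared within-sample residual with the average squared error on fresh data generated from the same model.

```python
# Sketch: within-sample residuals versus out-of-sample prediction errors.
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma = 30, 5, 1.0
beta = rng.normal(size=k + 1)                          # assumed true coefficients

def design(m):
    return np.column_stack([np.ones(m), rng.normal(size=(m, k))])

X = design(n)
y = X @ beta + rng.normal(0, sigma, n)
b = np.linalg.lstsq(X, y, rcond=None)[0]               # fit on the "training" sample

X_new = design(n)                                      # fresh observations
y_new = X_new @ beta + rng.normal(0, sigma, n)

print("within-sample mean squared residual:", np.mean((y - X @ b) ** 2))
print("out-of-sample mean squared error:   ", np.mean((y_new - X_new @ b) ** 2))
```

Typically the first number is smaller, because the fitted parameters were chosen to minimize exactly that quantity.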


Week 5 Lecture 2 - PRESS residuals

RECALL: the errors ǫ in the linear model are unobservable random variables, assumed to have zero mean and common variance σ², to be uncorrelated, and (for inference) to follow a Gaussian distribution.

RECALL: the residuals e are computed quantities that can be graphed and studied.

◮ Write ŷi,−i for the forecast of the ith observation obtained by fitting the model to the data using all observations except the ith.

◮ DEFINE: the ith PRESS residual is

ei,−i = yi − ŷi,−i.
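A direct "leave one out and refit" computation of the PRESS residuals might look like the following sketch (the simple-regression data are simulated and assumed purely for illustration):

```python
# Sketch: PRESS residuals by deleting one observation at a time and refitting.
import numpy as np

rng = np.random.default_rng(2)
n = 25
x = rng.uniform(0, 10, n)
y = 1.0 + 0.8 * x + rng.normal(0, 1, n)                # assumed toy data
X = np.column_stack([np.ones(n), x])

press_resid = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i                           # drop the ith observation
    b_minus_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press_resid[i] = y[i] - X[i] @ b_minus_i           # e_{i,-i} = y_i - yhat_{i,-i}

# Sum of squared PRESS residuals: the PRESS statistic defined on the next slide.
print("PRESS =", np.sum(press_resid ** 2))
```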


Week 5 Lecture 2 - The hat matrix

◮ A global measure of goodness of fit based on the PRESS residuals is the PRESS statistic

PRESS = ∑_{i=1}^{n} e_{i,−i}².

◮ The PRESS statistic does not necessarily decrease as we make the model more complex.

◮ Computation of the PRESS statistic seems difficult: a direct method of computation involves deleting one observation at a time, fitting a different model each time.

◮ PRESS residuals can be written in terms of the ordinary residuals and so-called leverages.

◮ DEFINITION: Leverages are the diagonal elements of the so-called hat matrix H,

H = X(XᵀX)⁻¹Xᵀ.


Week 5 Lecture 2 - The hat matrix

Leverage for the ith observation, hii: apart from σ², it is the variance of prediction at the ith predictor value xi,

hii = 1/n + (xi − x̄)²/Sxx.

The ith leverage is large if xi is far from x̄ (the predictor is extreme in the space of predictors).

Show that hii and hij can be written as

hii = 1/n + (xi − x̄)²/Sxx

and

hij = 1/n + (xi − x̄)(xj − x̄)/Sxx.

EXERCISE: Work in the setting of the simple linear model.
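The formula can also be checked numerically. A minimal sketch, assuming a random simple-regression design: the diagonal of H = X(XᵀX)⁻¹Xᵀ is compared with 1/n + (xi − x̄)²/Sxx.

```python
# Sketch: verify the simple-regression leverage formula against diag(H).
import numpy as np

rng = np.random.default_rng(3)
n = 10
x = rng.normal(size=n)                          # assumed predictor values
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
Sxx = np.sum((x - x.mean()) ** 2)
formula = 1 / n + (x - x.mean()) ** 2 / Sxx

print(np.allclose(np.diag(H), formula))         # True: the two agree
```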


Week 5 Lecture 2 - The hat matrix

For the simple linear model, writing out (XᵀX)⁻¹ explicitly,

hij = [ 1  xi ] [ ∑x²i/(nSxx)   −x̄/Sxx ] [ 1  ]
                [ −x̄/Sxx        1/Sxx  ] [ xj ]

    = 1/n + (xi − x̄)(xj − x̄)/Sxx.


Week 5 Lecture 2 - The hat matrix

Show that hij = hji for any i, j.

EXERCISE: since hij is a scalar and (XᵀX)⁻¹ is symmetric,

hij = xiᵀ(XᵀX)⁻¹xj = (xiᵀ(XᵀX)⁻¹xj)ᵀ = xjᵀ(XᵀX)⁻¹xi = hji.


Week 5 Lecture 2 - The hat matrix

The hat matrix H is the "orthogonal projection operator onto the column space of X".

BACKGROUND TO RECALL ON LINEAR ALGEBRA:

1. Let W be a vector space. Suppose the subspaces U and V are the range and null space of H respectively.

2. The null space is the set of vectors that map to the zero vector.

3. The column space of a matrix X is the set of all possible linear combinations of its column vectors.

4. A projection is a linear transformation H from a vector space to itself such that H² = H.

5. An orthogonal projection is a projection for which the range U and the null space V are orthogonal subspaces.


Week 5 Lecture 2 - The hat matrix

These properties mean that one can show the following about H:

1. H is idempotent, i.e. HH = H² = H

2. HX = X

3. (I − H)X = 0

4. H(I − H) = 0

5. ∑_{i=1}^{n} hii = rank(X)
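These properties are easy to verify numerically; below is a minimal sketch assuming a random full-rank design matrix with an intercept column.

```python
# Sketch: numerical check of the listed properties of the hat matrix H.
import numpy as np

rng = np.random.default_rng(4)
n, p = 12, 4                                   # p columns, including the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)

print(np.allclose(H @ H, H))                   # 1. idempotent
print(np.allclose(H @ X, X))                   # 2. HX = X
print(np.allclose((I - H) @ X, 0))             # 3. (I - H)X = 0
print(np.allclose(H @ (I - H), 0))             # 4. H(I - H) = 0
print(np.isclose(np.trace(H), np.linalg.matrix_rank(X)))   # 5. sum of h_ii = rank(X)
```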


Week 5 Lecture 2 - The hat matrix

◮ Interpretation of the hat matrix: write ŷ for the vector of fitted values. Then

ŷ = Xb = X(XᵀX)⁻¹Xᵀy = Hy.

The leverage hii (the ith diagonal element of H) multiplies yi in determining ŷi.


Week 5 Lecture 2 - Leverages

◮ Leverages depend only on X: think of hii as measuring the influence of yi on ŷi by virtue of where the predictors are.

◮ Alternative motivation for the leverages: write xi = (1, xi1, ..., xik)ᵀ; then the ith diagonal element of H (the ith leverage value) is

hii = xiᵀ(XᵀX)⁻¹xi.

Since

Var(ŷ(xi)) = σ²hii,

the ith leverage is (apart from σ²) simply the prediction variance for the fitted model at xi.

EXERCISE: show that Var(ŷ(xi)) = σ²hii.


Week 5 Lecture 2 - Leverages

e = Y − Ŷ

  = Y − X(XᵀX)⁻¹XᵀY

  = [I − X(XᵀX)⁻¹Xᵀ]Y

  = [I − H]Y.

Now take the expectation and variance:

E[e] = 0

Var[e] = Var([I − H]Y) = (I − H) σ²I (I − H)ᵀ = (I − H)² σ² = σ²(I − H),

since H is idempotent, so I − H is also idempotent.
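A Monte Carlo sketch of this result, using an assumed toy design (not from the lecture): the sample covariance matrix of the residual vector over many simulated responses should approach σ²(I − H).

```python
# Sketch: empirical covariance of the residual vector approaches sigma^2 (I - H).
import numpy as np

rng = np.random.default_rng(5)
n, sigma, n_rep = 8, 1.5, 100000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # assumed design
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                                        # I - H, symmetric idempotent
beta = np.array([1.0, 2.0])                              # assumed true coefficients

Y = X @ beta + rng.normal(0, sigma, size=(n_rep, n))     # each row: one simulated response
E = Y @ M                                                # residual vectors e = (I - H) y
print(np.max(np.abs(np.cov(E, rowvar=False) - sigma**2 * M)))   # small for large n_rep
```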


Week 5 Lecture 2 - Computational formula for PRESS residuals

◮ The PRESS residual for the ith observation is ei,−i.

◮ Write ei for the ordinary residual

ei = yi − ŷi.

It can be shown that

ei,−i = ei / (1 − hii).

So computation of the PRESS residuals involves computing the residuals for the model fitted to all the data, and determining the diagonal elements of the hat matrix for the model fitted to all the data.

◮ No need to "delete and recompute".
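A sketch verifying the shortcut against explicit leave-one-out refits, with simulated data assumed purely for illustration:

```python
# Sketch: e_{i,-i} = e_i / (1 - h_ii) agrees with explicit delete-and-refit.
import numpy as np

rng = np.random.default_rng(6)
n = 25
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 0.8]) + rng.normal(0, 1, n)       # assumed toy data

# One fit to all the data: ordinary residuals and leverages.
H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
press_fast = e / (1 - np.diag(H))

# Explicit leave-one-out refits for comparison.
press_slow = np.array([
    y[i] - X[i] @ np.linalg.lstsq(np.delete(X, i, 0), np.delete(y, i), rcond=None)[0]
    for i in range(n)
])

print(np.allclose(press_fast, press_slow))               # True
print("PRESS =", np.sum(press_fast ** 2))
```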


Week 5 Lecture 2 - Review of PRESS

◮ PRESS statistic: USEFUL WHEN COMPARING COMPETING MODELS FOR THE PURPOSE OF PREDICTION, i.e. comparing competing models when the goal is prediction of future values of the response.

◮ Fit the model

yi = β0 + β1xi1 + ... + βkxik + ǫi.

The residuals

ei = yi − ŷi

are typically smaller in magnitude than the errors ǫi (since the parameters are chosen to minimize the sum of squared ei's).

◮ PRESS residuals

ei,−i = yi − ŷi,−i,

where ŷi,−i is obtained from the data with the ith observation deleted.

◮ The PRESS statistic is the sum of squared PRESS residuals.


Week 5 Lecture 2 - Computational formula for PRESS residuals

◮ The PRESS residual for the ith observation is

ei,−i = ei / (1 − hii).

Computation of the PRESS residuals involves computing residuals and leverages for a model fitted to all the data.

◮ Use of the PRESS statistic for model selection.

◮ Example: data on cheddar cheese tastings.

◮ The response is a subjective measure of cheese taste; the predictors are measures of concentration of lactic acid (lactic), acetic acid (acetic) and hydrogen sulfide (H2S).


Week 5 Lecture 2 - Example: data on cheddar cheese tastings

Model                  R²      R² (adj.)   PRESS
H2S                    0.571   0.556       3688.08
lactic                 0.496   0.478       4375.64
acetic                 0.302   0.277       6111.26
H2S, lactic            0.652   0.626       3135.44
H2S, acetic            0.582   0.551       3877.62
lactic, acetic         0.520   0.485       4535.47
H2S, lactic, acetic    0.652   0.612       3402.24
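PRESS is smallest (and adjusted R² largest) for the model with H2S and lactic, even though R² for the three-predictor model is no smaller; this is the complexity/goodness-of-fit trade off again. Below is a sketch of how such a table could be produced; the file name cheese.csv and the column names taste, H2S, Lactic, Acetic are assumptions for illustration, not taken from the lecture.

```python
# Sketch: R^2, adjusted R^2 and PRESS for every subset of the three predictors.
import itertools
import numpy as np
import pandas as pd

data = pd.read_csv("cheese.csv")                  # assumed file and column names
y = data["taste"].to_numpy()
predictors = ["H2S", "Lactic", "Acetic"]
n = len(y)
sst = np.sum((y - y.mean()) ** 2)

for r in range(1, len(predictors) + 1):
    for subset in itertools.combinations(predictors, r):
        X = np.column_stack([np.ones(n)] + [data[v].to_numpy() for v in subset])
        H = X @ np.linalg.inv(X.T @ X) @ X.T
        e = y - H @ y                             # ordinary residuals
        sse = np.sum(e ** 2)
        p = X.shape[1]                            # number of parameters incl. intercept
        r2 = 1 - sse / sst
        r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))
        press = np.sum((e / (1 - np.diag(H))) ** 2)
        print(", ".join(subset), round(r2, 3), round(r2_adj, 3), round(press, 2))
```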


Week 5 Lecture 2 - Learning Expectations.

◮ Understand the difficulties in performing model selection!

◮ Understand the bias/variance trade-off in terms of model complexity/quality of fit (goodness of fit).

◮ Know how to formulate, interpret and calculate PRESS residuals and the PRESS statistic.

◮ Understand the importance of the hat matrix in the computation of PRESS residuals.

◮ Hopefully appreciate what makes good cheddar cheese :).