MATH2831/2931
Linear Models / Higher Linear Models
August 28, 2013
Week 5 Lecture 2 - Last lecture:
◮ Hypothesis testing in the general linear model
◮ Testing on a subvector of β
◮ Sequential tests
◮ Multicollinearity
Week 5 Lecture 2 - This lecture:
◮ Model selection.
◮ Complexity/Goodness of fit trade off
◮ PRESS residuals and the PRESS statistic
◮ The hat matrix and computation of PRESS residuals.
◮ Examples: cheddar cheese tastings.
Week 5 Lecture 2 - Model selection
◮ There are many reasons for building a statistical model; different reasons relate to different decisions to be made.
◮ REMEMBER: We need to manage a trade off between complexity and goodness of fit, providing a reduction of the data useful for decision making.
◮ General linear model: the problem of model selection appears through the decision about which predictor variables to include.
◮ Why do model selection in any case?
◮ Often data are collected for a large number of predictors, some of which might be unrelated to the response.
◮ Summarization, deciding which predictors are important, prediction.
Week 5 Lecture 2 - Predictive criteria
◮ Prediction is perhaps the most common reason for building a statistical model.
◮ QUESTION: Is it harmful for prediction if we fit a statistical model which is more complicated than we really need?
◮ Fitting an unnecessarily complicated model can be harmful: we consider the most elementary case, the simple linear regression model.
◮ Responses y1, ..., yn with corresponding predictor values x1, ..., xn.
◮ The simple regression model holds:
yi = β0 + β1xi + εi
where the εi, i = 1, ..., n, are a collection of zero mean, uncorrelated errors with common variance σ², say.
Week 5 Lecture 2 - Predictive criteria
◮ Write M0 for the model
yi = β0 + εi
in which the predictor xi is excluded.
◮ Write M1 for the full model
yi = β0 + β1xi + εi.
◮ Fit these two models to the data.
◮ Evaluation of predictive performance: look at the expected squared prediction error for a new observation y∗ with corresponding predictor value x∗.
◮ Which model is better?
Week 5 Lecture 2 - Predictive criteria
◮ Consider first model M0, which involves just an intercept.
◮ To estimate β0, minimize
∑_{i=1}^n (yi − β0)².
Differentiating with respect to β0 gives
−2 ∑_{i=1}^n (yi − β0).
The least squares estimator b0 of β0 satisfies
∑_{i=1}^n (yi − b0) = 0,
from which we have ∑_{i=1}^n yi = n·b0, or b0 = ȳ.
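This least squares result can be checked numerically. Below is a minimal sketch with made-up response values (not the lecture data), minimizing the sum of squares over a grid of candidate intercepts:

```python
import numpy as np

# Hypothetical response values (illustrative only, not from the lecture)
y = np.array([2.0, 3.5, 1.0, 4.5, 3.0])

# For the intercept-only model y_i = beta0 + eps_i, minimize
# sum((y_i - b0)^2) over a fine grid of candidate b0 values.
b0_grid = np.linspace(y.min(), y.max(), 10001)
sse = ((y[:, None] - b0_grid[None, :]) ** 2).sum(axis=0)
b0_numeric = b0_grid[np.argmin(sse)]

# The grid minimizer agrees with the sample mean, b0 = y-bar.
print(b0_numeric, y.mean())
```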
Week 5 Lecture 2 - Predictive criteria
◮ For M0, our prediction of a new observation y∗ from the fitted model is ȳ.
◮ Notation: write y0(x∗) for the predicted value of y∗ from model M0 when the predictor is x∗.
◮ For M1, write y1(x∗) for the predicted value of y∗ when the predictor is x∗:
y1(x∗) = b0 + b1x∗,
where b0 and b1 are the least squares estimators of β0 and β1.
Week 5 Lecture 2 - Predictive criteria
◮ Consider
E((y∗ − y0(x∗))²)
and
E((y∗ − y1(x∗))²).
For a random variable Z, Var(Z) = E(Z²) − E(Z)², so that
E(Z²) = Var(Z) + E(Z)².
Apply this to y∗ − y0(x∗). EXERCISE!
Week 5 Lecture 2 - Predictive criteria

E((y∗ − y0(x∗))²) = Var(y∗ − y0(x∗)) + E(y∗ − y0(x∗))²
= Var(y∗) + Var(y0(x∗)) + E(y∗ − y0(x∗))²
= σ² + Var(y0(x∗)) + E(y∗ − y0(x∗))².

Now derive
Var(y0(x∗)) = Var(ȳ) = Var((1/n) ∑_{i=1}^n yi) = (1/n²) ∑_{i=1}^n Var(yi) = σ²/n (assuming the yi are independent),
so that
E((y∗ − y0(x∗))²) = σ² + σ²/n + E(y∗ − y0(x∗))².
Week 5 Lecture 2 - Predictive criteria
Also
E((y∗ − y1(x∗))²) = σ² + Var(y1(x∗)) + E(y∗ − y1(x∗))².
Now recall
Var(y1(x∗)) = σ²(1/n + (x∗ − x̄)²/Sxx),
so that
E((y∗ − y1(x∗))²) = σ² + σ²(1/n + (x∗ − x̄)²/Sxx) + E(y∗ − y1(x∗))².
Week 5 Lecture 2 - Predictive criteria
◮ These expressions are easily interpreted: error variance + variance of prediction + squared bias of prediction.
◮ We expect the variance of prediction to be larger for the more complex model.
◮ Variance of prediction reflects uncertainty due to estimation of parameters, which is larger for the model with more parameters.
◮ But if there is bias of prediction due to omission of the predictor in M0, the squared bias term may be large for the smaller model.
◮ Bias/variance trade off: the goodness of fit/complexity trade off.
Week 5 Lecture 2 - Predictive criteria
◮ Variance of prediction (regardless of whether model M0 or M1 holds): from previous lectures,
Var(y1(x∗)) = σ²(1/n + (x∗ − x̄)²/Sxx).
Also,
Var(y0(x∗)) = σ²/n.
Prediction variance is larger for the more complex model.
◮ However, prediction bias may be less for the more complex model if the predictor is important.
◮ Model selection involves managing a trade off between bias and variance contributions to expected squared prediction error.
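This trade off can be illustrated by simulation. The sketch below uses hypothetical true parameter values (nothing here comes from the lecture data): when the true slope is nonzero, the squared-bias term makes the intercept-only model M0 predict worse, despite its smaller prediction variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true model: y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)
beta0, beta1, sigma = 1.0, 2.0, 1.0
x = np.linspace(0, 1, 10)
x_star = 0.9
n_sims = 20000

err0 = np.empty(n_sims)  # squared prediction error under M0 (intercept only)
err1 = np.empty(n_sims)  # squared prediction error under M1 (full model)
for s in range(n_sims):
    y = beta0 + beta1 * x + rng.normal(0, sigma, x.size)
    y_star = beta0 + beta1 * x_star + rng.normal(0, sigma)
    # M0: predict with the sample mean
    err0[s] = (y_star - y.mean()) ** 2
    # M1: predict with the fitted simple regression line
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()
    err1[s] = (y_star - (b0 + b1 * x_star)) ** 2

# With a nonzero slope, the squared-bias term dominates for M0,
# so the full model predicts better despite its larger variance.
print(err0.mean(), err1.mean())
```

Here the theoretical value for M0 is σ² + σ²/n + (β1(x∗ − x̄))² = 1 + 0.1 + 0.64 = 1.74, which the simulated average approaches.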
Week 5 Lecture 2 - Out of sample prediction
◮ There is a problem with using the same data both for model fitting and model evaluation.
◮ DEFINE: Out-of-sample prediction errors are prediction errors for new observations which were not used in fitting the model.
◮ RECALL: Within-sample prediction errors are given by the residuals!
◮ Within-sample prediction errors (residuals) will typically be smaller than out-of-sample prediction errors, since we have chosen the parameters in the fitted model by minimizing a measure of within-sample prediction error.
◮ This motivates a new kind of residual, and a statistic based on these residuals which can be used for model selection.
Week 5 Lecture 2 - PRESS residuals
RECALL: The errors ε of the linear model are assumed to be unobservable random variables, uncorrelated with zero mean and common variance σ², and with a Gaussian distribution.
RECALL: The residuals e are computed quantities that can be graphed and studied.
◮ Write yi,−i for the forecast of the ith observation obtained by fitting a model to the data using all observations except the ith.
◮ DEFINE: The ith PRESS residual is
ei,−i = yi − yi,−i.
Week 5 Lecture 2 - The hat matrix
◮ A global measure of goodness of fit based on the PRESS residuals is the PRESS statistic
PRESS = ∑_{i=1}^n e²_{i,−i}.
◮ The PRESS statistic does not necessarily decrease as we make the model more complex.
◮ Computation of the PRESS statistic seems difficult: a direct method of computation involves deleting one observation at a time, fitting a different model each time.
◮ PRESS residuals can be written in terms of the ordinary residuals and so-called leverages.
◮ DEFINITION: Leverages are the diagonal elements of the so-called hat matrix H,
H = X(XᵀX)⁻¹Xᵀ.
Week 5 Lecture 2 - The hat matrix
The leverage for the ith observation is hii: apart from σ², it is the variance of prediction at the ith predictor value xi,
hii = 1/n + (xi − x̄)²/Sxx.
The ith leverage is large if xi is far from x̄ (the predictor is extreme in the space of predictors).
EXERCISE: Working in the setting of the simple linear model, show that the elements of H can be written as
hii = 1/n + (xi − x̄)²/Sxx
and
hij = 1/n + (xi − x̄)(xj − x̄)/Sxx.
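The closed-form leverages can be checked against the diagonal of H computed directly. This is a sketch with hypothetical predictor values:

```python
import numpy as np

# Hypothetical predictor values for a simple linear regression
x = np.array([0.5, 1.0, 2.0, 3.5, 5.0])
n = x.size
X = np.column_stack([np.ones(n), x])  # design matrix: intercept column plus x

# Hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Closed-form leverages for the simple linear model
xbar = x.mean()
Sxx = ((x - xbar) ** 2).sum()
h_formula = 1 / n + (x - xbar) ** 2 / Sxx

# The two computations agree elementwise.
print(np.diag(H))
print(h_formula)
```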
Week 5 Lecture 2 - The hat matrix

In the simple linear model, with xi = (1, xi)ᵀ,

hij = xiᵀ(XᵀX)⁻¹xj
    = (1, xi) [ ∑xk²/(nSxx)   −x̄/Sxx ] (1, xj)ᵀ
              [ −x̄/Sxx         1/Sxx ]
    = 1/n + (xi − x̄)(xj − x̄)/Sxx.
Week 5 Lecture 2 - The hat matrix
EXERCISE: Show that hij = hji for any i, j.
hij = xiᵀ(XᵀX)⁻¹xj = xjᵀ(XᵀX)⁻¹xi = hji,
since hij is a scalar (so it equals its own transpose) and (XᵀX)⁻¹ is symmetric.
Week 5 Lecture 2 - The hat matrix
The hat matrix H is the "orthogonal projection operator onto the column space of X".
BACKGROUND TO RECALL ON LINEAR ALGEBRA:
1. Let W be a vector space. Suppose the subspaces U and V are the range and null space of H respectively.
2. The null space is the set of vectors that map to the zero vector.
3. The column space of a matrix X is the set of all possible linear combinations of its column vectors.
4. A projection is a linear transformation H from a vector space to itself such that H² = H.
5. An orthogonal projection is a projection for which the range U and the null space V are orthogonal subspaces.
Week 5 Lecture 2 - The hat matrix
These properties mean that one can show the following about H:
1. H is idempotent, i.e. HH = H² = H.
2. HX = X.
3. (I − H)X = 0.
4. H(I − H) = 0.
5. ∑_{i=1}^n hii = rank(X).
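All five properties can be verified numerically for any full-rank design matrix. Below is a sketch using a randomly generated, hypothetical X:

```python
import numpy as np

# Hypothetical design matrix: intercept plus two random predictors
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])

H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(8)

assert np.allclose(H @ H, H)           # 1. H is idempotent
assert np.allclose(H @ X, X)           # 2. HX = X
assert np.allclose((I - H) @ X, 0)     # 3. (I - H)X = 0
assert np.allclose(H @ (I - H), 0)     # 4. H(I - H) = 0
# 5. trace(H) = sum of leverages = rank(X) (= 3 here)
assert np.isclose(np.trace(H), np.linalg.matrix_rank(X))
print("all hat matrix properties verified")
```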
Week 5 Lecture 2 - The hat matrix
◮ Interpretation of the hat matrix: write ŷ for the vector of fitted values. Then
ŷ = Xb = X(XᵀX)⁻¹Xᵀy = Hy.
The leverage hii (the ith diagonal element) multiplies yi in determining ŷi.
Week 5 Lecture 2 - Leverages
◮ Leverages depend only on X: think of hii as measuring the influence of yi on ŷi by virtue of where the predictors are.
◮ Alternative motivation for the leverages: write xi = (1, xi1, ..., xik)ᵀ; then the ith diagonal element of H (the ith leverage value) is
hii = xiᵀ(XᵀX)⁻¹xi.
EXERCISE: Show that
Var(ŷ(xi)) = σ²hii,
so the ith leverage is (apart from σ²) simply the prediction variance for the fitted model at xi.
Week 5 Lecture 2 - Leverages

e = Y − Ŷ
  = Y − X(XᵀX)⁻¹XᵀY
  = [I − X(XᵀX)⁻¹Xᵀ]Y
  = [I − H]Y

Now take the expectation and variance:
E[e] = 0
Var[e] = Var([I − H]Y) = (I − H)σ²I(I − H)ᵀ = σ²(I − H)² = σ²(I − H),
since H is idempotent (and symmetric), so I − H is also idempotent.
Week 5 Lecture 2 - Computational formula for PRESS residuals
◮ The PRESS residual for the ith observation is ei,−i.
◮ Write ei for the ordinary residual
ei = yi − ŷi.
It can be shown that
ei,−i = ei/(1 − hii).
So computation of the PRESS residuals involves computing the residuals for the model fitted to all the data, and determining the diagonal elements of the hat matrix for the model fitted to all the data.
◮ No need to "delete and recompute".
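The shortcut formula can be verified against the direct delete-and-refit computation. The sketch below uses simulated data with hypothetical parameter values; the identity holds exactly for any full-rank design:

```python
import numpy as np

# Hypothetical data for a simple linear regression
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 12)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, x.size)
X = np.column_stack([np.ones(x.size), x])

# Fit once on all the data
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b                                   # ordinary residuals
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                  # leverages

press_shortcut = e / (1 - h)                    # e_{i,-i} = e_i / (1 - h_ii)

# Direct "delete and recompute" computation for comparison
press_direct = np.empty(x.size)
for i in range(x.size):
    keep = np.arange(x.size) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press_direct[i] = y[i] - X[i] @ b_i

print(np.allclose(press_shortcut, press_direct))  # True: the shortcut agrees
PRESS = (press_shortcut ** 2).sum()
```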
Week 5 Lecture 2 - Review of PRESS
◮ PRESS statistic: USEFUL WHEN COMPARING COMPETING MODELS FOR THE PURPOSE OF PREDICTION, i.e. comparing competing models when the goal is prediction of future values of the response.
◮ Fit the model
yi = β0 + β1xi1 + ... + βkxik + εi.
The residuals
ei = yi − ŷi
are typically smaller in magnitude than the errors εi (since the parameters are chosen to minimize the sum of squared ei's).
◮ The PRESS residuals are
ei,−i = yi − ŷi,−i,
where ŷi,−i is obtained from the data with the ith observation deleted.
◮ The PRESS statistic is the sum of squared PRESS residuals.
Week 5 Lecture 2 - Computational formula for PRESS residuals
◮ The PRESS residual for the ith observation is
ei,−i = ei/(1 − hii).
Computation of the PRESS residuals involves computing residuals and leverages for a model fitted to all the data.
◮ Use of the PRESS statistic for model selection.
◮ Example: data on cheddar cheese tastings.
◮ The response is a subjective measure of cheese taste; the predictors are measures of concentration of lactic acid (lactic), acetic acid (acetic) and hydrogen sulfide (H2S).
Week 5 Lecture 2 - Example: data on cheddar cheese tastings

Model                  R²      adjusted R²   PRESS
H2S                    0.571   0.556         3688.08
lactic                 0.496   0.478         4375.64
acetic                 0.302   0.277         6111.26
H2S, lactic            0.652   0.626         3135.44
H2S, acetic            0.582   0.551         3877.62
lactic, acetic         0.520   0.485         4535.47
H2S, lactic, acetic    0.652   0.612         3402.24
Week 5 Lecture 2 - Learning Expectations
◮ Understand the difficulties in performing model selection!
◮ Understand the bias/variance trade-off in terms of model complexity/quality of fit (goodness of fit).
◮ Know how to formulate, interpret and calculate PRESS residuals and the PRESS statistic.
◮ Understand the importance of the hat matrix in the computation of PRESS residuals.
◮ Hopefully appreciate what makes good cheddar cheese :).