Multiple Regression Analysis (multiple regression.pdf)



    Chapter 3

    Multiple Regression

    3.1 Multiple Linear Regression Model

    A fitted linear regression model always leaves some residual variation. There

might be another systematic cause for the variability in the observations $y_i$. If we have data on other explanatory variables we can ask whether they can be used to explain some of the residual variation in $Y$. If this is the case, we should take it into account in the model, so that the errors are purely random. We could write

\[
Y_i = \beta_0 + \beta_1 x_i + \underbrace{\beta_2 z_i + \varepsilon_i}_{\text{previously } \varepsilon_i}.
\]

$Z$ is another explanatory variable. Usually, we denote all explanatory variables (there may be more than two of them) using the letter $X$ with an index to distinguish between them, i.e., $X_1, X_2, \ldots, X_{p-1}$.

Example 3.1. (Neter et al., 1996) Dwine Studios, Inc. The company operates portrait studios in 21 cities of medium size. These studios specialize in portraits of children. The company is considering an expansion into other cities of medium size and wishes to investigate whether sales ($Y$) in a community can be predicted from the number of persons aged 16 or younger in the community ($X_1$) and the per capita disposable personal income in the community ($X_2$).

If we use just $X_2$ (per capita disposable personal income in the community) to model $Y$ (sales in the community) we obtain the following model fit.



    The regression equation is

    Y = - 352.5 + 31.17 X2

    S = 20.3863 R-Sq = 69.9% R-Sq(adj) = 68.3%

    Analysis of Variance

    Source DF SS MS F P

    Regression 1 18299.8 18299.8 44.03 0.000

    Error 19 7896.4 415.6

    Total 20 26196.2

Figure 3.1: (a) Fitted line plot for Dwine Studios versus per capita disposable personal income in the community. (b) Residual plots.

The regression is highly significant, but $R^2$ is rather small. This suggests that there could be some

    other factors, which are also important for the sales. We have data on the number of persons aged

    16 or younger in the community, so we will examine whether the residuals of the above fit are

    related to this variable. If yes, then including it in the model may improve the fit.

    Figure 3.2: The dependence of the residuals on X1.


    Indeed, the residuals show a possible relationship with the number of persons aged 16 or younger

    in the community. We will fit the model with both variables, X1 and X2 included, that is

\[
Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i, \qquad i = 1, \ldots, n.
\]

The model fit is the following:

    The regression equation is

    Y = - 68.9 + 1.45 X1 + 9.37 X2

    Predictor Coef SE Coef T P

    Constant -68.86 60.02 -1.15 0.266

    X1 1.4546 0.2118 6.87 0.000

    X2 9.366 4.064 2.30 0.033

    S = 11.0074 R-Sq = 91.7% R-Sq(adj) = 90.7%

    Analysis of Variance

    Source DF SS MS F P

    Regression 2 24015 12008 99.10 0.000

    Residual Error 18 2181 121

    Total 20 26196

Here we see that the intercept parameter is not significantly different from zero (p = 0.266) and so the model without the intercept was fitted. $R^2$ is now close to 100% and both parameters are highly significant.

    Regression Equation

    Y = 1.62 X1 + 4.75 X2

    Coefficients

    Term Coef SE Coef T P

    X1 1.62175 0.154948 10.4664 0.000

    X2 4.75042 0.583246 8.1448 0.000

    S = 11.0986 R-Sq = 99.68% R-Sq(adj) = 99.64%

    Analysis of Variance

    Source DF Seq SS Adj SS Adj MS F P

    Regression 2 718732 718732 359366 2917.42 0.000

    Error 19 2340 2340 123

    Total 21 721072
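The two fits above can be reproduced numerically. The following Python/NumPy sketch is not part of the original Minitab analysis; since the Dwine Studios data are not listed in these notes, it uses synthetic stand-in values, but with the real data the coefficients would match the Minitab output.

import numpy as np

def ls_fit(X, y):
    """Least squares coefficients for a given design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(1)
n = 21
x1 = rng.uniform(40, 90, n)        # stand-in for persons aged 16 or younger
x2 = rng.uniform(16, 21, n)        # stand-in for per capita disposable income
y = -68.9 + 1.45 * x1 + 9.37 * x2 + rng.normal(0, 11, n)  # stand-in sales

X_with_intercept = np.column_stack([np.ones(n), x1, x2])
X_no_intercept = np.column_stack([x1, x2])

print(ls_fit(X_with_intercept, y))   # with intercept: (b0, b1, b2)
print(ls_fit(X_no_intercept, y))     # without intercept: (b1, b2)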


    Figure 3.3: Fitted surface plot and the Dwine Studios observations.

A Multiple Linear Regression (MLR) model for a response variable $Y$ and explanatory variables $X_1, X_2, \ldots, X_{p-1}$ is
\[
\begin{aligned}
E(Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i}) &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i}, \\
\operatorname{var}(Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i}) &= \sigma^2, \quad i = 1, \ldots, n, \\
\operatorname{cov}(Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i},\; Y \mid X_1 = x_{1j}, \ldots, X_{p-1} = x_{p-1,j}) &= 0, \quad i \neq j.
\end{aligned}
\]
As in the SLR model we denote
\[
Y_i = (Y \mid X_1 = x_{1i}, \ldots, X_{p-1} = x_{p-1,i})
\]
and we usually omit the condition on the $X$s and write
\[
\begin{aligned}
\mu_i = E(Y_i) &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i}, \\
\operatorname{var}(Y_i) &= \sigma^2, \quad i = 1, \ldots, n, \\
\operatorname{cov}(Y_i, Y_j) &= 0, \quad i \neq j,
\end{aligned}
\]
or
\[
\begin{aligned}
Y_i &= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i} + \varepsilon_i, \\
E(\varepsilon_i) &= 0, \\
\operatorname{var}(\varepsilon_i) &= \sigma^2, \quad i = 1, \ldots, n, \\
\operatorname{cov}(\varepsilon_i, \varepsilon_j) &= 0, \quad i \neq j.
\end{aligned}
\]
For testing we need the assumption of Normality, i.e., we assume that
\[
Y_i \overset{\text{ind}}{\sim} N(\mu_i, \sigma^2)
\]


or
\[
\varepsilon_i \overset{\text{ind}}{\sim} N(0, \sigma^2).
\]
To simplify the notation we write the MLR model in a matrix form
\[
\boldsymbol{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{3.1}
\]
that is,
\[
\underbrace{\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}}_{=: \boldsymbol{Y}}
=
\underbrace{\begin{pmatrix}
1 & x_{1,1} & \cdots & x_{p-1,1} \\
1 & x_{1,2} & \cdots & x_{p-1,2} \\
\vdots & \vdots & & \vdots \\
1 & x_{1,n} & \cdots & x_{p-1,n}
\end{pmatrix}}_{=: X}
\underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}}_{=: \boldsymbol{\beta}}
+
\underbrace{\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}}_{=: \boldsymbol{\varepsilon}}
\]
Here $\boldsymbol{Y}$ is the vector of responses, $X$ is often called the design matrix, $\boldsymbol{\beta}$ is the vector of unknown, constant parameters and $\boldsymbol{\varepsilon}$ is the vector of random errors. The $\varepsilon_i$ are independent and identically distributed, that is
\[
\boldsymbol{\varepsilon} \sim N_n(\mathbf{0}_n, \sigma^2 I_n).
\]
Note that the properties of the errors give
\[
\boldsymbol{Y} \sim N_n(X\boldsymbol{\beta}, \sigma^2 I_n).
\]
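As a numerical illustration of the matrix form (3.1) (an added sketch, not part of the original notes), the following Python code simulates responses from $\boldsymbol{Y} \sim N_n(X\boldsymbol{\beta}, \sigma^2 I_n)$; the dimensions and parameter values are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3                                # n observations, p parameters (beta0, beta1, beta2)
X = np.column_stack([np.ones(n),            # column of ones for the intercept
                     rng.normal(size=(n, p - 1))])
beta = np.array([2.0, 1.0, -0.5])           # arbitrary "true" parameter vector
sigma = 1.5

eps = rng.normal(0.0, sigma, size=n)        # eps ~ N_n(0, sigma^2 I_n)
Y = X @ beta + eps                          # Y ~ N_n(X beta, sigma^2 I_n)
print(Y[:3])                                # first few simulated responses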

    3.2 Least squares estimation

To derive the least squares estimator (LSE) for the parameter vector $\boldsymbol{\beta}$ we minimise the sum of squares of the errors, that is
\[
\begin{aligned}
S(\boldsymbol{\beta}) &= \sum_{i=1}^{n}\left[Y_i - \{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_{p-1} x_{p-1,i}\}\right]^2 \\
&= \sum \varepsilon_i^2 \\
&= \boldsymbol{\varepsilon}^T\boldsymbol{\varepsilon} \\
&= (\boldsymbol{Y} - X\boldsymbol{\beta})^T(\boldsymbol{Y} - X\boldsymbol{\beta}) \\
&= (\boldsymbol{Y}^T - \boldsymbol{\beta}^T X^T)(\boldsymbol{Y} - X\boldsymbol{\beta}) \\
&= \boldsymbol{Y}^T\boldsymbol{Y} - \boldsymbol{Y}^T X\boldsymbol{\beta} - \boldsymbol{\beta}^T X^T\boldsymbol{Y} + \boldsymbol{\beta}^T X^T X\boldsymbol{\beta} \\
&= \boldsymbol{Y}^T\boldsymbol{Y} - 2\boldsymbol{\beta}^T X^T\boldsymbol{Y} + \boldsymbol{\beta}^T X^T X\boldsymbol{\beta}.
\end{aligned}
\]


Theorem 3.1. The LSE $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$ is given by
\[
\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T\boldsymbol{Y}
\]
if $X^T X$ is non-singular. If $X^T X$ is singular there is no unique LSE of $\boldsymbol{\beta}$.

Proof. Let $\boldsymbol{\beta}_0$ be any solution of $X^T X\boldsymbol{\beta} = X^T\boldsymbol{Y}$. Then $X^T X\boldsymbol{\beta}_0 = X^T\boldsymbol{Y}$ and
\[
\begin{aligned}
S(\boldsymbol{\beta}) - S(\boldsymbol{\beta}_0)
&= \boldsymbol{Y}^T\boldsymbol{Y} - 2\boldsymbol{\beta}^T X^T\boldsymbol{Y} + \boldsymbol{\beta}^T X^T X\boldsymbol{\beta}
 - \boldsymbol{Y}^T\boldsymbol{Y} + 2\boldsymbol{\beta}_0^T X^T\boldsymbol{Y} - \boldsymbol{\beta}_0^T X^T X\boldsymbol{\beta}_0 \\
&= -2\boldsymbol{\beta}^T X^T X\boldsymbol{\beta}_0 + \boldsymbol{\beta}^T X^T X\boldsymbol{\beta}
 + 2\boldsymbol{\beta}_0^T X^T X\boldsymbol{\beta}_0 - \boldsymbol{\beta}_0^T X^T X\boldsymbol{\beta}_0 \\
&= \boldsymbol{\beta}^T X^T X\boldsymbol{\beta} - 2\boldsymbol{\beta}^T X^T X\boldsymbol{\beta}_0 + \boldsymbol{\beta}_0^T X^T X\boldsymbol{\beta}_0 \\
&= \boldsymbol{\beta}^T X^T X\boldsymbol{\beta} - \boldsymbol{\beta}^T X^T X\boldsymbol{\beta}_0 - \boldsymbol{\beta}^T X^T X\boldsymbol{\beta}_0 + \boldsymbol{\beta}_0^T X^T X\boldsymbol{\beta}_0 \\
&= \boldsymbol{\beta}^T X^T X\boldsymbol{\beta} - \boldsymbol{\beta}^T X^T X\boldsymbol{\beta}_0 - \boldsymbol{\beta}_0^T X^T X\boldsymbol{\beta} + \boldsymbol{\beta}_0^T X^T X\boldsymbol{\beta}_0 \\
&\qquad\text{(since $\boldsymbol{\beta}^T X^T X\boldsymbol{\beta}_0$ is a scalar and equals its transpose $\boldsymbol{\beta}_0^T X^T X\boldsymbol{\beta}$)} \\
&= \boldsymbol{\beta}^T(X^T X\boldsymbol{\beta} - X^T X\boldsymbol{\beta}_0) - \boldsymbol{\beta}_0^T(X^T X\boldsymbol{\beta} - X^T X\boldsymbol{\beta}_0) \\
&= (\boldsymbol{\beta}^T - \boldsymbol{\beta}_0^T)(X^T X\boldsymbol{\beta} - X^T X\boldsymbol{\beta}_0) \\
&= (\boldsymbol{\beta} - \boldsymbol{\beta}_0)^T X^T X(\boldsymbol{\beta} - \boldsymbol{\beta}_0) \\
&= \{X(\boldsymbol{\beta} - \boldsymbol{\beta}_0)\}^T\{X(\boldsymbol{\beta} - \boldsymbol{\beta}_0)\} \ge 0,
\end{aligned}
\]
since it is a sum of squares of the elements of the vector $X(\boldsymbol{\beta} - \boldsymbol{\beta}_0)$.

We have shown that $S(\boldsymbol{\beta}) - S(\boldsymbol{\beta}_0) \ge 0$. Hence $\boldsymbol{\beta}_0$ minimises $S(\boldsymbol{\beta})$, i.e. any solution of $X^T X\boldsymbol{\beta} = X^T\boldsymbol{Y}$ minimises $S(\boldsymbol{\beta})$.

If $X^T X$ is non-singular the unique solution is $\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T\boldsymbol{Y}$. If $X^T X$ is singular there is no unique solution.
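A quick numerical check of Theorem 3.1 on simulated data (a sketch, not part of the original notes): the closed form $(X^T X)^{-1}X^T\boldsymbol{Y}$ agrees with NumPy's least squares routine. In practice one solves the normal equations, or uses a QR/SVD-based routine such as np.linalg.lstsq, rather than forming the inverse explicitly.

import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(0, 1.5, n)

beta_closed_form = np.linalg.inv(X.T @ X) @ X.T @ Y    # (X^T X)^{-1} X^T Y
beta_normal_eqns = np.linalg.solve(X.T @ X, X.T @ Y)   # solve X^T X b = X^T Y
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_closed_form, beta_lstsq),
      np.allclose(beta_normal_eqns, beta_lstsq))       # True True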


Note that, as we did for the SLM in Chapter 2, it is possible to obtain this result by differentiating $S(\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$ and setting it equal to $\mathbf{0}$.

    3.2.1 Properties of the least squares estimator

Theorem 3.2. If
\[
\boldsymbol{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \sigma^2 I),
\]
then
\[
\hat{\boldsymbol{\beta}} \sim N_p(\boldsymbol{\beta}, \sigma^2(X^T X)^{-1}).
\]

Proof. Each element of $\hat{\boldsymbol{\beta}}$ is a linear function of $Y_1, \ldots, Y_n$. We assume that the $Y_i$, $i = 1, \ldots, n$, are normally distributed. Hence $\hat{\boldsymbol{\beta}}$ is also normally distributed. The expectation and variance-covariance matrix can be shown in the same way as in Theorem 2.7.

Remark 3.1. The vector of fitted values is given by
\[
\hat{\boldsymbol{\mu}} = \hat{\boldsymbol{Y}} = X\hat{\boldsymbol{\beta}} = X(X^T X)^{-1} X^T\boldsymbol{Y} = H\boldsymbol{Y}.
\]
The matrix $H = X(X^T X)^{-1} X^T$ is called the hat matrix.

Note that
\[
H^T = H
\]
and also
\[
HH = X(X^T X)^{-1}\underbrace{X^T X(X^T X)^{-1}}_{=I} X^T = X(X^T X)^{-1} X^T = H.
\]
A matrix which satisfies the condition $AA = A$ is called an idempotent matrix. Note that if $A$ is idempotent, then $(I - A)$ is also idempotent.
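The properties $H^T = H$ and $HH = H$ are easy to verify numerically; a minimal sketch on arbitrary simulated data (an illustration added here, not part of the original notes):

import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix H = X (X^T X)^{-1} X^T
I = np.eye(n)

print(np.allclose(H, H.T))                    # symmetry: H^T = H
print(np.allclose(H @ H, H))                  # idempotency: HH = H
print(np.allclose((I - H) @ (I - H), I - H))  # I - H is also idempotent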


We now prove some results about the residual vector
\[
\boldsymbol{e} = \boldsymbol{Y} - \hat{\boldsymbol{Y}} = \boldsymbol{Y} - H\boldsymbol{Y} = (I - H)\boldsymbol{Y}.
\]
As in Theorem 2.8, here we have

Lemma 3.1. $E(\boldsymbol{e}) = \mathbf{0}$.

Proof.
\[
E(\boldsymbol{e}) = (I - H)E(\boldsymbol{Y}) = (I - X(X^T X)^{-1} X^T)X\boldsymbol{\beta} = X\boldsymbol{\beta} - X\boldsymbol{\beta} = \mathbf{0}.
\]

Lemma 3.2. $\operatorname{Var}(\boldsymbol{e}) = \sigma^2(I - H)$.

Proof.
\[
\operatorname{Var}(\boldsymbol{e}) = (I - H)\operatorname{var}(\boldsymbol{Y})(I - H)^T = (I - H)\,\sigma^2 I\,(I - H) = \sigma^2(I - H),
\]
since $(I - H)$ is symmetric and idempotent.

Lemma 3.3. The sum of squares of the residuals is $\boldsymbol{Y}^T(I - H)\boldsymbol{Y}$.

Proof.
\[
\sum_{i=1}^{n} e_i^2 = \boldsymbol{e}^T\boldsymbol{e} = \boldsymbol{Y}^T(I - H)^T(I - H)\boldsymbol{Y} = \boldsymbol{Y}^T(I - H)\boldsymbol{Y}.
\]

Lemma 3.4. The elements of the residual vector $\boldsymbol{e}$ sum to zero, i.e.
\[
\sum_{i=1}^{n} e_i = 0.
\]


Proof. We will prove this by contradiction.

Assume that $\sum e_i = nc$ where $c \neq 0$. Then
\[
\begin{aligned}
\sum e_i^2 &= \sum\{(e_i - c) + c\}^2 \\
&= \sum(e_i - c)^2 + 2c\sum(e_i - c) + nc^2 \\
&= \sum(e_i - c)^2 + 2c\bigl(\underbrace{\textstyle\sum e_i}_{=nc} - nc\bigr) + nc^2 \\
&= \sum(e_i - c)^2 + nc^2 \\
&> \sum(e_i - c)^2.
\end{aligned}
\]
But we know that $\sum e_i^2$ is the minimum value of $S(\boldsymbol{\beta})$, so there cannot exist values with a smaller sum of squares and this gives the required contradiction. So $c = 0$.

Corollary 3.1.
\[
\frac{1}{n}\sum_{i=1}^{n}\hat{Y}_i = \bar{Y}.
\]

Proof. The residuals are $e_i = Y_i - \hat{Y}_i$, so $\sum e_i = \sum(Y_i - \hat{Y}_i)$, but $\sum e_i = 0$. Hence $\sum Y_i = \sum\hat{Y}_i$ and so the result follows.
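Lemma 3.4 and Corollary 3.1 can likewise be checked numerically. The sketch below uses arbitrary simulated data; the checks succeed because the design matrix contains a column of ones, as assumed throughout this chapter.

import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(0, 1.5, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = (np.eye(n) - H) @ Y                    # residual vector e = (I - H) Y
Y_hat = H @ Y                              # fitted values

print(np.isclose(e.sum(), 0.0))            # Lemma 3.4: residuals sum to zero
print(np.isclose(Y_hat.mean(), Y.mean()))  # Corollary 3.1: mean of fitted = Ybar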

    3.3 Analysis of Variance

    We begin this section by proving the basic Analysis of Variance identity.

Theorem 3.3. The total sum of squares splits into the regression sum of squares and the residual sum of squares, that is
\[
SS_T = SS_R + SS_E.
\]

Proof.
\[
SS_T = \sum(Y_i - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2 = \boldsymbol{Y}^T\boldsymbol{Y} - n\bar{Y}^2.
\]


\[
\begin{aligned}
SS_R &= \sum(\hat{Y}_i - \bar{Y})^2
= \sum\hat{Y}_i^2 - 2\bar{Y}\underbrace{\textstyle\sum\hat{Y}_i}_{=n\bar{Y}} + n\bar{Y}^2
= \sum\hat{Y}_i^2 - n\bar{Y}^2 \\
&= \hat{\boldsymbol{Y}}^T\hat{\boldsymbol{Y}} - n\bar{Y}^2
= \hat{\boldsymbol{\beta}}^T X^T X\hat{\boldsymbol{\beta}} - n\bar{Y}^2 \\
&= \boldsymbol{Y}^T X(X^T X)^{-1}\underbrace{X^T X(X^T X)^{-1}}_{=I} X^T\boldsymbol{Y} - n\bar{Y}^2 \\
&= \boldsymbol{Y}^T H\boldsymbol{Y} - n\bar{Y}^2.
\end{aligned}
\]
We have seen (Lemma 3.3) that
\[
SS_E = \boldsymbol{Y}^T(I - H)\boldsymbol{Y}
\]
and so
\[
SS_R + SS_E = \boldsymbol{Y}^T H\boldsymbol{Y} - n\bar{Y}^2 + \boldsymbol{Y}^T(I - H)\boldsymbol{Y} = \boldsymbol{Y}^T\boldsymbol{Y} - n\bar{Y}^2 = SS_T.
\]

    F-test for the Overall Significance of Regression

Suppose we wish to test the hypothesis
\[
H_0: \beta_1 = \beta_2 = \ldots = \beta_{p-1} = 0,
\]
i.e. all coefficients except $\beta_0$ are zero, versus
\[
H_1: \neg H_0,
\]
which means that at least one of the coefficients is non-zero. Under $H_0$ the model reduces to the null model
\[
\boldsymbol{Y} = \mathbf{1}\beta_0 + \boldsymbol{\varepsilon},
\]


where $\mathbf{1}$ is a vector of ones.

    In testing H0 we are asking if there is sufficient evidence to reject the null model.

    The Analysis of variance table is given by

\[
\begin{array}{lcccc}
\text{Source} & \text{d.f.} & \text{SS} & \text{MS} & \text{VR} \\[2pt]
\text{Overall regression} & p-1 & \boldsymbol{Y}^T H\boldsymbol{Y} - n\bar{Y}^2 & \dfrac{SS_R}{p-1} & \dfrac{MS_R}{MS_E} \\[6pt]
\text{Residual} & n-p & \boldsymbol{Y}^T(I - H)\boldsymbol{Y} & \dfrac{SS_E}{n-p} & \\[6pt]
\text{Total} & n-1 & \boldsymbol{Y}^T\boldsymbol{Y} - n\bar{Y}^2 & &
\end{array}
\]

As in the SLM we have $n-1$ total degrees of freedom. Fitting a linear model with $p$ parameters $(\beta_0, \beta_1, \ldots, \beta_{p-1})$ leaves $n-p$ residual d.f. The regression d.f. are then $n - 1 - (n - p) = p - 1$.

It can be shown that $E(SS_E) = (n-p)\sigma^2$, that is, $MS_E$ is an unbiased estimator of $\sigma^2$. Also,
\[
\frac{SS_E}{\sigma^2} \sim \chi^2_{n-p}
\]
and if $\beta_1 = \ldots = \beta_{p-1} = 0$, then
\[
\frac{SS_R}{\sigma^2} \sim \chi^2_{p-1}.
\]
The two statistics are independent, hence
\[
\frac{MS_R}{MS_E} \overset{H_0}{\sim} F_{p-1,\,n-p}.
\]
This is a test function for the null hypothesis
\[
H_0: \beta_1 = \beta_2 = \ldots = \beta_{p-1} = 0
\]
versus
\[
H_1: \neg H_0.
\]
We reject $H_0$ at the $100\alpha\%$ level of significance if
\[
F_{obs} > F_{\alpha;\,p-1,\,n-p},
\]
where $F_{\alpha;\,p-1,\,n-p}$ is such that $P(F < F_{\alpha;\,p-1,\,n-p}) = 1 - \alpha$.
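A sketch of how the ANOVA quantities and the overall F-test could be computed directly from the formulas above (simulated data, added here for illustration; SciPy is used only for the $F_{p-1,\,n-p}$ tail probability):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(0, 1.5, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
Ybar = Y.mean()

SST = Y @ Y - n * Ybar**2                  # Y^T Y - n Ybar^2
SSR = Y @ H @ Y - n * Ybar**2              # Y^T H Y - n Ybar^2
SSE = Y @ (np.eye(n) - H) @ Y              # Y^T (I - H) Y

MSR, MSE = SSR / (p - 1), SSE / (n - p)
F = MSR / MSE
p_value = stats.f.sf(F, p - 1, n - p)      # P(F_{p-1, n-p} > F_obs)

print(np.isclose(SST, SSR + SSE))          # ANOVA identity
print(F, p_value)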


    3.4 Inferences about the parameters

    In Theorem 3.2 we have seen that

\[
\hat{\boldsymbol{\beta}} \sim N_p(\boldsymbol{\beta}, \sigma^2(X^T X)^{-1}).
\]
Therefore $\hat{\beta}_j \sim N(\beta_j, \sigma^2 c_{jj})$, $j = 0, 1, 2, \ldots, p-1$, where $c_{jj}$ is the $j$th diagonal element of $(X^T X)^{-1}$ (counting from 0 to $p-1$). Hence, it is straightforward to make inferences about $\beta_j$ in the usual way.

A $100(1-\alpha)\%$ confidence interval for $\beta_j$ is
\[
\hat{\beta}_j \pm t_{\frac{\alpha}{2},\,n-p}\sqrt{S^2 c_{jj}},
\]
where $S^2 = MS_E$.

The test statistic for $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$ is
\[
T = \frac{\hat{\beta}_j}{\sqrt{S^2 c_{jj}}} \sim t_{n-p} \quad \text{if } H_0 \text{ is true}.
\]

Care is needed in interpreting the confidence intervals and tests. They refer only to the model we are currently fitting. Thus not rejecting $H_0: \beta_j = 0$ does not mean that $X_j$ has no explanatory power; it means that, conditional on $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_{p-1}$ being in the model, $X_j$ has no additional power.

It is often best to think of the test as comparing the models without and with $X_j$, i.e.
\[
H_0: E(Y_i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_{j-1} x_{j-1,i} + \beta_{j+1} x_{j+1,i} + \cdots + \beta_{p-1} x_{p-1,i}
\]
versus
\[
H_1: E(Y_i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_{p-1} x_{p-1,i}.
\]
It does not tell us anything about the comparison between the models $E(Y_i) = \beta_0$ and $E(Y_i) = \beta_0 + \beta_j x_{j,i}$.
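The standard errors, t-statistics and confidence intervals for the individual coefficients follow directly from $\hat{\beta}_j \sim N(\beta_j, \sigma^2 c_{jj})$. A sketch on simulated data (added illustration, not part of the original notes):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(0, 1.5, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
S2 = resid @ resid / (n - p)               # S^2 = MS_E

c_jj = np.diag(XtX_inv)                    # c_00, c_11, ..., c_{p-1,p-1}
se = np.sqrt(S2 * c_jj)                    # standard errors of beta_hat_j
t_stats = beta_hat / se                    # T = beta_hat_j / sqrt(S^2 c_jj)
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
print(np.round(ci, 3))                     # 95% CIs for beta_0, ..., beta_{p-1}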

3.5 Confidence interval for $\mu_0$

We have
\[
E(\boldsymbol{Y}) = \boldsymbol{\mu} = X\boldsymbol{\beta}.
\]


As with simple linear regression, we might want to estimate the expected response at a specific $\boldsymbol{x}$, say $\boldsymbol{x}_0 = (1, x_{1,0}, \ldots, x_{p-1,0})^T$, i.e.
\[
\mu_0 = E(Y \mid X_1 = x_{1,0}, \ldots, X_{p-1} = x_{p-1,0}).
\]
The point estimate will be $\hat{\mu}_0 = \boldsymbol{x}_0^T\hat{\boldsymbol{\beta}}$. Assuming normality, as usual, we can obtain a confidence interval for $\mu_0$.

Theorem 3.4.
\[
\hat{\mu}_0 \sim N(\mu_0, \sigma^2\boldsymbol{x}_0^T(X^T X)^{-1}\boldsymbol{x}_0).
\]

Proof.

(i) $\hat{\mu}_0 = \boldsymbol{x}_0^T\hat{\boldsymbol{\beta}}$ is a linear combination of $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{p-1}$, each of which is normal. Hence $\hat{\mu}_0$ is also normal.

(ii)
\[
E(\hat{\mu}_0) = E(\boldsymbol{x}_0^T\hat{\boldsymbol{\beta}}) = \boldsymbol{x}_0^T E(\hat{\boldsymbol{\beta}}) = \boldsymbol{x}_0^T\boldsymbol{\beta} = \mu_0.
\]

(iii)
\[
\operatorname{Var}(\hat{\mu}_0) = \operatorname{Var}(\boldsymbol{x}_0^T\hat{\boldsymbol{\beta}}) = \boldsymbol{x}_0^T\operatorname{Var}(\hat{\boldsymbol{\beta}})\boldsymbol{x}_0 = \sigma^2\boldsymbol{x}_0^T(X^T X)^{-1}\boldsymbol{x}_0.
\]

The following corollary is a consequence of Theorem 3.4.

Corollary 3.2. A $100(1-\alpha)\%$ confidence interval for $\mu_0$ is
\[
\hat{\mu}_0 \pm t_{\frac{\alpha}{2},\,n-p}\sqrt{S^2\boldsymbol{x}_0^T(X^T X)^{-1}\boldsymbol{x}_0}.
\]
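Corollary 3.2 translates directly into code; the sketch below (an added illustration) computes a 95% confidence interval for the mean response at an arbitrary point $\boldsymbol{x}_0$, again on simulated data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(0, 1.5, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
S2 = np.sum((Y - X @ beta_hat)**2) / (n - p)

x0 = np.array([1.0, 0.5, -1.0])            # (1, x_{1,0}, x_{2,0}) - arbitrary point
mu0_hat = x0 @ beta_hat
se_mu0 = np.sqrt(S2 * x0 @ XtX_inv @ x0)   # sqrt(S^2 x0^T (X^T X)^{-1} x0)

t_crit = stats.t.ppf(0.975, df=n - p)
print(mu0_hat - t_crit * se_mu0, mu0_hat + t_crit * se_mu0)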


    3.6 Predicting a new observation

    To predict a new observation we need to take into account not only its expectation,

    but also a possible new random error.

The point estimator of a new observation
\[
Y_0 = (Y \mid X_1 = x_{1,0}, \ldots, X_{p-1} = x_{p-1,0}) = \mu_0 + \varepsilon_0
\]
is
\[
\hat{Y}_0 = \boldsymbol{x}_0^T\hat{\boldsymbol{\beta}} \;(= \hat{\mu}_0),
\]
which, assuming normality, is such that
\[
\hat{Y}_0 \sim N(\mu_0, \sigma^2\boldsymbol{x}_0^T(X^T X)^{-1}\boldsymbol{x}_0).
\]
Then
\[
\hat{Y}_0 - \mu_0 \sim N(0, \sigma^2\boldsymbol{x}_0^T(X^T X)^{-1}\boldsymbol{x}_0)
\]
and
\[
\hat{Y}_0 - (\mu_0 + \varepsilon_0) \sim N(0, \sigma^2\boldsymbol{x}_0^T(X^T X)^{-1}\boldsymbol{x}_0 + \sigma^2).
\]
That is,
\[
\hat{Y}_0 - Y_0 \sim N(0, \sigma^2\{1 + \boldsymbol{x}_0^T(X^T X)^{-1}\boldsymbol{x}_0\})
\]
and hence
\[
\frac{\hat{Y}_0 - Y_0}{\sqrt{\sigma^2\{1 + \boldsymbol{x}_0^T(X^T X)^{-1}\boldsymbol{x}_0\}}} \sim N(0, 1).
\]
As usual we estimate $\sigma^2$ by $S^2$ and get
\[
\frac{\hat{Y}_0 - Y_0}{\sqrt{S^2\{1 + \boldsymbol{x}_0^T(X^T X)^{-1}\boldsymbol{x}_0\}}} \sim t_{n-p}.
\]
Hence a $100(1-\alpha)\%$ prediction interval for $Y_0$ is given by
\[
\hat{Y}_0 \pm t_{\frac{\alpha}{2},\,n-p}\sqrt{S^2\{1 + \boldsymbol{x}_0^T(X^T X)^{-1}\boldsymbol{x}_0\}}.
\]
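The corresponding prediction interval differs from the confidence interval for $\mu_0$ only in the extra $1+$ inside the square root, which accounts for the new error $\varepsilon_0$; a sketch (same simulated setup as before):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(0, 1.5, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
S2 = np.sum((Y - X @ beta_hat)**2) / (n - p)

x0 = np.array([1.0, 0.5, -1.0])
Y0_hat = x0 @ beta_hat
se_pred = np.sqrt(S2 * (1.0 + x0 @ XtX_inv @ x0))   # note the extra "1 +"

t_crit = stats.t.ppf(0.975, df=n - p)
print(Y0_hat - t_crit * se_pred, Y0_hat + t_crit * se_pred)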


    3.7 Model Building

    We have already mentioned the principle of parsimony; we should use the simplest

    model that achieves our purpose.

It is easy to get a simple model ($Y_i = \beta_0 + \varepsilon_i$) and it is easy to represent the response by the data themselves. However, the first is generally too simple and

    the second is not a useful model. Achieving a simple model that describes the

    data well is something of an art. Often, there is more than one model which does

    a reasonable job.

Example 3.2. Sales. A company is interested in the dependence of sales on promotional expenditure ($X_1$, in 1000), the number of active accounts ($X_2$), the district potential ($X_3$, coded), and the number of competing brands ($X_4$). We will try to find a good multiple regression model for the response variable $Y$ (sales).

Data on last year's sales ($Y$, in 100,000) in 15 sales districts are given in the file

    Sales.txt on the course website.

Figure 3.4: The Matrix Plot indicates that $Y$ is clearly related to $X_4$ and also to $X_2$. The relation with the other explanatory variables is not that obvious.


Let us start with fitting a simple regression model of Y as a function of X4 only.

    The regression equation is

    Y = 396 - 25.1 X4

    Predictor Coef SE Coef T P

    Constant 396.07 49.25 8.04 0.000

    X4 -25.051 5.242 -4.78 0.000

    S = 49.9868 R-Sq = 63.7% R-Sq(adj) = 60.9%

    Analysis of Variance

    Source DF SS MS F P

    Regression 1 57064 57064 22.84 0.000

    Residual Error 13 32483 2499

    Total 14 89547

We can see that the plot of residuals versus fitted values indicates that there may be non-constant variance, and the linearity of the model is also in question. We will add X2 to the model.

The regression equation is

Y = 190 - 22.3 X4 + 3.57 X2

    Predictor Coef SE Coef T P

    Constant 189.83 10.13 18.74 0.000

    X4 -22.2744 0.7076 -31.48 0.000

    X2 3.5692 0.1333 26.78 0.000

    S = 6.67497 R-Sq = 99.4% R-Sq(adj) = 99.3%

    Analysis of Variance


    Source DF SS MS F P

Regression 2 89012 44506 998.90 0.000

Residual Error 12 535 45

    Total 14 89547

    Source DF Seq SS

    X4 1 57064

    X2 1 31948

Still, there is some evidence that the standardized residuals may not have constant variance. Will this change if we add X3 to the model?

    The regression equation is

    Y = 190 - 22.3 X4 + 3.56 X2 + 0.049 X3

    Predictor Coef SE Coef T P

    Constant 189.60 10.76 17.62 0.000

    X4 -22.2679 0.7408 -30.06 0.000

    X2 3.5633 0.1482 24.05 0.000

    X3 0.0491 0.4290 0.11 0.911

    S = 6.96763 R-Sq = 99.4% R-Sq(adj) = 99.2%

    Analysis of Variance

    Source DF SS MS F P

    Regression 3 89013 29671 611.17 0.000

    Residual Error 11 534 49

    Total 14 89547

    Source DF Seq SS

    X4 1 57064

    X2 1 31948

    X3 1 1


Not much better than before. Now we add X1, the explanatory variable least related to Y.

    The regression equation is

    Y = 177 - 22.2 X4 + 3.54 X2 + 0.204 X3 + 2.17 X1

    Predictor Coef SE Coef T P

    Constant 177.229 8.787 20.17 0.000

    X4 -22.1583 0.5454 -40.63 0.000

    X2 3.5380 0.1092 32.41 0.000

    X3 0.2035 0.3189 0.64 0.538

    X1 2.1702 0.6737 3.22 0.009

    S = 5.11930 R-Sq = 99.7% R-Sq(adj) = 99.6%

    Analysis of Variance

    Source DF SS MS F P

    Regression 4 89285 22321 851.72 0.000

    Residual Error 10 262 26

    Total 14 89547

    Source DF Seq SS

X4 1 57064

X2 1 31948

    X3 1 1

    X1 1 272


    The residuals now do not contradict the model assumptions. We analyze the numerical output.

    Here we see that X3 may be a redundant variable as we have no evidence to reject the hypothesis

that $\beta_3 = 0$ given that all the other variables are in the model. Hence, we will fit a new model

    without X3.

    The regression equation is

    Y = 179 - 22.2 X4 + 3.56 X2 + 2.11 X1

    Predictor Coef SE Coef T P

Constant 178.521 8.318 21.46 0.000

    X4 -22.1880 0.5286 -41.98 0.000

    X2 3.56240 0.09945 35.82 0.000

    X1 2.1055 0.6479 3.25 0.008

    S = 4.97952 R-Sq = 99.7% R-Sq(adj) = 99.6%

    Analysis of Variance

    Source DF SS MS F P

    Regression 3 89274 29758 1200.14 0.000

    Residual Error 11 273 25

    Total 14 89547

    Source DF Seq SS

    X4 1 57064

    X2 1 31948

    X1 1 262


    These residual plots also do not contradict the model assumptions.

On its own, variable X1 explains only 1% of the variation, but once X2 and X4 are included in the model, X1 is significant and also seems to cure the problems with normality and non-constant variance.
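The final model for the Sales data could be reproduced along the following lines. This is only a sketch: it assumes that Sales.txt is whitespace-delimited with columns Y, X1, X2, X3, X4 in that order and has no header row; the actual layout of the file on the course website may differ.

import numpy as np

# Assumed layout: one row per district, columns Y X1 X2 X3 X4 (whitespace-delimited).
data = np.loadtxt("Sales.txt")
Y = data[:, 0]
X1, X2, X4 = data[:, 1], data[:, 2], data[:, 4]

n = len(Y)
X = np.column_stack([np.ones(n), X4, X2, X1])   # final model: intercept, X4, X2, X1
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)   # compare with 179, -22.2, 3.56, 2.11 in the output above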

    3.7.1 F-test for the deletion of a subset of variables

Suppose the overall regression model as tested by the Analysis of Variance table is significant. We know that not all of the parameters are zero, but we may still be able to delete several variables.

We can carry out the Subset Test based on the extra sum of squares principle. We are asking if we can reduce the set of regressors
\[
X_1, X_2, \ldots, X_{p-1}
\]
to, say,
\[
X_1, X_2, \ldots, X_{q-1}
\]
(renumbering if necessary), where $q < p$, by omitting $X_q, X_{q+1}, \ldots, X_{p-1}$.

We are interested in whether the inclusion of $X_q, X_{q+1}, \ldots, X_{p-1}$ in the model provides a significant increase in the overall regression sum of squares or, equivalently, a significant decrease in the residual sum of squares.

The difference between the sums of squares is called the extra sum of squares due to $X_q, \ldots, X_{p-1}$ given $X_1, \ldots, X_{q-1}$ are already in the model and is defined by the equation


\[
\begin{aligned}
SS(X_q, \ldots, X_{p-1} \mid X_1, \ldots, X_{q-1})
&= \underbrace{SS(X_1, X_2, \ldots, X_{p-1})}_{\substack{\text{regression SS for}\\ \text{full model}}}
 - \underbrace{SS(X_1, X_2, \ldots, X_{q-1})}_{\substack{\text{regression SS for}\\ \text{reduced model}}} \\
&= \underbrace{SS_E^{(\mathrm{red})}}_{\substack{\text{residual SS under}\\ \text{reduced model}}}
 - \underbrace{SS_E^{(\mathrm{full})}}_{\substack{\text{residual SS under}\\ \text{full model}}}.
\end{aligned}
\]

Notation: Let
\[
\boldsymbol{\beta}_1^T = (\beta_0, \beta_1, \ldots, \beta_{q-1}), \qquad
\boldsymbol{\beta}_2^T = (\beta_q, \beta_{q+1}, \ldots, \beta_{p-1}),
\]
so that
\[
\boldsymbol{\beta} = \begin{pmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{pmatrix}.
\]
Similarly divide $X$ into two submatrices $X_1$ and $X_2$ so that $X = (X_1, X_2)$, where
\[
X_1 = \begin{pmatrix}
1 & x_{1,1} & \cdots & x_{q-1,1} \\
\vdots & \vdots & & \vdots \\
1 & x_{1,n} & \cdots & x_{q-1,n}
\end{pmatrix}, \qquad
X_2 = \begin{pmatrix}
x_{q,1} & \cdots & x_{p-1,1} \\
\vdots & & \vdots \\
x_{q,n} & \cdots & x_{p-1,n}
\end{pmatrix}.
\]
The full model
\[
\boldsymbol{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon} = X_1\boldsymbol{\beta}_1 + X_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}
\]
has
\[
SS_R^{(\mathrm{full})} = \boldsymbol{Y}^T H\boldsymbol{Y} - n\bar{Y}^2 = \hat{\boldsymbol{\beta}}^T X^T\boldsymbol{Y} - n\bar{Y}^2,
\qquad
SS_E^{(\mathrm{full})} = \boldsymbol{Y}^T(I - H)\boldsymbol{Y} = \boldsymbol{Y}^T\boldsymbol{Y} - \hat{\boldsymbol{\beta}}^T X^T\boldsymbol{Y}.
\]
Similarly the reduced model
\[
\boldsymbol{Y} = X_1\boldsymbol{\beta}_1 + \boldsymbol{\varepsilon}
\]
has
\[
SS_R^{(\mathrm{red})} = \hat{\boldsymbol{\beta}}_1^T X_1^T\boldsymbol{Y} - n\bar{Y}^2,
\qquad
SS_E^{(\mathrm{red})} = \boldsymbol{Y}^T\boldsymbol{Y} - \hat{\boldsymbol{\beta}}_1^T X_1^T\boldsymbol{Y}.
\]
Hence the extra sum of squares is
\[
SS_{\mathrm{extra}} = \hat{\boldsymbol{\beta}}^T X^T\boldsymbol{Y} - \hat{\boldsymbol{\beta}}_1^T X_1^T\boldsymbol{Y}.
\]
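A sketch of the subset F-test in code (an added illustration): fit the full and reduced models, take $SS_{\mathrm{extra}}$ as the drop in residual sum of squares, and compare $SS_{\mathrm{extra}}/(p-q)$ divided by $S^2$ from the full model with the $F_{p-q,\,n-p}$ distribution. The data are simulated and the split of the regressors is arbitrary.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, q = 40, 5, 3                          # full model: beta_0..beta_4; reduced: beta_0..beta_2
Z = rng.normal(size=(n, p - 1))
X_full = np.column_stack([np.ones(n), Z])   # X = (X_1, X_2)
X_red = X_full[:, :q]                       # first q columns: intercept and q-1 regressors
Y = X_full @ np.array([1.0, 2.0, -1.0, 0.0, 0.0]) + rng.normal(0, 1.0, n)

def sse(X, Y):
    """Residual sum of squares for the least squares fit of Y on X."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    r = Y - X @ beta
    return r @ r

SSE_full, SSE_red = sse(X_full, Y), sse(X_red, Y)
SS_extra = SSE_red - SSE_full               # = SSR(full) - SSR(reduced)
S2 = SSE_full / (n - p)                     # MSE from the full model

F = (SS_extra / (p - q)) / S2
p_value = stats.f.sf(F, p - q, n - p)
print(F, p_value)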


To determine whether the change in sum of squares is significant, we test the hypothesis
\[
H_0: \beta_q = \beta_{q+1} = \ldots = \beta_{p-1} = 0
\]
versus
\[
H_1: \neg H_0.
\]
It can be shown that, if $H_0$ is true,
\[
F = \frac{SS_{\mathrm{extra}}/(p-q)}{S^2} \sim F_{p-q,\,n-p}.
\]
So we reject $H_0$ at the $\alpha$ level if
\[
F > F_{\alpha;\,p-q,\,n-p}
\]
and conclude that there is sufficient evidence that some (but not necessarily all) of the extra variables $X_q, \ldots, X_{p-1}$ should be included in the model.

The ANOVA table is given by
\[
\begin{array}{lcccc}
\text{Source} & \text{d.f.} & \text{SS} & \text{MS} & \text{VR} \\[2pt]
\text{Overall regression} & p-1 & SS_R^{(\mathrm{full})} & & \\
X_1, \ldots, X_{q-1} & q-1 & SS_R^{(\mathrm{red})} & & \\
X_q, \ldots, X_{p-1} \mid X_1, \ldots, X_{q-1} & p-q & SS_{\mathrm{extra}} & \dfrac{SS_{\mathrm{extra}}}{p-q} & \dfrac{SS_{\mathrm{extra}}}{(p-q)MS_E} \\[6pt]
\text{Residual} & n-p & SS_E & MS_E & \\
\text{Total} & n-1 & SS_T & &
\end{array}
\]
In the ANOVA table we use the notation $X_q, \ldots, X_{p-1} \mid X_1, \ldots, X_{q-1}$ to denote that this is the effect of the variables $X_q, \ldots, X_{p-1}$ given that the variables $X_1, \ldots, X_{q-1}$ are already included in the model.

Note that, as the $F_{1,\nu}$ distribution is equivalent to $t^2_\nu$, the F-test for $H_0: \beta_{p-1} = 0$, that is for the inclusion of a single variable $X_{p-1}$ (this is the case $q = p-1$), can also be performed by an equivalent T-test, where
\[
T = \frac{\hat{\beta}_{p-1}}{se(\hat{\beta}_{p-1})} \sim t_{n-p},
\]
where $se(\hat{\beta}_{p-1})$ is the estimated standard error of $\hat{\beta}_{p-1}$.

Also, note that we can repeatedly test individual parameters and we get the following sums of squares and degrees of freedom


\[
\begin{array}{lcc}
\text{Source of variation} & \text{df} & \text{SS} \\[2pt]
\text{Full model} & p-1 & SS_R \\
\quad X_1 & 1 & SS(1) \\
\quad X_2 \mid X_1 & 1 & SS(2 \mid 1) \\
\quad X_3 \mid X_1, X_2 & 1 & SS(3 \mid 1, 2) \\
\quad \vdots & \vdots & \vdots \\
\quad X_{p-1} \mid X_1, \ldots, X_{p-2} & 1 & SS(p-1 \mid 1, \ldots, p-2) \\
\text{Residual} & n-p & SS_E \\
\text{Total} & n-1 & SS_T
\end{array}
\]

The output depends on the order in which the predictors are entered into the model. The sequential sum of squares is the unique portion of $SS_R$ explained by a predictor, given any previously entered predictors. If you have a model with three predictors, $X_1$, $X_2$ and $X_3$, the sequential sum of squares for $X_3$ shows how much of the remaining variation $X_3$ explains given that $X_1$ and $X_2$ are already in the model.
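The order dependence is easy to see numerically: entering the same predictors in two different orders gives different sequential sums of squares for each predictor, although they add up to the same $SS_R$. A sketch with simulated, correlated predictors (an added illustration, not part of the original notes):

import numpy as np

rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)   # correlated with x1
y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

def ssr(cols):
    """Regression sum of squares for a model with an intercept and the given columns."""
    X = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return np.sum((fitted - y.mean())**2)

# Sequential SS entering x1 first, then x2
seq_a = [ssr([x1]), ssr([x1, x2]) - ssr([x1])]
# Sequential SS entering x2 first, then x1
seq_b = [ssr([x2]), ssr([x2, x1]) - ssr([x2])]

print(seq_a, sum(seq_a))    # different individual Seq SS ...
print(seq_b, sum(seq_b))    # ... but the same total SSR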