Advanced topics in regression
Tron Anders Moger
18.10.2006
Last time:
• Had the model: death rate per 1000 = a + b*car age + c*prop. light trucks
Model Summary
Model 1: R = .768, R Square = .590, Adjusted R Square = .572, Std. Error of the Estimate = .03871
Predictors: (Constant), lghttrks, carage

ANOVA
Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression    .097             2    .049          32.402   .000
Residual      .067             45   .001
Total         .165             47
Predictors: (Constant), lghttrks, carage
Dependent Variable: deaths
Coefficients
Model 1      B       Std. Error   Beta    t        Sig.   95% Confidence Interval for B
(Constant)   2.668   .895                 2.981    .005   (.865, 4.470)
carage       -.037   .013         -.295   -2.930   .005   (-.063, -.012)
lghttrks     .006    .001         .622    6.181    .000   (.004, .009)
Dependent Variable: deaths
Pearson's r = √R²
R² = 1 - SSE/SST
Adj. R² = 1 - (SSE/(n-K-1))/(SST/(n-1))
d.f.(SSR) = K, d.f.(SSE) = n-K-1, d.f.(SST) = n-1
MSE = s_e² = σ̂² = SSE/(n-K-1)
MSR = SSR/K
F test statistic = MSR/MSE, with P-value for the test of all β's = 0 vs. at least one β ≠ 0
t test statistic = β̂/SE(β̂), with P-value for the test of β = 0 vs. β ≠ 0, and 95% CI for β
K = number of independent variables
Why did we remove car weight and percentage imported cars from the model?
• They did not show a significant relationship with the dependent variable (β not different from 0)
• Unless independent variables are completely uncorrelated, you will get different b's when including several variables in your model compared to just one variable (collinearity)
• Hence, we would like to remove variables that have nothing to do with the dependent variable, but still influence the effect of important independent variables
Relationship R2 and b
• Which result would make you most happy?
High R2, low b (with narrow CI)
Low R2, high b (with wide CI)
Centered variables
• Remember, we found the model: Birth weight = 2369.672 + 4.429*mother's weight
• Hence, the constant has no interpretation
• Construct mother's weight 2 = mother's weight - mean(mother's weight)
• Get the model: Birth weight = 2944.656 + 4.429*mother's weight 2
• The constant is now the predicted birth weight for a 130-lb mother
Coefficients
Model 1      B          Std. Error   Beta   t        Sig.   95% Confidence Interval for B
(Constant)   2944.656   52.244              56.363   .000   (2841.592, 3047.720)
lwt2         4.429      1.713        .186   2.586    .010   (1.050, 7.809)
Dependent Variable: birthweight
(mean mother's weight = 130 lbs)
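A small sketch of the centering idea (simulated data, not the textbook's; all names are ours): centering a predictor leaves the slope untouched and only shifts the intercept, which becomes the predicted response at the predictor's mean.

```python
import numpy as np

# Simulated mothers' weights (lbs) and birth weights; illustrative only.
rng = np.random.default_rng(0)
weight = rng.uniform(80, 250, size=189)
bwt = 2369.672 + 4.429 * weight + rng.normal(0, 700, size=189)

def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    X = np.column_stack([np.ones_like(x), x])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    return a, b

a_raw, b_raw = fit_line(weight, bwt)                  # raw predictor
a_cen, b_cen = fit_line(weight - weight.mean(), bwt)  # centered predictor

# The slopes agree, and the centered intercept equals the prediction
# at the mean mother's weight.
print(np.isclose(b_raw, b_cen))
print(np.isclose(a_cen, a_raw + b_raw * weight.mean()))
```

This is why the constant 2944.656 above can be read directly as the predicted birth weight for a mother of average weight.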
Indicator variables
• Binary variables (yes/no, male/female, …) can be represented as 1/0, and used as independent variables.
• Also called dummy variables in the book.
• When used directly, they influence only the constant term of the regression.
• It is also possible to use a binary variable so that it changes both the constant term and the slope of the regression.
Example: Regression of birth weight with mother's weight and smoking status as independent variables

Model Summary
Model 1: R = .259, R Square = .067, Adjusted R Square = .057, Std. Error of the Estimate = 707.83567
Predictors: (Constant), smoking status, weight in pounds
Dependent Variable: birthweight
ANOVA
Model 1       Sum of Squares   df    Mean Square   F       Sig.
Regression    6725224          2     3362612.165   6.711   .002
Residual      93191828         186   501031.335
Total         99917053         188
Predictors: (Constant), smoking status, weight in pounds
Dependent Variable: birthweight
Coefficients
Model 1            B          Std. Error   Beta    t        Sig.   95% Confidence Interval for B
(Constant)         2500.174   230.833              10.831   .000   (2044.787, 2955.561)
weight in pounds   4.238      1.690        .178    2.508    .013   (.905, 7.572)
smoking status     -270.013   105.590      -.181   -2.557   .011   (-478.321, -61.705)
Dependent Variable: birthweight
Interpretation:
• We have fitted the model: Birth weight = 2500.174 + 4.238*mother's weight - 270.013*smoking status
• If the mother starts to smoke (and her weight remains constant), what is the predicted influence on the infant's birth weight? -270.013*1 = -270 grams
• What is the predicted weight of the child of a 150-pound, smoking woman? 2500.174 + 4.238*150 - 270.013*1 = 2866 grams
Will R2 automatically be low for indicator variables?

[Figure: scatterplot of the dependent variable against an indicator variable taking only the values 0 and 1]
What if a categorical variable has more than two values?
• Example: Ethnicity; black, white, other
• For categorical variables with m possible values, use m-1 indicators.
• Important: A model with two indicator variables will assume that the effect of one indicator adds to the effect of the other.
• If this is unsuitable, use an additional interaction variable (product of indicators).
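The m-1 coding can be sketched as follows (three levels, so two indicators; "white" is taken as the reference level, matching the example below):

```python
def dummies(ethnicity):
    """Code a 3-level ethnicity as (black, other); white is the reference."""
    return (1 if ethnicity == "black" else 0,
            1 if ethnicity == "other" else 0)

# The reference level gets (0, 0); each other level turns on one indicator.
for e in ("white", "black", "other"):
    print(e, dummies(e))
```

With this coding, the constant is the prediction for the reference group, and each coefficient is that group's difference from the reference.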
Model birth weight as a function of ethnicity
• Have constructed the variables black = 0 or 1 and other = 0 or 1
• Model: Birth weight = a + b*black + c*other
• Get
• Hence, predicted birth weight decreases by 384 grams for blacks and 299 grams for others
• Predicted birth weight for whites is 3104 grams
Coefficients
Model 1      B          Std. Error   Beta    t        Sig.   95% Confidence Interval for B
(Constant)   3103.740   72.882               42.586   .000   (2959.959, 3247.521)
black        -384.047   157.874      -.182   -2.433   .016   (-695.502, -72.593)
other        -299.725   113.678      -.197   -2.637   .009   (-523.988, -75.462)
Dependent Variable: birthweight
Interaction:
• Sometimes the effect (on y) of one independent variable (x1) depends on the value of another independent variable (x2)
• This means that you get, e.g., different slopes for x1 for different values of x2
• Usually modelled by constructing the product of the two variables, and including it in the model
• Example: bwt = a + b*mwt + c*smoking + d*mwt*smoking = a + (b + d*smoking)*mwt + c*smoking
Get SPSS to do the estimation:
• Get bwt = 2347 + 5.41*mwt + 47.87*smoking - 2.46*mwt*smoking
• Mwt = 100 lbs vs. mwt = 200 lbs for non-smokers: bwt = 2888 g and bwt = 3428 g, difference = 540 g
• Mwt = 100 lbs vs. mwt = 200 lbs for smokers: bwt = 2690 g and bwt = 2985 g, difference = 295 g
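The slide's arithmetic can be reproduced directly from the coefficients (the function name is ours); note how the interaction term gives a different mwt slope for each smoking status:

```python
def predict(mwt, smoking):
    """Interaction model: slope of mwt is 5.405 - 2.456*smoking."""
    return 2347.507 + 5.405 * mwt + 47.867 * smoking - 2.456 * mwt * smoking

# 100 lbs -> 200 lbs, by smoking status:
diff_nonsmokers = predict(200, 0) - predict(100, 0)  # 100 * 5.405
diff_smokers = predict(200, 1) - predict(100, 1)     # 100 * (5.405 - 2.456)
print(round(diff_nonsmokers, 1))   # ~540.5 g
print(round(diff_smokers, 1))      # ~294.9 g
```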
Coefficients
Model 1            B          Std. Error   Beta    t       Sig.   95% Confidence Interval for B
(Constant)         2347.507   312.717              7.507   .000   (1730.557, 2964.457)
weight in pounds   5.405      2.335        .227    2.315   .022   (.798, 10.012)
smoking status     47.867     451.163      .032    .106    .916   (-842.220, 937.953)
smkwht             -2.456     3.388        -.223   -.725   .470   (-9.140, 4.229)
Dependent Variable: birthweight
What does this mean?
• Mother's weight has a greater impact on birth weight for non-smokers than for smokers (or the other way round)

[Scatterplot: birthweight vs. weight in pounds, with separate fit lines by smoking status (.00 and 1.00); R Sq Linear = 0.023 and 0.042]
What does this mean, cont'd?
• We see that the slope is steeper for non-smokers
• In fact, a model with mwt and mwt*smoking fits better than the model with mwt and smoking:
Model Summary
Model 1: R = .264, R Square = .070, Adjusted R Square = .060, Std. Error of the Estimate = 706.85442
Predictors: (Constant), smkwht, weight in pounds

Coefficients
Model 1            B          Std. Error   Beta    t        Sig.   95% Confidence Interval for B
(Constant)         2370.504   224.809              10.545   .000   (1927.000, 2814.007)
weight in pounds   5.237      1.713        .220    3.057    .003   (1.857, 8.616)
smkwht             -2.106     .792         -.191   -2.660   .009   (-3.668, -.544)
Dependent Variable: birthweight
Should you always look at all possible interactions?
• No.
• The example shows an interaction between an indicator and a continuous variable, which is fairly easy to interpret
• Interaction between two continuous variables: slightly more complicated
• Interaction between three or more variables: difficult to interpret
• It doesn't matter if you have a good model, if you can't interpret it
• Often you are interested in the interactions you think are there before you do the study
Multicollinearity
• Means that two or more independent variables are closely correlated
• To discover it, make plots and compute correlations (or make a regression of one parameter on the others)
• To deal with it:
  – Remove unnecessary variables
  – Define and compute an "index"
  – If variables are kept, the model could still be used for prediction
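Both detection approaches from the slide can be sketched on simulated data (all names and numbers are ours): compute the pairwise correlation, and regress one predictor on the others to see how much of it they explain.

```python
import numpy as np

# Simulated predictors: x2 is nearly collinear with x1, x3 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)

# 1) Pairwise correlations flag the x1-x2 pair.
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(corr[0, 1] > 0.9)

# 2) Regress x2 on the other predictors; R^2 near 1 is a collinearity warning.
A = np.column_stack([np.ones(200), x1, x3])
resid = x2 - A @ np.linalg.lstsq(A, x2, rcond=None)[0]
r2 = 1 - resid.var() / x2.var()
print(r2 > 0.9)
```

The second check is the more general one: it also catches a predictor that is a linear combination of several others, which no single pairwise correlation would reveal.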
Example: Traffic deaths
• Recall: Used four variables to predict traffic deaths in the U.S.
• Among them: Average car weight and prop. imported cars
• However, the correlation between these two variables is pretty high
Correlation car weight vs. imp. cars
• Pearson r is 0.94:
• Problematic to use both of these as independents in a regression

[Scatterplot: vehwt vs. impcars]
Correlations
                                carage   impcars
carage    Pearson Correlation   1        .011
          Sig. (2-tailed)                .943
          N                     49       49
impcars   Pearson Correlation   .011     1
          Sig. (2-tailed)       .943
          N                     49       49
Choice of variables
• Include variables which you believe have a clear influence on the dependent variable, even if the variable is "uninteresting": This helps find the true relationship between "interesting" variables and the dependent.
• Avoid including a pair (or a set) of variables whose values are clearly linearly related

Choice of values
• Should have a good spread: Again, avoid collinearity
• Should cover the range for which the model will be used
• For categorical variables, one may choose to combine levels in a systematic way.
Specification bias
• Unless two independent variables are uncorrelated, the estimation of one will influence the estimation of the other
• Not including one variable will bias the estimation of the other
• Thus, one should be humble when interpreting regression results: There are probably always variables one could have added
Heteroscedasticity – what is it?
• In the standard regression model
  y_i = β0 + β1*x_1i + β2*x_2i + … + βK*x_Ki + ε_i
  it is assumed that all ε_i have the same variance.
• If the variance varies with the independent variables or the dependent variable, the model is heteroscedastic.
• Sometimes, it is clear that data exhibit such properties.
Heteroscedasticity – why does it matter?
• Our standard methods for estimation, confidence intervals, and hypothesis testing assume equal variances.
• If we go on and use these methods anyway, our answers might be quite wrong!
Heteroscedasticity – how to detect it?
• Fit a regression model, and study the residuals
  – make a plot of them against the independent variables
  – make a plot of them against the predicted values for the dependent variable
• Possibility: Test for heteroscedasticity by doing a regression of the squared residuals on the predicted values.
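The squared-residuals check from the last bullet can be sketched on simulated data (names and numbers are ours) where the error standard deviation grows with x:

```python
import numpy as np

# Simulated heteroscedastic data: error sd is proportional to x.
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=300)
y = 2 + 3 * x + rng.normal(0, x, size=300)

# Fit the regression and compute squared residuals.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta
e2 = (y - fitted) ** 2

# Regress e^2 on the fitted values; a clearly positive slope suggests
# the residual variance grows with the predictions.
Z = np.column_stack([np.ones_like(fitted), fitted])
slope = np.linalg.lstsq(Z, e2, rcond=None)[0][1]
print(slope > 0)
```

In practice one would also test whether that slope is significantly different from zero (this is the idea behind Breusch-Pagan-type tests), not just look at its sign.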
Example: The model traffic deaths = a + b*car age + c*light trucks
• Does not look too bad

[Scatterplot: regression standardized residuals vs. regression standardized predicted values; dependent variable: deaths]
What is bad?
• This: and this:

[Two sketches of residuals ε against predicted values ŷ, each showing a residual spread that changes with ŷ]
Heteroscedasticity – what to do about it?
• Using a transformation of the dependent variable
  – log-linear models
• If the standard deviation of the errors appears to be proportional to the predicted values, a two-stage regression analysis is a possibility
Dependence over time
• Sometimes, y_1, y_2, …, y_n are not completely independent observations (given the independent variables).
  – Lagged values: y_i may depend on y_{i-1} in addition to its independent variables
  – Autocorrelated errors: The residuals ε_i are correlated
• Often relevant for time-series data
Lagged values
• In this case, we may run a multiple regression just as before, but including the previous value of the dependent variable, y_{t-1}, as a predictor variable for y_t.
• Use the model y_t = β0 + β1*x_1t + γ*y_{t-1} + ε_t
• A 1-unit increase in x_1 in the first time period yields an expected increase in y of β1, an increase of β1γ in the second period, β1γ² in the third period, and so on
• The total expected increase over all future periods is β1/(1-γ)
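The geometric decay of the effect can be sketched with illustrative coefficients (β1 = 1, γ = 0.5; the function name is ours):

```python
def cumulative_effect(b1, g, periods):
    """Sum of the expected increases b1, b1*g, b1*g^2, ... over `periods`."""
    return sum(b1 * g ** k for k in range(periods))

b1, g = 1.0, 0.5
print(cumulative_effect(b1, g, 1))    # immediate effect: b1
print(cumulative_effect(b1, g, 50))   # approaches b1/(1-g) = 2.0
```

This is just the geometric series b1*(1 + g + g² + …) = b1/(1-g), which converges as long as |γ| < 1.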
Example: Pension funds from textbook CD
• Want to use the market return for stocks (say, in million $) as a predictor for the percentage of pension fund portfolios at market value (y) at the end of the year
• Have data for 25 yrs -> 24 observations

Model Summary
Model 1: R = .980, R Square = .961, Adjusted R Square = .957, Std. Error of the Estimate = 2.288, Durbin-Watson = 1.008
Predictors: (Constant), lag, return
Dependent Variable: stocks

Coefficients
Model 1      B       Std. Error   Beta    t        Sig.   95% Confidence Interval for B
(Constant)   1.397   2.359                .592     .560   (-3.509, 6.303)
return       .235    .030         .359    7.836    .000   (.172, .297)
lag          .954    .042         1.041   22.690   .000   (.867, 1.042)
Dependent Variable: stocks
Get the model:
• y_t = 1.397 + 0.235*stock return + 0.954*y_{t-1}
• A one million $ increase in stock return one year yields a 0.24% increase in pension fund portfolios at market value
• For the next year: 0.235*0.954 = 0.22%
• And the third year: 0.235*0.954² = 0.21%
• For all future: 0.235/(1-0.954) = 5.1%
• What if you have a 2 million $ increase?
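The slide's numbers follow directly from the fitted coefficients:

```python
# Fitted model: y_t = 1.397 + 0.235*return + 0.954*y_{t-1}
b1, g = 0.235, 0.954

# Effect of a one million $ increase in year 1, 2, 3:
effects = [b1 * g ** k for k in range(3)]
print([round(e, 3) for e in effects])   # 0.235, 0.224, 0.214 (percent)

# Total effect over all future periods:
print(round(b1 / (1 - g), 1))           # 5.1 (percent)
```

A 2 million $ increase simply doubles every number above, since the model is linear in the return.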
Autocorrelated errors
• In the standard regression model, the errors are independent.
• Using standard regression formulas anyway can lead to errors: Typically, the uncertainty in the result is underestimated.
Autocorrelation – how to detect?
• Plotting residuals against time!
• The Durbin-Watson test compares the possibility of independent errors with a first-order autoregressive model:
  ε_t = ρ*ε_{t-1} + u_t
• Test statistic:
  d = Σ_{t=2}^{n} (e_t - e_{t-1})² / Σ_{t=1}^{n} e_t²
• Option in SPSS
• The test depends on K (no. of independent variables), n (no. of observations) and the significance level α
• Test H0: ρ = 0 vs. H1: ρ > 0
• Reject H0 if d < dL; accept H0 if d > dU; inconclusive if dL < d < dU
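The test statistic is easy to compute from a residual series (names are ours; simulated residuals): independent residuals give d near 2, while positive autocorrelation pushes d towards 0.

```python
import numpy as np

def durbin_watson(e):
    """d = sum_{t=2}^n (e_t - e_{t-1})^2 / sum_{t=1}^n e_t^2."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)

# Independent residuals: d should land close to 2.
iid = rng.normal(size=500)

# First-order autoregressive residuals with rho = 0.8: d well below 2.
ar = np.empty(500)
ar[0] = rng.normal()
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()

print(round(durbin_watson(iid), 1))   # close to 2
print(durbin_watson(ar) < 1.0)        # strong positive autocorrelation
```

Roughly, d ≈ 2(1 - ρ̂), which is why d near 2 indicates no autocorrelation and d near 0 indicates strong positive autocorrelation.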
Example: Pension funds

Model Summary
Model 1: R = .980, R Square = .961, Adjusted R Square = .957, Std. Error of the Estimate = 2.288, Durbin-Watson = 1.008
Predictors: (Constant), lag, return
Dependent Variable: stocks

• Want to test ρ = 0 on the 5% level
• Test statistic d = 1.008
• Have one independent variable (K = 1 in table 12 on p. 876) and n = 24
• Find critical values dL = 1.27 and dU = 1.45
• Reject H0
Autocorrelation – what to do?
• It is possible to use a two-stage regression procedure:
  – If a first-order autoregressive model ε_t = ρ*ε_{t-1} + u_t is appropriate, the model
    y_t - ρ*y_{t-1} = β0(1-ρ) + β1(x_1t - ρ*x_{1,t-1}) + … + βK(x_Kt - ρ*x_{K,t-1}) + u_t
    will have uncorrelated errors
• Estimate ρ from the Durbin-Watson statistic, and estimate the β's from the model above
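A minimal sketch of the two-stage idea (the helper name is ours): estimate ρ from the Durbin-Watson statistic via the approximation ρ̂ ≈ 1 - d/2, then quasi-difference the variables before refitting.

```python
import numpy as np

def quasi_difference(z, rho):
    """Return z_t - rho*z_{t-1} for t = 2..n (one observation is lost)."""
    z = np.asarray(z, dtype=float)
    return z[1:] - rho * z[:-1]

# Durbin-Watson statistic from the pension fund example:
d = 1.008
rho_hat = 1 - d / 2
print(round(rho_hat, 3))   # 0.496

# Transform y (and each x) the same way, then run an ordinary regression
# of the transformed y on the transformed x's.
y = np.array([1.0, 2.0, 3.5, 4.0])   # illustrative values only
print(quasi_difference(y, rho_hat).shape)
```

Note that the intercept of the transformed regression estimates β0(1-ρ), not β0, so it must be rescaled afterwards.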
Next time:
• What if the assumption of normality for your data is invalid?
• You have to forget all you have learnt so far, and do something else
• Non-parametric statistics