Advanced topics in regression
Tron Anders Moger
18.10.2006
Last time:
• Had the model: death rate per 1000 = a + b*car age + c*prop. light trucks
Model Summary
Model 1: R = .768, R Square = .590, Adjusted R Square = .572, Std. Error of the Estimate = .03871
Predictors: (Constant), lghttrks, carage

ANOVA
Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression    .097             2    .049          32.402   .000
Residual      .067             45   .001
Total         .165             47
Predictors: (Constant), lghttrks, carage
Dependent Variable: deaths
Coefficients
Model 1      B       Std. Error   Beta    t        Sig.   95% Confidence Interval for B
(Constant)   2.668   .895                 2.981    .005   (.865, 4.470)
carage       -.037   .013         -.295   -2.930   .005   (-.063, -.012)
lghttrks     .006    .001         .622    6.181    .000   (.004, .009)
Dependent Variable: deaths
Pearson's r = √R²
R² = 1 - SSE/SST
Adj. R² = 1 - (SSE/(n-K-1))/(SST/(n-1))
d.f.(SSR) = K, d.f.(SSE) = n-K-1, d.f.(SST) = n-1
MSE = s_e² = σ̂² = SSE/(n-K-1)
MSR = SSR/K
F test statistic = MSR/MSE, with P-value for the test of all β's = 0 vs. at least one β ≠ 0
t test statistic = β̂/SE(β̂), with P-value for the test of β = 0 vs. β ≠ 0, and 95% CI for β
K = number of independent variables
Why did we remove car weight and percentage imported cars from the model?
• They did not show a significant relationship with the dependent variable (β not different from 0)
• Unless independent variables are completely uncorrelated, you will get different b's when including several variables in your model compared to just one variable (collinearity)
• Hence, we would like to remove variables that have nothing to do with the dependent variable, but still influence the effect of important independent variables
Relationship R2 and b
• Which result would make you most happy?
High R2, low b (with narrow CI)
Low R2, high b (with wide CI)
Centered variables
• Remember, we found the model: Birth weight = 2369.672 + 4.429*mother's weight
• Hence, the constant has no interpretation
• Construct mother's weight 2 = mother's weight - mean(mother's weight)
• Get the model: Birth weight = 2944.656 + 4.429*mother's weight 2
• The constant is now the predicted birth weight for a 130-lb mother
Coefficients
Model 1      B          Std. Error   Beta   t        Sig.   95% Confidence Interval for B
(Constant)   2944.656   52.244              56.363   .000   (2841.592, 3047.720)
lwt2         4.429      1.713        .186   2.586    .010   (1.050, 7.809)
Dependent Variable: birthweight
(mean mother's weight = 130 lbs)
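A small sketch of the centering idea (simulated data, not the textbook's; all names are ours): centering a predictor leaves the slope untouched and only shifts the intercept, which becomes the predicted response at the predictor's mean.

```python
import numpy as np

# Simulated mothers' weights (lbs) and birth weights; illustrative only.
rng = np.random.default_rng(0)
weight = rng.uniform(80, 250, size=189)
bwt = 2369.672 + 4.429 * weight + rng.normal(0, 700, size=189)

def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    X = np.column_stack([np.ones_like(x), x])
    a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    return a, b

a_raw, b_raw = fit_line(weight, bwt)                  # raw predictor
a_cen, b_cen = fit_line(weight - weight.mean(), bwt)  # centered predictor

# The slopes agree, and the centered intercept equals the prediction
# at the mean mother's weight.
print(np.isclose(b_raw, b_cen))
print(np.isclose(a_cen, a_raw + b_raw * weight.mean()))
```

This is why the constant 2944.656 above can be read directly as the predicted birth weight for a mother of average weight.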
Indicator variables
• Binary variables (yes/no, male/female, …) can be represented as 1/0, and used as independent variables.
• Also called dummy variables in the book.
• When used directly, they influence only the constant term of the regression.
• It is also possible to use a binary variable so that it changes both the constant term and the slope of the regression.
Example: Regression of birth weight with mother's weight and smoking status as independent variables

Model Summary
Model 1: R = .259, R Square = .067, Adjusted R Square = .057, Std. Error of the Estimate = 707.83567
Predictors: (Constant), smoking status, weight in pounds
Dependent Variable: birthweight
ANOVA
Model 1       Sum of Squares   df    Mean Square   F       Sig.
Regression    6725224          2     3362612.165   6.711   .002
Residual      93191828         186   501031.335
Total         99917053         188
Predictors: (Constant), smoking status, weight in pounds
Dependent Variable: birthweight
Coefficients
Model 1            B          Std. Error   Beta    t        Sig.   95% Confidence Interval for B
(Constant)         2500.174   230.833              10.831   .000   (2044.787, 2955.561)
weight in pounds   4.238      1.690        .178    2.508    .013   (.905, 7.572)
smoking status     -270.013   105.590      -.181   -2.557   .011   (-478.321, -61.705)
Dependent Variable: birthweight
Interpretation:
• We have fitted the model: Birth weight = 2500.174 + 4.238*mother's weight - 270.013*smoking status
• If the mother starts to smoke (and her weight remains constant), what is the predicted influence on the infant's birth weight? -270.013*1 = -270 grams
• What is the predicted weight of the child of a 150-pound, smoking woman? 2500.174 + 4.238*150 - 270.013*1 = 2866 grams
Will R2 automatically be low for indicator variables?

[Figure: scatterplot of the dependent variable against an indicator variable taking only the values 0 and 1]
What if a categorical variable has more than two values?
• Example: Ethnicity; black, white, other
• For categorical variables with m possible values, use m-1 indicators.
• Important: A model with two indicator variables will assume that the effect of one indicator adds to the effect of the other.
• If this is unsuitable, use an additional interaction variable (product of indicators).
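The m-1 coding can be sketched as follows (three levels, so two indicators; "white" is taken as the reference level, matching the example below):

```python
def dummies(ethnicity):
    """Code a 3-level ethnicity as (black, other); white is the reference."""
    return (1 if ethnicity == "black" else 0,
            1 if ethnicity == "other" else 0)

# The reference level gets (0, 0); each other level turns on one indicator.
for e in ("white", "black", "other"):
    print(e, dummies(e))
```

With this coding, the constant is the prediction for the reference group, and each coefficient is that group's difference from the reference.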
Model birth weight as a function of ethnicity
• Have constructed the variables black = 0 or 1 and other = 0 or 1
• Model: Birth weight = a + b*black + c*other
• Get
• Hence, predicted birth weight decreases by 384 grams for blacks and 299 grams for others
• Predicted birth weight for whites is 3104 grams
Coefficients
Model 1      B          Std. Error   Beta    t        Sig.   95% Confidence Interval for B
(Constant)   3103.740   72.882               42.586   .000   (2959.959, 3247.521)
black        -384.047   157.874      -.182   -2.433   .016   (-695.502, -72.593)
other        -299.725   113.678      -.197   -2.637   .009   (-523.988, -75.462)
Dependent Variable: birthweight
Interaction:
• Sometimes the effect (on y) of one independent variable (x1) depends on the value of another independent variable (x2)
• This means that you get, e.g., different slopes for x1 for different values of x2
• Usually modelled by constructing the product of the two variables, and including it in the model
• Example: bwt = a + b*mwt + c*smoking + d*mwt*smoking = a + (b + d*smoking)*mwt + c*smoking
Get SPSS to do the estimation:
• Get bwt = 2347 + 5.41*mwt + 47.87*smoking - 2.46*mwt*smoking
• Mwt = 100 lbs vs. mwt = 200 lbs for non-smokers: bwt = 2888 g and bwt = 3428 g, difference = 540 g
• Mwt = 100 lbs vs. mwt = 200 lbs for smokers: bwt = 2690 g and bwt = 2985 g, difference = 295 g
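The slide's arithmetic can be reproduced directly from the coefficients (the function name is ours); note how the interaction term gives a different mwt slope for each smoking status:

```python
def predict(mwt, smoking):
    """Interaction model: slope of mwt is 5.405 - 2.456*smoking."""
    return 2347.507 + 5.405 * mwt + 47.867 * smoking - 2.456 * mwt * smoking

# 100 lbs -> 200 lbs, by smoking status:
diff_nonsmokers = predict(200, 0) - predict(100, 0)  # 100 * 5.405
diff_smokers = predict(200, 1) - predict(100, 1)     # 100 * (5.405 - 2.456)
print(round(diff_nonsmokers, 1))   # ~540.5 g
print(round(diff_smokers, 1))      # ~294.9 g
```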
Coefficients
Model 1            B          Std. Error   Beta    t       Sig.   95% Confidence Interval for B
(Constant)         2347.507   312.717              7.507   .000   (1730.557, 2964.457)
weight in pounds   5.405      2.335        .227    2.315   .022   (.798, 10.012)
smoking status     47.867     451.163      .032    .106    .916   (-842.220, 937.953)
smkwht             -2.456     3.388        -.223   -.725   .470   (-9.140, 4.229)
Dependent Variable: birthweight
What does this mean?
• Mother's weight has a greater impact on birth weight for non-smokers than for smokers (or the other way round)

[Scatterplot: birthweight vs. weight in pounds, with separate fit lines by smoking status (.00 and 1.00); R Sq Linear = 0.023 and 0.042]
What does this mean, cont'd?
• We see that the slope is steeper for non-smokers
• In fact, a model with mwt and mwt*smoking fits better than the model with mwt and smoking:
Model Summary
Model 1: R = .264, R Square = .070, Adjusted R Square = .060, Std. Error of the Estimate = 706.85442
Predictors: (Constant), smkwht, weight in pounds

Coefficients
Model 1            B          Std. Error   Beta    t        Sig.   95% Confidence Interval for B
(Constant)         2370.504   224.809              10.545   .000   (1927.000, 2814.007)
weight in pounds   5.237      1.713        .220    3.057    .003   (1.857, 8.616)
smkwht             -2.106     .792         -.191   -2.660   .009   (-3.668, -.544)
Dependent Variable: birthweight
Should you always look at all possible interactions?
• No.
• The example shows an interaction between an indicator and a continuous variable, which is fairly easy to interpret
• Interaction between two continuous variables: slightly more complicated
• Interaction between three or more variables: difficult to interpret
• It doesn't matter if you have a good model, if you can't interpret it
• Often you are interested in the interactions you think are there before you do the study
Multicollinearity
• Means that two or more independent variables are closely correlated
• To discover it, make plots and compute correlations (or make a regression of one parameter on the others)
• To deal with it:
  – Remove unnecessary variables
  – Define and compute an "index"
  – If variables are kept, the model could still be used for prediction
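Both detection approaches from the slide can be sketched on simulated data (all names and numbers are ours): compute the pairwise correlation, and regress one predictor on the others to see how much of it they explain.

```python
import numpy as np

# Simulated predictors: x2 is nearly collinear with x1, x3 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)

# 1) Pairwise correlations flag the x1-x2 pair.
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(corr[0, 1] > 0.9)

# 2) Regress x2 on the other predictors; R^2 near 1 is a collinearity warning.
A = np.column_stack([np.ones(200), x1, x3])
resid = x2 - A @ np.linalg.lstsq(A, x2, rcond=None)[0]
r2 = 1 - resid.var() / x2.var()
print(r2 > 0.9)
```

The second check is the more general one: it also catches a predictor that is a linear combination of several others, which no single pairwise correlation would reveal.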
Example: Traffic deaths
• Recall: Used four variables to predict traffic deaths in the U.S.
• Among them: Average car weight and prop. imported cars
• However, the correlation between these two variables is pretty high
Correlation car weight vs. imp. cars
• Pearson r is 0.94:
• Problematic to use both of these as independents in a regression

[Scatterplot: vehwt vs. impcars]
Correlations
                                carage   impcars
carage    Pearson Correlation   1        .011
          Sig. (2-tailed)                .943
          N                     49       49
impcars   Pearson Correlation   .011     1
          Sig. (2-tailed)       .943
          N                     49       49
Choice of variables
• Include variables which you believe have a clear influence on the dependent variable, even if the variable is "uninteresting": This helps find the true relationship between "interesting" variables and the dependent.
• Avoid including a pair (or a set) of variables whose values are clearly linearly related

Choice of values
• Should have a good spread: Again, avoid collinearity
• Should cover the range for which the model will be used
• For categorical variables, one may choose to combine levels in a systematic way.
Specification bias
• Unless two independent variables are uncorrelated, the estimation of one will influence the estimation of the other
• Not including one variable will bias the estimation of the other
• Thus, one should be humble when interpreting regression results: There are probably always variables one could have added
Heteroscedasticity – what is it?
• In the standard regression model
  y_i = β0 + β1*x_1i + β2*x_2i + … + βK*x_Ki + ε_i
  it is assumed that all ε_i have the same variance.
• If the variance varies with the independent variables or the dependent variable, the model is heteroscedastic.
• Sometimes, it is clear that data exhibit such properties.
Heteroscedasticity – why does it matter?
• Our standard methods for estimation, confidence intervals, and hypothesis testing assume equal variances.
• If we go on and use these methods anyway, our answers might be quite wrong!
Heteroscedasticity – how to detect it?
• Fit a regression model, and study the residuals
  – make a plot of them against the independent variables
  – make a plot of them against the predicted values for the dependent variable
• Possibility: Test for heteroscedasticity by doing a regression of the squared residuals on the predicted values.
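The squared-residuals check from the last bullet can be sketched on simulated data (names and numbers are ours) where the error standard deviation grows with x:

```python
import numpy as np

# Simulated heteroscedastic data: error sd is proportional to x.
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=300)
y = 2 + 3 * x + rng.normal(0, x, size=300)

# Fit the regression and compute squared residuals.
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta
e2 = (y - fitted) ** 2

# Regress e^2 on the fitted values; a clearly positive slope suggests
# the residual variance grows with the predictions.
Z = np.column_stack([np.ones_like(fitted), fitted])
slope = np.linalg.lstsq(Z, e2, rcond=None)[0][1]
print(slope > 0)
```

In practice one would also test whether that slope is significantly different from zero (this is the idea behind Breusch-Pagan-type tests), not just look at its sign.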
Example: The model traffic deaths = a + b*car age + c*light trucks
• Does not look too bad

[Scatterplot: regression standardized residuals vs. regression standardized predicted values; dependent variable: deaths]
What is bad?
• This: and this:

[Two sketches of residuals ε against predicted values ŷ, each showing a residual spread that changes with ŷ]
Heteroscedasticity – what to do about it?
• Using a transformation of the dependent variable
  – log-linear models
• If the standard deviation of the errors appears to be proportional to the predicted values, a two-stage regression analysis is a possibility
Dependence over time
• Sometimes, y_1, y_2, …, y_n are not completely independent observations (given the independent variables).
  – Lagged values: y_i may depend on y_{i-1} in addition to its independent variables
  – Autocorrelated errors: The residuals ε_i are correlated
• Often relevant for time-series data
Lagged values
• In this case, we may run a multiple regression just as before, but including the previous value of the dependent variable, y_{t-1}, as a predictor variable for y_t.
• Use the model y_t = β0 + β1*x_1t + γ*y_{t-1} + ε_t
• A 1-unit increase in x_1 in the first time period yields an expected increase in y of β1, an increase of β1γ in the second period, β1γ² in the third period, and so on
• The total expected increase over all future periods is β1/(1-γ)
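The geometric decay of the effect can be sketched with illustrative coefficients (β1 = 1, γ = 0.5; the function name is ours):

```python
def cumulative_effect(b1, g, periods):
    """Sum of the expected increases b1, b1*g, b1*g^2, ... over `periods`."""
    return sum(b1 * g ** k for k in range(periods))

b1, g = 1.0, 0.5
print(cumulative_effect(b1, g, 1))    # immediate effect: b1
print(cumulative_effect(b1, g, 50))   # approaches b1/(1-g) = 2.0
```

This is just the geometric series b1*(1 + g + g² + …) = b1/(1-g), which converges as long as |γ| < 1.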
Example: Pension funds from textbook CD
• Want to use the market return for stocks (say, in million $) as a predictor for the percentage of pension fund portfolios at market value (y) at the end of the year
• Have data for 25 yrs -> 24 observations

Model Summary
Model 1: R = .980, R Square = .961, Adjusted R Square = .957, Std. Error of the Estimate = 2.288, Durbin-Watson = 1.008
Predictors: (Constant), lag, return
Dependent Variable: stocks

Coefficients
Model 1      B       Std. Error   Beta    t        Sig.   95% Confidence Interval for B
(Constant)   1.397   2.359                .592     .560   (-3.509, 6.303)
return       .235    .030         .359    7.836    .000   (.172, .297)
lag          .954    .042         1.041   22.690   .000   (.867, 1.042)
Dependent Variable: stocks
Get the model:
• y_t = 1.397 + 0.235*stock return + 0.954*y_{t-1}
• A one million $ increase in stock return one year yields a 0.24% increase in pension fund portfolios at market value
• For the next year: 0.235*0.954 = 0.22%
• And the third year: 0.235*0.954² = 0.21%
• For all future: 0.235/(1-0.954) = 5.1%
• What if you have a 2 million $ increase?
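The slide's numbers follow directly from the fitted coefficients:

```python
# Fitted model: y_t = 1.397 + 0.235*return + 0.954*y_{t-1}
b1, g = 0.235, 0.954

# Effect of a one million $ increase in year 1, 2, 3:
effects = [b1 * g ** k for k in range(3)]
print([round(e, 3) for e in effects])   # 0.235, 0.224, 0.214 (percent)

# Total effect over all future periods:
print(round(b1 / (1 - g), 1))           # 5.1 (percent)
```

A 2 million $ increase simply doubles every number above, since the model is linear in the return.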
Autocorrelated errors
• In the standard regression model, the errors are independent.
• Using standard regression formulas anyway can lead to errors: Typically, the uncertainty in the result is underestimated.
Autocorrelation – how to detect?
• Plotting residuals against time!
• The Durbin-Watson test compares the possibility of independent errors with a first-order autoregressive model:
  ε_t = ρ*ε_{t-1} + u_t
• Test statistic:
  d = Σ_{t=2}^{n} (e_t - e_{t-1})² / Σ_{t=1}^{n} e_t²
• Option in SPSS
• The test depends on K (no. of independent variables), n (no. of observations) and the significance level α
• Test H0: ρ = 0 vs. H1: ρ > 0
• Reject H0 if d < dL; accept H0 if d > dU; inconclusive if dL < d < dU
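The test statistic is easy to compute from a residual series (names are ours; simulated residuals): independent residuals give d near 2, while positive autocorrelation pushes d towards 0.

```python
import numpy as np

def durbin_watson(e):
    """d = sum_{t=2}^n (e_t - e_{t-1})^2 / sum_{t=1}^n e_t^2."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)

# Independent residuals: d should land close to 2.
iid = rng.normal(size=500)

# First-order autoregressive residuals with rho = 0.8: d well below 2.
ar = np.empty(500)
ar[0] = rng.normal()
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()

print(round(durbin_watson(iid), 1))   # close to 2
print(durbin_watson(ar) < 1.0)        # strong positive autocorrelation
```

Roughly, d ≈ 2(1 - ρ̂), which is why d near 2 indicates no autocorrelation and d near 0 indicates strong positive autocorrelation.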
Example: Pension funds

Model Summary
Model 1: R = .980, R Square = .961, Adjusted R Square = .957, Std. Error of the Estimate = 2.288, Durbin-Watson = 1.008
Predictors: (Constant), lag, return
Dependent Variable: stocks

• Want to test ρ = 0 on the 5% level
• Test statistic d = 1.008
• Have one independent variable (K = 1 in table 12 on p. 876) and n = 24
• Find critical values dL = 1.27 and dU = 1.45
• Reject H0
Autocorrelation – what to do?
• It is possible to use a two-stage regression procedure:
  – If a first-order autoregressive model ε_t = ρ*ε_{t-1} + u_t is appropriate, the model
    y_t - ρ*y_{t-1} = β0(1-ρ) + β1(x_1t - ρ*x_{1,t-1}) + … + βK(x_Kt - ρ*x_{K,t-1}) + u_t
    will have uncorrelated errors
• Estimate ρ from the Durbin-Watson statistic, and estimate the β's from the model above
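A minimal sketch of the two-stage idea (the helper name is ours): estimate ρ from the Durbin-Watson statistic via the approximation ρ̂ ≈ 1 - d/2, then quasi-difference the variables before refitting.

```python
import numpy as np

def quasi_difference(z, rho):
    """Return z_t - rho*z_{t-1} for t = 2..n (one observation is lost)."""
    z = np.asarray(z, dtype=float)
    return z[1:] - rho * z[:-1]

# Durbin-Watson statistic from the pension fund example:
d = 1.008
rho_hat = 1 - d / 2
print(round(rho_hat, 3))   # 0.496

# Transform y (and each x) the same way, then run an ordinary regression
# of the transformed y on the transformed x's.
y = np.array([1.0, 2.0, 3.5, 4.0])   # illustrative values only
print(quasi_difference(y, rho_hat).shape)
```

Note that the intercept of the transformed regression estimates β0(1-ρ), not β0, so it must be rescaled afterwards.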
Next time:
• What if the assumption of normality for your data is invalid?
• You have to forget all you have learnt so far, and do something else
• Non-parametric statistics