Regression
faculty.ecnu.edu.cn/picture/article/220/30/97/d953ae2a... · 2015-12-13
TRANSCRIPT
REVIEW
Scatter plots between X and Y
Covariance and correlation
Correlation coefficients
Correlation reveals the linear relationship between two variables, but NOT a causal relationship.
12/13/2015
Chen Fang (陈芳)
English Department, East China Normal University (华东师大英语系)
OVERVIEW
Regression
The regression line
Intercept and slope estimates in a regression line: unstandardized and standardized
Accuracy of prediction
Handling outliers
Hypothesis testing
Prediction as a purpose
SPSS
Extension
REGRESSION
Regression is the process of finding the best-fitting line using the data we have.
This line is most often used to predict or estimate a score on one variable (Y) from another (X):
Predict future occurrences
Estimate a mean value
Examples
PREDICTING A SCORE
Suppose we know nothing except that the mean psychological symptom score of ECNU students is 90 (based on measures such as the Hopkins Symptom Checklist), and that a given student comes from ECNU.
With no other information, our best prediction of that student's score is the mean, 90.
PREDICTING A SCORE
Suppose we have some extra information, such as the stress level of ECNU students, and we know that stress level is related to the number of symptoms. We can then give a more precise estimate of a student's symptom score, given that we know his or her stress level.
E.g., if I tell you the student's stress score is 42, your estimated symptom score may be 105.
E.g., if I tell you the student's stress score is 18, your estimated symptom score is 92.
FIND THE LINE
Finding the line is the same as finding the a and b that define it.
THE REGRESSION LINE
Think back to your high school math class…the equation for a straight line is:
$$ \hat{Y} = a + bX $$
where Ŷ is the predicted value of Y, a is the intercept, b is the slope of the regression line, and X is the value of X.
THE BEST-FITTING LINE: VISUAL HELP
The best-fitting line minimizes the errors of prediction.
The difference between the observed Y and the predicted Ŷ is called the residual; it is written (Y − Ŷ).
Ordinary least squares estimation: choose the line that minimizes the sum of the squared residuals, Σ(Y − Ŷ)².
Draw on the board.
No matter how the line tilts, the best-fitting line always passes through one point: the point of means, (X̄, Ȳ).
THE REGRESSION LINE
The slope of the best-fitting line just described can be estimated as shown below. This can be proven, but we will not do that here.
$$ b = \frac{\mathrm{cov}_{XY}}{s_X^2} = r\,\frac{s_Y}{s_X} $$
where r is the Pearson correlation between X and Y, s_Y is the standard deviation of Y, and s_X is the standard deviation of X. This b is the slope in Ŷ = a + bX.
THE REGRESSION LINE
After we solve for the slope (b), we can find the intercept a with:
$$ a = \bar{Y} - b\bar{X} = \frac{\sum Y - b\sum X}{N} $$
This a is the intercept in Ŷ = a + bX.
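As a rough illustration of the slope and intercept formulas, the Python sketch below computes b and a by hand. The stress and symptom scores are made up for this example (they are not the Table 10.1 data), and all variable names are our own.

```python
from statistics import mean, stdev

# Hypothetical (stress, symptom) scores, invented purely for illustration.
stress = [5, 10, 18, 25, 30, 42]
symptoms = [78, 84, 92, 90, 101, 105]

n = len(stress)
mean_x, mean_y = mean(stress), mean(symptoms)
s_x, s_y = stdev(stress), stdev(symptoms)  # sample SDs (N - 1 in the denominator)

# Sample covariance and Pearson correlation.
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(stress, symptoms)) / (n - 1)
r = cov_xy / (s_x * s_y)

# Slope, computed in the two equivalent ways: cov / s_X^2 and r * s_Y / s_X.
b_from_cov = cov_xy / s_x ** 2
b_from_r = r * s_y / s_x
b = b_from_cov

# Intercept, computed in the two equivalent ways.
a = mean_y - b * mean_x
a_from_sums = (sum(symptoms) - b * sum(stress)) / n


def predict(x):
    """Predicted Y for a given X, using Y-hat = a + bX."""
    return a + b * x


print(f"b = {b:.4f}, a = {a:.4f}")
print(f"prediction at mean X = {predict(mean_x):.4f}, mean Y = {mean_y:.4f}")
```

The last line is a check that the fitted line passes through the point of means.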
THE REGRESSION LINE
Interpretation: a one-point increase in X corresponds to a b-point increase in the predicted value of Y.
$$ \hat{Y} = a + bX $$
Here a and b are our regression coefficients.
STANDARDIZED DATA
So what do we do if our data are standardized, that is, both X and Y are z-scores?
The mean of X in this case is ____.
The mean of Y in this case is also ____.
The standard deviations of X and Y are both _____.
Now think about the equations we used to estimate our regression coefficients, and plug in the values above:
$$ a = \bar{Y} - b\bar{X} = 0 - b(0) = 0 $$
$$ b = r\,\frac{s_Y}{s_X} = r\,\frac{1}{1} = r $$
$$ \hat{Y} = a + bX = 0 + bX = 0 + \beta X $$
SPSS calls this beta, the standardized coefficient.
STANDARDIZED DATA
If our variables are standardized, the slope b equals the correlation r and the intercept is zero (when we have only one predictor)!
The standardized regression coefficient for the slope is often denoted β.
This changes our interpretation a bit…
$$ \hat{Y} = 0 + \beta X $$
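A quick way to convince yourself of this is to standardize some data and refit the line. The sketch below does that in Python; the scores are invented for illustration, and the helper names (standardize, fit_line, pearson_r) are our own.

```python
from statistics import mean, stdev

# Hypothetical scores, invented purely for illustration.
x = [5, 10, 18, 25, 30, 42]
y = [78, 84, 92, 90, 101, 105]


def standardize(values):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def covariance(xs, ys):
    """Sample covariance of two equally long lists."""
    mx, my = mean(xs), mean(ys)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / (len(xs) - 1)


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the SDs."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))


def fit_line(xs, ys):
    """Return (a, b) using b = cov_XY / s_X^2 and a = mean(Y) - b * mean(X)."""
    b = covariance(xs, ys) / stdev(xs) ** 2
    a = mean(ys) - b * mean(xs)
    return a, b


r = pearson_r(x, y)
a_std, beta = fit_line(standardize(x), standardize(y))

print(f"r = {r:.4f}")
print(f"standardized intercept = {a_std:.4f}, standardized slope (beta) = {beta:.4f}")
```

Up to floating-point rounding, the standardized intercept is 0 and beta equals r.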
INTERPRETATION OF STANDARDIZED SLOPE COEFFICIENT (BETA)
This changes our interpretation a bit…
Now a one-unit increase in X or Y corresponds to an increase of one standard deviation.
If β = 0.75, an increase of one standard deviation in X relates to a ¾ standard deviation increase in Y.
$$ \hat{Y} = 0 + \beta X $$
HYPOTHESIS TESTING
Regression has two coefficients (slope and intercept).
We still use a t-test to test each of them.
In SPSS, the t-test has a p-value attached to it.
As usual, if p < alpha (usually 0.05), we reject the null hypothesis and conclude that the coefficient is significantly different from 0.
JUST FOR YOUR INFORMATION
t-test for the correlation r: you can look up the critical t value, but we will leave this to SPSS.
$$ t = \frac{r}{s_r} = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}, \qquad df = N-2 $$
t-test for the slope coefficient b: it follows a t distribution with N − 2 degrees of freedom.
$$ t = \frac{b}{s_b} = \frac{b}{\dfrac{s_{Y-\hat{Y}}}{s_X\sqrt{N-1}}} = \frac{b\,s_X\sqrt{N-1}}{s_{Y-\hat{Y}}} = \frac{b\,s_X\sqrt{N-1}}{s_Y\sqrt{(1-r^2)\dfrac{N-1}{N-2}}} $$
You can use either equation to double-check the t statistic in the following example.
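The claim that either equation gives the same t statistic can be checked numerically. The Python sketch below does so on made-up scores (not the data used in the SPSS demo). It stops at the t statistic and degrees of freedom; the p-value still has to come from SPSS or a t table.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical (stress, symptom) scores, invented purely for illustration.
stress = [5, 10, 18, 25, 30, 42]
symptoms = [78, 84, 92, 90, 101, 105]

n = len(stress)
mean_x, mean_y = mean(stress), mean(symptoms)
s_x, s_y = stdev(stress), stdev(symptoms)
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(stress, symptoms)) / (n - 1)

r = cov_xy / (s_x * s_y)   # Pearson correlation
b = cov_xy / s_x ** 2      # unstandardized slope
a = mean_y - b * mean_x    # intercept

# t statistic from the correlation: t = r * sqrt(N - 2) / sqrt(1 - r^2).
t_from_r = r * sqrt(n - 2) / sqrt(1 - r ** 2)

# t statistic from the slope: t = b / s_b, with s_b = SEE / (s_X * sqrt(N - 1)).
ss_error = sum((y - (a + b * x)) ** 2 for x, y in zip(stress, symptoms))
see = sqrt(ss_error / (n - 2))          # standard error of estimate
s_b = see / (s_x * sqrt(n - 1))         # standard error of the slope
t_from_b = b / s_b

df = n - 2
print(f"t from r = {t_from_r:.4f}, t from b = {t_from_b:.4f}, df = {df}")
```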
SPSS DEMO
Table 10.1 data (online). Predicting symptom scores from level of stress. Can you write the equation for this best-fitting line?
Annotations on the SPSS coefficients output:
Intercept
Slope
Standardized slope coefficient; equals r (simple regression only).
Confidence interval: does not include 0. Yes, significant.
p-value: smaller than .05. Significant. Reject the null hypothesis and conclude the slope is different from 0.
We don't care about this.
Unstandardized slope coefficient
PREDICTED VS. ACTUAL SCORE
For a score of X = 25 we can predict a score of about 93.5 on Y. (How do we find the value? Substitute X = 25 into Ŷ = a + bX, using the intercept and slope from the SPSS output.)
We can then compare the actual values of Y that occurred at X = 25 with our prediction.
STANDARD ERROR OF ESTIMATE
(Y − Ŷ) is the error of prediction, or the residual.
Remember the sum of squared errors, SS_error?
$$ SS_{error} = \sum (Y-\hat{Y})^2 $$
The standard error of estimate (SEE) is defined as
$$ s_{Y-\hat{Y}} = s_{Y\cdot X} = \sqrt{\frac{\sum (Y-\hat{Y})^2}{N-2}} = \sqrt{\frac{SS_{error}}{N-2}} $$
This equation, like the standard deviation, describes how far, on average, our predicted scores are from the actual scores.
SEE can be more easily estimated with:
$$ s_{Y-\hat{Y}} = s_Y\sqrt{(1-r^2)\,\frac{N-1}{N-2}} $$
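To see that the definition and the shortcut agree, the Python sketch below computes SEE both ways: once from the residuals and once from s_Y and r. The scores are made up for illustration, and the variable names are our own.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical (stress, symptom) scores, invented purely for illustration.
stress = [5, 10, 18, 25, 30, 42]
symptoms = [78, 84, 92, 90, 101, 105]

n = len(stress)
mean_x, mean_y = mean(stress), mean(symptoms)
s_x, s_y = stdev(stress), stdev(symptoms)
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(stress, symptoms)) / (n - 1)

r = cov_xy / (s_x * s_y)
b = cov_xy / s_x ** 2
a = mean_y - b * mean_x

# Definition: square root of SS_error / (N - 2), where SS_error sums the squared residuals.
residuals = [y - (a + b * x) for x, y in zip(stress, symptoms)]
ss_error = sum(e ** 2 for e in residuals)
see_from_residuals = sqrt(ss_error / (n - 2))

# Shortcut: s_Y * sqrt((1 - r^2) * (N - 1) / (N - 2)).
see_from_r = s_y * sqrt((1 - r ** 2) * (n - 1) / (n - 2))

print(f"SEE from residuals = {see_from_residuals:.4f}")
print(f"SEE from s_Y and r = {see_from_r:.4f}")
```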
SEE AS A MEASURE OF THE ACCURACY OF PREDICTION
SEE is one way we can assess how well our regression equation is working.
Predicting with a predictor variable (X) is better than predicting without one, but it is not without error.
If the regression equation works well, the predicted values of Y should be very close to the real values.
The smaller the SEE, the better the equation predicts.
R² AS A MEASURE OF ACCURACY OF PREDICTION
$$ r^2 = \frac{SS_{\hat{Y}}}{SS_Y} = \frac{SS_Y - SS_{error}}{SS_Y} $$
In this equation, SS_Y is the total variability in Y, which is composed of two parts:
a. The part of the variability in Y that is associated with X, SS_Ŷ. Learn to use the following phrases for it:
that can be explained by X
that can be predicted by X
that is accounted for by X
that is attributable to X
b. The part of the variability in Y that is independent of X, which is SS_error.
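The split of SS_Y into a predictable part and an error part can be verified directly. The Python sketch below computes the three sums of squares on made-up scores and checks that both ratios equal the squared correlation; variable names are our own.

```python
from statistics import mean, stdev

# Hypothetical (stress, symptom) scores, invented purely for illustration.
stress = [5, 10, 18, 25, 30, 42]
symptoms = [78, 84, 92, 90, 101, 105]

n = len(stress)
mean_x, mean_y = mean(stress), mean(symptoms)
s_x, s_y = stdev(stress), stdev(symptoms)
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(stress, symptoms)) / (n - 1)

r = cov_xy / (s_x * s_y)
b = cov_xy / s_x ** 2
a = mean_y - b * mean_x
predicted = [a + b * x for x in stress]

# Total variability in Y.
ss_y = sum((y - mean_y) ** 2 for y in symptoms)
# Variability in Y associated with X (variability of the predicted scores).
ss_yhat = sum((y_hat - mean_y) ** 2 for y_hat in predicted)
# Variability in Y independent of X (squared residuals).
ss_error = sum((y - y_hat) ** 2 for y, y_hat in zip(symptoms, predicted))

r2_from_yhat = ss_yhat / ss_y
r2_from_error = (ss_y - ss_error) / ss_y

print(f"SS_Y = {ss_y:.3f}, SS_Yhat = {ss_yhat:.3f}, SS_error = {ss_error:.3f}")
print(f"r^2 = {r ** 2:.4f}, SS_Yhat/SS_Y = {r2_from_yhat:.4f}")
```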
PREDICTABLE VARIABILITY & R²
The higher the r², the better the predictors are working (the more variance in Y is explained by the predictors).
If a correlation (r) is found to be 0.8, we calculate r² = 0.8² = 0.64.
How do we interpret this?
This means that 64% of the variance of Y can be explained by the variability of X. You can use the phrases on the previous slide interchangeably.
Remember, this does NOT mean that 64% of Y is caused by X.
SPSS DEMO
Table 10.6 data. Predicting course quality from grades.
Interpret R².
Based on the regression equation, what will the predicted overall rating be if the student's expected grade is 3.1?
WHAT AFFECTS REGRESSION ANALYSIS? REMEMBER THIS?
4 major factors affect the calculation of correlations:
1. Nonlinearity of relationship
2. Restriction of range (or variance) of X and/or Y
3. Use of heterogeneous sub-samples
4. Outliers
USE SPSS
Sleep and mood: this is fictional data.
Create a scatter plot to check the linear relationship assumption. If the relationship is not linear, a transformation of the data might be necessary.
Check for outliers. Delete outliers.
Read the output.
Make your conclusion. Interpret the results. Evaluate model fit.
SOME EXTENSIONS OF REGRESSION
Multiple Regression
This method involves the prediction of Y from several independent variables (IVs).
There is no regression line in this case.
We calculate a single a (intercept) for the multiple regression, but each IV has a separate b (slope) estimated.
These slopes are not exactly the slope of a line; they are more like weights.
We can use a t-test to test whether each of these weights helps predict Y. Again, SPSS helps.
Our r² never decreases, and usually increases, as we add more predictors: we can explain more and more of the variance in Y by adding different predictors.
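To make the idea of one intercept plus one weight per IV concrete, here is a minimal Python sketch of a least-squares fit with one and then two predictors. It is only an illustration, not what SPSS does internally: the data are invented, sleep_hours is a hypothetical second predictor, and the helpers solve and fit are our own. The coefficients are found by solving the normal equations with plain Gaussian elimination.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            factor = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= factor * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit(predictors, y):
    """Least-squares fit of y on the predictors; returns ([a, b1, b2, ...], R^2)."""
    rows = [[1.0] + list(values) for values in zip(*predictors)]  # 1.0 carries the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    y_hat = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return coefs, 1 - ss_error / ss_y


# Hypothetical data, invented purely for illustration.
stress = [5, 10, 18, 25, 30, 42]
sleep_hours = [8, 7, 7, 5, 6, 4]
symptoms = [78, 84, 92, 90, 101, 105]

coefs_one, r2_one = fit([stress], symptoms)
coefs_two, r2_two = fit([stress, sleep_hours], symptoms)

print(f"one predictor:  a, b = {coefs_one}, R^2 = {r2_one:.4f}")
print(f"two predictors: a, b1, b2 = {coefs_two}, R^2 = {r2_two:.4f}")
```

With two predictors the fit returns one intercept and two weights, and R² for the larger model is never smaller than for the one-predictor model.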