


REGRESSION DIAGNOSTICS

T. Krishnan Cranes Software International Limited, Mahatma Gandhi Road, Bangalore - 560 001

[email protected]

1. Introduction

The main aim of regression modelling and analysis is to develop a good predictive relationship between the dependent (response) and independent (predictor) variables. Regression Diagnostics plays a vital role in finding and validating such a relationship. In this presentation, we discuss issues that arise in the development of a multiple linear regression model. Consider the following standard multiple linear regression model:

Y = β0 + β1x1 + β2x2 + … + βpxp + ε

where Y is the response variable, the x's are predictor variables, the β's are the (regression) parameters to be estimated from data, and ε is the random error term. If the data set for this regression model building consists of n cases, then there are n (p+1)-dimensional observations (yi, x1i, x2i, …, xpi), i = 1, 2, …, n, in the data set, where the εi's are distributed normally and independently with zero mean and common variance σ2. This model is generally fitted under the following assumptions:

• Independence: Observations (and hence residuals) are statistically independently distributed.

• Normality: The residuals are normally distributed with zero mean.

• Homoscedasticity: All the observations (and hence the residuals) have the same variance.

The estimation method generally used is least squares, although the maximum likelihood method is also sometimes used; under the normality assumption the estimates of the β's are the same by both methods. For least-squares estimation, the normality assumption is not needed; it is needed when sampling distributions are considered, say in testing hypotheses. Regression is mainly used for predicting the value of Y at given values of the x's (within their observed range or close to it), employing the model developed using the data (often called the training data or training sample). Sometimes regression is also used as an explanatory tool. The predicted value of Y at given values x1, x2, …, xp is obtained by using the estimates of the β's in the model without the ε term. We generally look for a regression relation that permeates all the observations and does not derive largely from one or two of them. Thus we would like to identify those observations that are outliers (do not belong to the population, in some sense), or that exert undue influence on the regression coefficients one way or another. By eliminating them we may be in a position to let the regression relation derived from the rest permeate the remaining observations.
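As a rough illustration of how such a model is fitted by least squares, here is a minimal sketch in Python using statsmodels (this is not the package that produced the output shown in this presentation, and the small data frame below is made up purely for illustration):

# Minimal sketch: fitting Y = b0 + b1*x1 + b2*x2 + e by least squares.
# The toy data and the column names y, x1, x2 are illustrative only.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "y":  [2.3, 3.1, 4.0, 4.8, 6.1, 6.9],
    "x1": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5],
    "x2": [0.2, 0.4, 0.3, 0.6, 0.5, 0.8],
})

fit = smf.ols("y ~ x1 + x2", data=df).fit()
print(fit.params)     # estimated regression coefficients b0, b1, b2
print(fit.summary())  # standard errors, t and p values, R-squared, ANOVA F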


2. Simple Linear Regression: Example

Data Set and Scatterplot

Let us start with an example of simple linear regression and understand the various elements of a typical computer software output. Here 'simple', as opposed to 'multiple', means that there is only one predictor variable. The data file LINSEED (presented in the Appendix) contains data on leaf area and stem length of 34 linseed plants. The object is to develop a (linear) formula to predict leaf area from stem length. Let us first look at a scatterplot of leaf area (variable name: area, on the y-axis) against stem length (variable name: length, on the x-axis) to see if there is a relationship and, if so, whether it is linear.

Figure 1: Scatter Plot of Area vs Length
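A sketch of how such a scatterplot might be produced in Python (matplotlib and pandas assumed; linseed.csv is a hypothetical CSV export of the LINSEED data in the Appendix, with columns AREA and LENGTH):

# Scatterplot of leaf area against stem length, as in Figure 1.
import pandas as pd
import matplotlib.pyplot as plt

linseed = pd.read_csv("linseed.csv")   # hypothetical export of the Appendix data
plt.scatter(linseed["LENGTH"], linseed["AREA"])
plt.xlabel("LENGTH (cm)")
plt.ylabel("AREA (cm^2)")
plt.show()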


The plot looks reasonably linear and it appears that leaf area can be successfully predicted from stem length.

Basic Regression Computation

Let us carry out the simple linear regression computations.

Dependent Variable            ¦ AREA
N                             ¦ 34
Multiple R                    ¦ 0.976
Squared Multiple R            ¦ 0.953
Adjusted Squared Multiple R   ¦ 0.952
Standard Error of Estimate    ¦ 7.189


Regression Coefficients

Effect    ¦ Coefficient  Standard Error  Std. Coefficient  Tolerance      t     p-value
----------+-----------------------------------------------------------------------------
CONSTANT  ¦    5.749         3.479            0.000            .        1.653    0.108
LENGTH    ¦    1.333         0.052            0.976          1.000     25.555    0.000

Effect    ¦ Coefficient   95.0% Confidence Interval
----------+-----------------------------------------
CONSTANT  ¦    5.749        -1.337      12.835
LENGTH    ¦    1.333         1.226       1.439

Analysis of Variance

Source      ¦      SS       df   Mean Squares   F-ratio   p-value
------------+------------------------------------------------------
Regression  ¦ 33749.813      1     33749.813    653.071     0.000
Residual    ¦  1653.716     32        51.679

From this analysis we get the regression equation as:

leaf area = 5.749 + 1.333 × stem length

3. Multiple Correlation

The same output includes a statistic called the squared multiple correlation (squared multiple R). This R2 has two equivalent interpretations:

• It is the square of the correlation between Y and its estimate;
• It is R2 = Regression Sum of Squares / Total Sum of Squares,

where Total Sum of Squares = Regression Sum of Squares + Residual Sum of Squares, so that equivalently R2 = 1 - Residual Sum of Squares / Total Sum of Squares.

Thus R2 gives the proportion of variation explained by the predictor and hence is a useful indicator of the usefulness of the fitted regression. The R2 here is the proportion of variation in the dependent variable, area, accounted for by the linear prediction using length. The value here (0.953) tells us that approximately 95% of the variation in area can be accounted for by a linear prediction from length. The rest of the variation, as far as this model is concerned, is random error. The square root of this statistic is called, not surprisingly, the multiple correlation (multiple R). The multiple R2 and the p-value in the ANOVA show a fairly good predictive value of stem length; the residual variance is fairly small and the confidence intervals narrow, showing that the regression estimate and the predictive power are good.

Adjusted R2

Notice that the output has, besides R2, a quantity called Adjusted R2. The adjusted squared multiple R (0.952) is what we would expect the squared multiple correlation to be if we used the model we just estimated on a new sample of 34 plants. It is smaller than the squared multiple correlation because the coefficients were optimized for this sample rather than for the new one.
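The adjustment can be computed directly from R2, the sample size n and the number of predictors p; a small sketch of the standard formula, checked against the values reported above:

# Adjusted R^2 from R^2, the sample size n and the number of predictors p.
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# For the LINSEED regression above (R^2 = 0.953, n = 34, p = 1):
print(round(adjusted_r2(0.953, 34, 1), 3))   # about 0.952, as in the output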


4. Regression Diagnostics

To have a model and estimates that mean something, however, you should be sure the assumptions are reasonable and that the sample data appear to have been sampled from a population that meets the assumptions. The sample analogues of the errors in the population model are the residuals---the differences between the observed and predicted values of the dependent variable.

Saving Residuals for Diagnosis

Various useful statistics from the regression computation---residuals, data, model, coefficients, etc.---should be saved for examining the results of the regression analysis. These saved statistics can be used for further analysis, especially for diagnostics of the kind discussed in this presentation. The residuals can be used to examine the cases vis-a-vis the regression. Notice that the mean of the residuals is zero and that the variance of the residuals (with a denominator of 32, the DF of the Residual) is 51.679, the Residual MS, an estimate of the error variance. The saved residuals are given in a table in the Appendix.

Normal Probability Plot

There are many diagnostics you can perform on the residuals. Here are the most important ones. To diagnose whether the errors are normally distributed, draw a normal probability plot (PPLOT) of the residuals.

Figure 2:Normal PPLOT of the Residuals

The residuals should fall approximately on a diagonal straight line in this plot. When the sample size is small, as in our example, the line may be quite jagged. It is difficult to tell by any method whether a small sample is from a normal population.
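A sketch of how this plot, and the formal normality tests reported below, might be run on the saved residuals (scipy and statsmodels assumed; linseed.csv is again the hypothetical export of the Appendix data):

# Normal probability plot and formal normality tests for the residuals.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from scipy import stats

linseed = pd.read_csv("linseed.csv")               # hypothetical export: AREA, LENGTH
resid = smf.ols("AREA ~ LENGTH", data=linseed).fit().resid

stats.probplot(resid, dist="norm", plot=plt)       # points should hug the diagonal line
plt.show()

print(stats.shapiro(resid))                        # Shapiro-Wilk statistic and p-value
print(stats.anderson(resid, dist="norm"))          # Anderson-Darling statistic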


Histogram of Residuals

You can also plot a histogram or stem-and-leaf diagram of the residuals to see if they are lumpy in the middle with thin, symmetric tails.

Figure 3:Histogram of the Residuals

From the histogram, the errors appear to be normally distributed.

Q-Q Plot of Residuals

You can also make a quantile plot (Q-Q plot), where the residuals are plotted against their empirical cumulative probability points (0 to 1); if normality holds, this curve should have an S shape.

Figure 4:Quantile Plot of the Residuals

The S-shape of the curve seems to suggest that normality assumption is satisfied.
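One way such a quantile plot can be drawn is to plot the empirical cumulative probabilities against the ordered residuals; a sketch under the same assumptions about the data file:

# Quantile plot: empirical cumulative probabilities against the ordered residuals.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

linseed = pd.read_csv("linseed.csv")               # hypothetical export: AREA, LENGTH
resid = smf.ols("AREA ~ LENGTH", data=linseed).fit().resid

n = len(resid)
probs = (np.arange(1, n + 1) - 0.5) / n            # empirical cumulative probabilities
plt.plot(np.sort(resid), probs, "o")
plt.xlabel("Ordered residual")
plt.ylabel("Cumulative probability")
plt.show()                                         # an S-shaped pattern supports normality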


Normality Tests on Residuals

You can also carry out formal tests of normality---the Shapiro-Wilk test and/or the Anderson-Darling test---with the following results:

                                      ¦ RESIDUAL
--------------------------------------+---------
N of Cases                            ¦  34.000
Standard Deviation                    ¦   7.079
Shapiro-Wilk Statistic                ¦   0.974
Shapiro-Wilk p-value                  ¦   0.573
Anderson-Darling Statistic            ¦   0.299
Adjusted Anderson-Darling Statistic   ¦   0.306
p-value                               ¦  >0.15

The plots and the tests indicate that the normality assumption on the residuals can be deemed satisfied.

Residuals vs Predicted Value Plot

To examine whether the errors are independent, several plots can be made. Look at the plot of residuals against estimated values. Make sure that the residuals are randomly scattered above and below the 0 horizontal line and that they do not track in a snaky way across the plot. If they look as if they were shot at the plot by a horizontally moving machine gun, then they are probably not independent of each other.

Figure 5: Residual vs Predicted Value

Although the mean of the residual may be accepted to be zero at each x-value (or predicted value), the variance seems to increase with x-value suggesting a possible violation of the homoscedasticity assumption.
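A sketch of the residuals-versus-fitted-values plot from the saved regression results (same hypothetical data file as before):

# Residuals against fitted (estimated) values: look for random scatter about 0
# and for any fanning out that would suggest non-constant variance.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

linseed = pd.read_csv("linseed.csv")               # hypothetical export: AREA, LENGTH
fit = smf.ols("AREA ~ LENGTH", data=linseed).fit()

plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("ESTIMATE (fitted value)")
plt.ylabel("RESIDUAL")
plt.show()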


Standardized Residuals

Standardized residuals are often more useful since they are easier to interpret. The following plot shows standardized residuals against estimated values (ESTIMATE). Use these statistics to identify outliers in the dependent-variable space. Large values (greater than 2 or 3 in absolute magnitude) indicate possible problems.

Figure 6: Standardized Residuals vs Predicted Values

Case no. 16, with a standardized residual of about -2, can be considered an outlier. If we delete this case from the data set and recompute the regression, the fit becomes better, but whether we should do so needs careful consideration.

Autocorrelation Plot

You may also want to plot residuals against other variables, such as time, orientation, or anything else that might influence the variability of your dependent measure. An ACF (autocorrelation function) plot of the residuals, of the kind produced in time series analysis, shows whether the residuals are serially correlated. If each residual is not predictable from the one preceding it (and the one preceding that, and so on), all the bars should lie within the confidence bands.

Figure 7: ACF Plot of the Residuals


There seems to be a serial correlation of order 1, so the independence assumption may not be satisfied.

Durbin-Watson Test for Autocorrelation

You may also conduct a formal test for autocorrelation with the Durbin-Watson test. In this case the results are:

Durbin-Watson D Statistic   ¦ 1.192
First Order Autocorrelation ¦ 0.387

This statistic is meant to be a kind of test of the validity of the independence assumption. It tests a limited independence hypothesis---that the first-order autocorrelation of the residuals is zero---against the alternative that it is not zero. It is not a foolproof test of independence. In this case the significance of a first-order autocorrelation of 0.387 is judged by comparing the Durbin-Watson statistic to the percentage points given in, say, Draper and Smith (1998). This test does not show significance. So this data set may not fulfil the homoscedasticity assumption but seems to satisfy the normality assumption and possibly also the independence assumption, on the basis of which the regression has been computed.

Summary So Far

The basic regression output has:

• estimates of the regression coefficients;
• their standard errors, confidence intervals and tests of their significance;
• the analysis of variance for regression, which produces an overall test of significance and an estimate of the error variance as the residual (error) mean square;
• the multiple correlation coefficient, its square, and its adjusted value, which give a measure of how much of the variation has been captured by the predictor variables and hence how useful the regression is;
• the first-order autocorrelation of the residuals and the Durbin-Watson statistic to test its significance; notice that this depends on the order in which the data are arranged, so the test is useful only if the data are arranged in some 'natural' order of their occurrence;
• the option to save residuals, data, coefficients, etc. Using the saved residuals you can make suitable plots to examine the assumption of normality, carry out a formal test of significance for normality, make suitable plots to examine the assumption of homoscedasticity, and make suitable plots to examine the assumption of independence.

Thus some basic diagnosis of the validity of the assumptions under which a standard regression analysis is carried out can be performed using this regression output.
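A sketch of the serial-correlation checks (the ACF plot and the Durbin-Watson statistic) on the saved residuals, assuming statsmodels and the same hypothetical data file; remember that these are meaningful only if the cases are in some natural order:

# ACF plot and Durbin-Watson statistic for the residuals of the LINSEED fit.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.stats.stattools import durbin_watson

linseed = pd.read_csv("linseed.csv")               # hypothetical export: AREA, LENGTH
resid = smf.ols("AREA ~ LENGTH", data=linseed).fit().resid

plot_acf(resid, lags=10)                           # bars outside the bands suggest serial correlation
plt.show()

print("Durbin-Watson D:", durbin_watson(resid))    # values near 2 suggest little first-order autocorrelation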

Errors vs Residuals

We defined the errors εi, i = 1, 2, …, n, as independent and identically distributed normal variables with zero mean and variance σ2. We used the residuals as their estimates. Notice that the residuals obey the condition that their sum is zero; thus they are not independent. Further, their variances depend on the xi: the farther an xi is from the mean of the x's, the smaller is the variance of the corresponding residual. Thus standardization by dividing by the standard deviation of the residuals is not adequate. The covariance matrix of the residuals takes the form

σ2 (I - H)

where the matrix H depends on the observations xi. The ith diagonal entry hii of H is thus related to the variance of the ith residual; it is called the leverage of the ith observation and indicates the effect of the ith case (observation). Notice that the leverage does not depend on the observations on the response variable, but only on the observations on the predictor variables. Points for which the values of the predictor variables are much different from the average values of the predictor variables are called high-leverage points. The closer hii is to 1, the greater the leverage. A useful rule of thumb in this context is: if the leverage is > 2p/n, then the case is regarded as one of high leverage.

Studentized Residuals

This leads to the notion of Studentization. The variance of the ith residual is σ2 (1 - hii). Thus an appropriate quantity by which to standardize a residual is the square root of an estimate of this variance; such a standardization is called Studentization, and the Studentized residual for the ith observation is the residual divided by the square root of an estimate of its variance. For this, the Residual Mean Square from the ANOVA can be used as an estimate of σ2, and hii can be computed from the known x values. Most regression software computes Studentized residuals, and when you save regression output you can save the Studentized residuals as well. If the Studentized residual is large in absolute value, we consider the case a possible outlier. The leverages do not suggest that any of the observations in the LINSEED data set is a high-leverage point, although, as seen below, case no. 16 has a noticeably large (negative) Studentized residual.

Figure 8: Studentized Residuals vs Predicted Values

Case no. 16 with a studentized value of -2 can be considered an outlier. If we delete this case from the data set and recompute the regression, the fit becomes better, but whether we should do it or not needs some careful consideration.
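A sketch of how leverages and Studentized residuals might be extracted for such a plot with statsmodels (same hypothetical data file; the 2(p+1)/n cutoff below is one common variant of the rule of thumb quoted earlier):

# Leverages and Studentized residuals for the LINSEED regression.
import pandas as pd
import statsmodels.formula.api as smf

linseed = pd.read_csv("linseed.csv")               # hypothetical export: AREA, LENGTH
fit = smf.ols("AREA ~ LENGTH", data=linseed).fit()
infl = fit.get_influence()

leverage = infl.hat_matrix_diag                    # h_ii, the diagonal of H
student = infl.resid_studentized_internal          # residual / sqrt(MSE * (1 - h_ii))

n = len(linseed)
k = int(fit.df_model) + 1                          # parameters estimated, including the constant
report = pd.DataFrame({"leverage": leverage, "studentized": student},
                      index=range(1, n + 1))       # 1-based case numbers as in the text
print(report[(report["leverage"] > 2 * k / n) | (report["studentized"].abs() > 2)])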


Outliers

Outliers are points that display a different pattern of relationship between the response and the predictors than the other data points. A point whose inclusion makes a great deal of difference to the regression coefficients is called an influential point. Influential points can be outliers and/or high-leverage points; outliers without high leverage, or high-leverage points that are not outliers, are generally not influential. Thus it is important to distinguish between two types of 'outliers'. Outliers with respect to the response variable are the real outliers and they indicate model misfit. Outliers with respect to the predictors are called leverage points. Their presence in the data set can also affect the regression model, even if the corresponding responses are not outliers; they have a tendency to determine the regression coefficients by themselves and to change the standard errors of the regression coefficient estimates.

Multiple Linear Regression: Example

Basic Regression Computation

Let us analyze a data set with more than one predictor. Consider the data set ARSENIC (presented in the Appendix) on 21 individuals (cases), where the dependent variable is the amount of arsenic in the nails (variable name: arsnails) and the predictors are the amount of arsenic in the water (variable name: arswater) and the individual's age. It is important to save the residuals for diagnostic analysis; the residuals file is presented in the Appendix. We present below a part of the output upon running a regression.

Dependent Variable            ¦ ARSNAILS
N                             ¦ 21
Multiple R                    ¦ 0.897
Squared Multiple R            ¦ 0.804
Adjusted Squared Multiple R   ¦ 0.783
Standard Error of Estimate    ¦ 0.227

Regression Coefficients

Effect    ¦ Coefficient  Standard Error  Std. Coefficient  Tolerance      t     p-value
----------+-----------------------------------------------------------------------------
CONSTANT  ¦    0.199         0.161            0.000            .        1.240    0.231
ARSWATER  ¦   13.149         1.609            0.908          0.881      8.174    0.000
AGE       ¦   -0.001         0.003           -0.033          0.881     -0.293    0.773

Effect    ¦ Coefficient   95.0% Confidence Interval
----------+-----------------------------------------
CONSTANT  ¦    0.199        -0.138       0.537
ARSWATER  ¦   13.149         9.769      16.528
AGE       ¦   -0.001        -0.008       0.006


Analysis of Variance

Source      ¦     SS     df   Mean Squares   F-ratio   p-value
------------+----------------------------------------------------
Regression  ¦   3.814     2      1.907        37.036     0.000
Residual    ¦   0.927    18      0.051

*** WARNING *** :
Case 14 has large Leverage    (Leverage: 0.768)
Case 14 is an Outlier         (Studentized Residual: 4.338)
Case 14 has large Influence   (Cook Distance: 10.430)
Case 17 is an Outlier         (Studentized Residual: -7.317)

Durbin-Watson D Statistic   ¦ 1.525
First Order Autocorrelation ¦ 0.234

Stem and Leaf Plot of Variable: LEVERAGE, N = 21

Minimum     : 0.053351951
Lower Hinge : 0.059521963
Median      : 0.072999442
Upper Hinge : 0.131436293
Maximum     : 0.767889509

0 H 555555 0 M 666677 0 9 1 0 1 H 23 1 5 1 1 2 2 2 * * * Outside Values * * * 2 4 4 0 7 6

It appears that the effect of age on arsenic in nails is negligible, while the effect of arsenic in water is significant and has good predictive power. Let us concentrate on the four warning messages in the output. Cases 14 and 17 have data-related problems and are outliers. Case 14 has a large leverage and also a large Studentized residual. Case 17 does not have such a large leverage, but since the residual itself is large, its Studentized residual is large. The residual statistics are given in the Appendix.

Influential Points and Cook's Distance

We now notice a new statistic, called COOK, in the saved residuals file: Cook's Distance. Cook's distance measures the influence of each sample observation on the coefficient estimates. Observations that are far from the average of all the independent-variable values or that have large residuals tend to have a large Cook's distance value (say, greater than 2).
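A sketch of how Cook's distances and warnings of this kind might be reproduced with statsmodels; arsenic.csv is a hypothetical CSV export of the ARSENIC data in the Appendix, and the F-median yardstick below is one common convention (the text below compares COOK to an F distribution with p and n-p degrees of freedom):

# Cook's distances for the ARSENIC regression, with a rough F-based yardstick.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

arsenic = pd.read_csv("arsenic.csv")               # hypothetical export: ARSNAILS, ARSWATER, AGE
fit = smf.ols("ARSNAILS ~ ARSWATER + AGE", data=arsenic).fit()

cooks_d = fit.get_influence().cooks_distance[0]    # first element holds the distances
n = len(arsenic)
k = int(fit.df_model) + 1                          # parameters estimated, including the constant

yardstick = stats.f.ppf(0.5, k, n - k)             # median of F(k, n - k)
for case, d in enumerate(cooks_d, start=1):
    if d > yardstick:
        print(f"Case {case}: Cook's distance {d:.3f} exceeds {yardstick:.3f}")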


Cook's D closely follows an F distribution, so what counts as an aberrant value depends on the sample size. As a rule of thumb, under the normal regression assumptions, COOK can be compared to an F distribution with p and n-p degrees of freedom. We do not want to find a large Cook's D value for an observation, because it would mean that the coefficient estimates would change substantially if we deleted that observation. Only one observation, no. 14, has a large Cook's distance (> 10) and hence is an influential observation.

Another diagnostic statistic useful for assessing the model fit is leverage, discussed in Belsley, Kuh, and Welsch (1980). Leverage helps to identify outliers in the independent-variable space. Leverage has an average value of (p+1)/n, where (p+1) is the number of estimated parameters (including the constant) and n is the number of cases. What is a high value of leverage? In practice, it is useful to examine the values in a stem-and-leaf plot and identify those that stand apart from the rest of the sample. However, various rules of thumb have been suggested. For example, values of leverage less than 0.2 appear to be safe; between 0.2 and 0.5, risky; and above 0.5, to be avoided. Another says that if p > 6 and (N - p) > 12, use as a cutoff. An F approximation is often used to determine this value for warnings (Belsley, Kuh, and Welsch, 1980).

In conclusion, keep in mind that all our diagnostic tests are themselves a form of inference. We can assess theoretical errors only through the dark mirror of our observed residuals. Despite this caveat, testing assumptions graphically is critically important. You should never publish regression results until you have examined these plots. Diagnostics in which we examine the effect of a case by deleting it from the data and re-doing the analysis are called deletion diagnostics. Cook's distance is one of a suite of leave-one-out deletion diagnostics often used in linear and generalized linear models. See Belsley, Kuh and Welsch (1980) and Cook and Weisberg (1982) for more details.

Multiple Regression: Multicollinearity

Let us consider another multiple regression example with quite a few predictor variables. An experiment was conducted to calibrate a near infrared reflectance (NIR) instrument for the measurement of the protein content of ground wheat samples. Protein content measurements were made by the standard Kjeldahl method, and the six values L1-L6 are measurements of the reflectance of NIR radiation by the wheat samples at six different wavelengths in the range 1680-2310 nm. The data are given in the file PROTEIN. Let us first look at the correlation matrix of the predictor variables:

Correlation Matrix of NIR Reflectance at Six Wavelengths L1-L6

Pearson Correlation Matrix
    ¦   L1      L2      L3      L4      L5      L6
----+-----------------------------------------------
L1  ¦  1.000
L2  ¦  0.994   1.000
L3  ¦  0.996   0.999   1.000
L4  ¦  0.995   0.980   0.984   1.000
L5  ¦  0.937   0.925   0.934   0.954   1.000
L6  ¦  0.989   0.988   0.989   0.989   0.949   1.000


Figure 9: Protein Content of Ground Wheat Samples Scatter Plot Matrix of NIR Reflectance at Six Wavelengths L1-L6

You immediately notice from the correlation table that the predictor variables are highly inter-correlated, and from the SPLOM (an acronym for Scatter PLOt Matrix) that the predictor variables have a very high degree of linear relationship. This means that their covariance matrix, or correlation matrix, or the matrix X'X, is nearly singular. In the least-squares regression computation this matrix has to be inverted. This inversion operation is fraught with problems similar to division by zero or by a very small number; the results, if any, will be unstable and subject to serious errors. Such problems are called ill-conditioned problems. In the case of the correlation matrix, the situation is known as multicollinearity or simply collinearity. The effects of multicollinearity in regression computations are that the least-squares estimates are poor estimates and have high variances and covariances. This situation also means that it may be possible to eliminate some variables that are redundant in the presence of others and still fit a satisfactory regression. However, this exercise may not always be easy, since near-linear dependence, especially one involving a number of variables, is not easy to discern. The SPLOM gave a simple way of detecting multicollinearity. There are other, numerical means of detecting multicollinearity; these routinely form a part of the output of multiple linear regression programs. For the PROTEIN regression they are as follows:


Eigenvalues of Unit Scaled X'X

     1       2       3       4       5       6       7
------------------------------------------------------
  5.991   1.007   0.001   0.000   0.000   0.000   0.000

Condition Indices

     1       2        3         4         5          6          7
-------------------------------------------------------------------
  1.000   2.439   64.418   129.282   300.187   1285.534   2616.633

Variance Proportions

          ¦    1       2       3       4       5       6       7
----------+------------------------------------------------------
CONSTANT  ¦  0.000   0.000   0.000   0.012   0.170   0.194   0.624
L1        ¦  0.000   0.000   0.000   0.000   0.001   0.003   0.996
L2        ¦  0.000   0.000   0.003   0.010   0.000   0.739   0.248
L3        ¦  0.000   0.000   0.000   0.000   0.000   0.975   0.024
L4        ¦  0.000   0.000   0.000   0.002   0.019   0.006   0.973
L5        ¦  0.000   0.000   0.140   0.205   0.005   0.278   0.372
L6        ¦  0.000   0.003   0.007   0.138   0.130   0.091   0.629

Dependent Variable            ¦ PROTEIN
N                             ¦ 24
Multiple R                    ¦ 0.991
Squared Multiple R            ¦ 0.982
Adjusted Squared Multiple R   ¦ 0.976
Standard Error of Estimate    ¦ 0.220

Regression Coefficients

Effect    ¦ Coefficient  Standard Error  Std. Coefficient  Tolerance      t     p-value
----------+-----------------------------------------------------------------------------
CONSTANT  ¦   23.074         9.899            0.000            .        2.331    0.032
L1        ¦    0.028         0.082            0.658          0.000      0.342    0.736
L2        ¦    0.002         0.087            0.033          0.000      0.019    0.985
L3        ¦    0.235         0.077            5.025          0.000      3.035    0.007
L4        ¦   -0.240         0.063           -5.182          0.001     -3.803    0.001
L5        ¦    0.012         0.006            0.371          0.029      1.932    0.070
L6        ¦   -0.036         0.046           -0.427          0.004     -0.782    0.445

Effect    ¦ Coefficient   95.0% Confidence Interval        VIF
----------+-----------------------------------------------------
CONSTANT  ¦   23.074         2.189      43.959                .
L1        ¦    0.028        -0.145       0.201         3512.157
L2        ¦    0.002        -0.182       0.186         2890.931
L3        ¦    0.235         0.072       0.398         2610.334
L4        ¦   -0.240        -0.374      -0.107         1767.519
L5        ¦    0.012        -0.001       0.025           35.070
L6        ¦   -0.036        -0.132       0.060          283.855

Analysis of Variance

Source      ¦     SS     df   Mean Squares   F-ratio   p-value
------------+----------------------------------------------------
Regression  ¦  45.409     6      7.568       155.885     0.000
Residual    ¦   0.825    17      0.049


*** WARNING *** :
Case 4 is an Outlier          (Studentized Residual: 3.016)
Case 17 has large Leverage    (Leverage: 0.628)

Durbin-Watson D Statistic   ¦ 1.772
First Order Autocorrelation ¦ 0.113

Residuals and data have been saved.

Tolerance and Variance Inflation Factors

Let us consider both the Y variable and the X variables in standardized form. Let Ri2 be the R2 (squared multiple correlation) when variable Xi is regressed on the remaining predictor variables (including the constant). The tolerance of Xi is defined to be TOLi = 1 - Ri2. The Variance Inflation Factor (VIF) of predictor variable Xi is defined as VIFi = 1 / TOLi and is a measure of the amount by which the variance of the standardized regression coefficient is inflated by multicollinearity. Generally TOLi < 0.1, or equivalently VIFi > 10, is regarded as a sign of multicollinearity in respect of that predictor, and similar behaviour of the average TOL or VIF as a sign of multicollinearity of the entire model. In this data set, it is evident that there is an intolerable amount of multicollinearity on all counts.

Condition Indices and Variance Proportions

Another aspect of multicollinearity is the investigation of the computational issue of instability of the X'X matrix. For this, we first scale the X matrix so that the columns have unit sums of squares. This makes the trace of the matrix X'X equal to (p+1), so that no units of measurement are involved. A singular matrix X has as many zero singular values (or eigenvalues of X'X) as the deficiency in its rank. Near-singular matrices have small singular values; the scaling to trace = p+1 (with average eigenvalue 1) makes it easier to judge this 'smallness'. The condition indices CIi, i = 1, 2, …, p+1, are defined as the square root of the ratio of the largest eigenvalue to the ith eigenvalue. Note that this eigen-analysis is the same as a principal component analysis of the predictor variables: the eigenvalues are the variances of the principal components and the eigenvectors make up the coefficients of the principal components. Variance proportions are the proportions of the variance of the regression coefficient estimates accounted for by the principal component associated with each of the above eigenvalues. Thus the table gives, for each variable (along the rows), the proportion of the variance of its regression coefficient estimate that is due to each of the principal components (or factors, represented by the columns); the rows sum to 1. A condition index > 30 is taken as a yardstick for serious collinearity problems, and a condition index > 15 as a sign of possible problems. These condition indices are worth investigating with the help of the variance (decomposition) proportions as follows: if a factor with a large condition index is associated with more than one variable having a variance proportion > 0.5, then these variables are a source of multicollinearity. This is done by browsing each row (variable) of the Variance Proportions table and noting whether two or more columns (factors) corresponding to large condition indices (the latter columns) have values > 0.5.


In the table above, all factors except 1 and 2 have large condition indices. Looking through the columns of components 3, 4, 5, 6 and 7, we notice that only factors 6 and 7 have large (> 0.5) variance proportions: factor 6 for predictors L2 and L3, and factor 7 for predictors L1, L4 and L6. Thus all predictors but L5 seem problematic; moreover, L5 has the smallest VIF (although it is still unacceptably high!). However, a simple regression with L5 alone turns out to be a poor model. In this data set, most of the eigenvalues are nearly zero, indicating that the predictors form a relatively redundant set. Finally, we could look at the correlation matrix of the regression coefficient estimates. For this data set, these estimates are highly correlated, giving further evidence that there is too much multicollinearity in the data to provide stable estimates.

Correlation Matrix of Regression Coefficients

           CONSTANT       L1        L2        L3        L4
CONSTANT     1.000
L1          -0.777      1.000
L2           0.761     -0.454     1.000
L3          -0.305     -0.207    -0.769     1.000
L4           0.762     -0.975     0.556     0.070     1.000
L5          -0.270      0.626     0.176    -0.611    -0.586
L6          -0.569      0.770    -0.697     0.175    -0.842
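A sketch of how the VIFs and condition indices reported above might be computed (protein.csv is a hypothetical CSV export of the PROTEIN data in the Appendix, with columns PROTEIN and L1-L6):

# Variance inflation factors and condition indices for the PROTEIN predictors.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

protein = pd.read_csv("protein.csv")               # hypothetical export: PROTEIN, L1..L6
X = sm.add_constant(protein[["L1", "L2", "L3", "L4", "L5", "L6"]])

for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))   # VIF per predictor

Xs = X / np.sqrt((X ** 2).sum(axis=0))             # columns scaled to unit sum of squares
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)            # eigenvalues of the unit-scaled X'X
print(np.sqrt(eigvals.max() / eigvals))            # condition indices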

Other Multiple Regression Output Elements

Many regression programs provide a few more items in their output. They are listed below; since they are not of much use in diagnostics, they are not discussed further here.

Stepwise Regression: Variables are included or excluded in a stepwise fashion depending on their usefulness in the presence of those already included. This approach may help in variable selection tasks.

Prediction: Predicted values are given in the column titled ESTIMATE for the cases in the training sample when a residuals/data file is saved. The standard errors of these estimates are given in the column SEPRED, confidence intervals in the columns LCL and UCL, and prediction intervals in the columns LPL and UPL. Moreover, if a data file containing predictor values is invoked using the PREDICTION tab, predictions are given for those values.

Bootstrap: Regression coefficients can be subjected to a bootstrap analysis, yielding various types of bootstrap estimates, confidence intervals and p-values.

Information Criteria: The Akaike Information Criterion (AIC), corrected AIC (AICc), and Schwarz's Bayesian Information Criterion values are often given in the regression output. These criteria are useful for model comparison and model selection.
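A sketch of how predictions with confidence and prediction intervals, and the information criteria, might be obtained in statsmodels (hypothetical LINSEED file again; the new stem lengths below are made up for illustration):

# Predictions with confidence and prediction intervals, and information criteria.
import pandas as pd
import statsmodels.formula.api as smf

linseed = pd.read_csv("linseed.csv")               # hypothetical export: AREA, LENGTH
fit = smf.ols("AREA ~ LENGTH", data=linseed).fit()

new = pd.DataFrame({"LENGTH": [60, 90, 120]})      # illustrative new stem lengths
pred = fit.get_prediction(new).summary_frame(alpha=0.05)
# mean_ci_* correspond to confidence limits (LCL/UCL); obs_ci_* to prediction limits (LPL/UPL).
print(pred[["mean", "mean_se", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]])

print("AIC:", fit.aic, "BIC:", fit.bic)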


Remedies for Regression Problems

We shall quickly mention some of the remedies available for the problems encountered in the sections above.

• Non-Normality: transformations, especially the Box-Cox family; nonparametric regression (e.g., rank regression); robust regression.
• Heteroscedasticity: weighted least squares.
• Lack of Independence: time series regression.
• Outliers, high-leverage points and/or influential points: such points could be omitted (after careful consideration) and the regression re-fitted. Alternatively, suitable robust regression methods (such as Least Absolute Deviation, Least Median of Squares, Scale, M, Modified M, Minimum Covariance Determinant) could be attempted, which suitably downplay the role of these difficult data points.
• Nonlinearity: nonlinearity in terms of the predictors can be tackled within the framework of linear models and linear regression; for instance xi2 and log xi can be used as linear predictors. However, a model with nonlinearity in the parameters cannot be handled in the linear model or linear regression framework; more complicated algorithms are required for such regression models. In statistical parlance, a nonlinear model is a model that is nonlinear in its parameters.

References

Belsley, D.A., Kuh, E., and Welsch, R.E. (1980). Regression diagnostics. New York: John Wiley & Sons.

Cook, R.D. and Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Draper, N.R. and Smith, H. (1998). Applied regression analysis. New York: John Wiley & Sons.


Appendix

Title: Leaf Area and Stem Length of Linseed Plants
Data File Name: LINSEED
Reference: Clewer, A.G. and Scarisbrick, D.H. (2001). Practical statistics and experimental design for plant and crop science. New York: John Wiley & Sons. (Section 7.1.1, p. 64)
Study Description: The object of this exercise is to develop a (linear) formula to predict leaf area from stem length in linseed plants.
Number of Cases: 34
Variable Names and Description:
Area: Leaf area in cm2
Length: Stem length in cm

AREA LENGTH
 22   31
 23   36
 25   35
 27   36
 30   50
 30   49
 35   52
 38   56
 45   55
 50   68
 51   80
 52   76
 55   80
 59   78
 60   96
 60  100
 62   86
 65  106
 67  108
 68   96
 70   98
 75  102
 78   96
 79  100
 80  104
 81  110
 82  116
 84  120
 87  126
 88  125
 89  132
 92  138
105  139
107  142

Residual Statistics for Linseed Regression EST RES LEV COOK STU SEPRED LCL UCL LPL UPL AREA LEN 20.977 1.023 0.124 0.003 0.204 1.855 17.198 24.755 9.603 32.351 22 31 24.554 -1.554 0.108 0.006 -0.308 1.734 21.021 28.086 13.259 35.848 23 36 23.838 1.162 0.111 0.003 0.231 1.758 20.257 27.419 12.528 35.148 25 35 24.554 2.446 0.108 0.015 0.486 1.734 21.021 28.086 13.259 35.848 27 36 34.568 -4.568 0.072 0.032 -0.898 1.414 31.687 37.449 23.460 45.677 30 50 33.853 -3.853 0.074 0.023 -0.755 1.436 30.928 36.778 22.733 44.973 30 49 35.999 -0.999 0.068 0.001 -0.193 1.372 33.205 38.793 24.913 47.085 35 52 38.860 -0.860 0.060 0.001 -0.166 1.290 36.233 41.487 27.815 49.906 38 56 38.145 6.855 0.062 0.060 1.362 1.310 35.477 40.813 27.090 49.200 45 55 47.444 2.556 0.042 0.005 0.490 1.076 45.253 49.636 36.495 58.394 50 68 56.028 -5.028 0.032 0.015 -0.969 0.937 54.120 57.937 45.132 66.925 51 80 53.167 -1.167 0.034 0.001 -0.222 0.973 51.186 55.148 42.257 64.077 52 76 56.028 -1.028 0.032 0.001 -0.195 0.937 54.120 57.937 45.132 66.925 55 80 54.598 4.402 0.033 0.012 0.846 0.953 52.656 56.539 43.695 65.500 59 78 67.474 -7.474 0.031 0.033 -1.467 0.925 65.590 69.358 56.581 78.366 60 96 70.335 -10.3350.033 0.068 -2.099 0.955 68.389 72.281 59.432 81.239 60 100 60.320 1.680 0.030 0.002 0.319 0.907 58.473 62.168 49.434 71.207 62 86 74.627 -9.627 0.038 0.068 -1.942 1.022 72.545 76.710 63.699 85.556 65 106 76.058 -9.058 0.040 0.064 -1.817 1.050 73.919 78.197 65.119 86.997 67 108 67.474 0.526 0.031 0.000 0.100 0.925 65.590 69.358 56.581 78.366 68 96 68.905 1.095 0.032 0.001 0.208 0.939 66.993 70.816 58.007 79.802 70 98 71.766 3.234 0.034 0.007 0.619 0.975 69.780 73.752 60.855 82.677 75 102 67.474 10.526 0.031 0.066 2.141 0.925 65.590 69.358 56.581 78.366 78 96 70.335 8.665 0.033 0.048 1.724 0.955 68.389 72.281 59.432 81.239 79 100 73.197 6.803 0.036 0.032 1.331 0.997 71.165 75.228 62.278 84.116 80 104 77.489 3.511 0.042 0.010 0.675 1.079 75.290 79.687 66.537 88.440 81 110 81.781 0.219 0.050 0.000 0.042 1.180 79.377 84.184 70.786 92.775 82 116 84.642 -0.642 0.057 0.000 -0.124 1.255 82.086 87.198 73.613 95.671 84 120 88.934 -1.934 0.068 0.005 -0.375 1.377 86.130 91.738 77.845 100.023 87 126 88.219 -0.219 0.066 0.000 -0.042 1.356 85.457 90.980 77.141 99.297 88 125 93.226 -4.226 0.082 0.031 -0.833 1.508 90.155 96.297 82.067 104.385 89 132 97.518 -5.518 0.098 0.066 -1.107 1.645 94.167 100.86 86.279 108.758 92 138 98.233 6.767 0.100 0.102 1.373 1.669 94.835 101.63286.980109.487 105 139 100.3796.621 0.109 0.109 1.349 1.740 96.836 103.92389.081111.678 107 142 Title: Arsenic in Water and in Toenails Data File Name: ARSENIC Reference: Karagas, M.R., Morris, J.S., Weiss, J.E., Spate, V., Baskett, C., and Greenberg, E.R. (1965). Toenail samples as an indicator of drinking water arsenic exposure. Cancer Epidemiology, Biomarkers and Prevention, 5, 849-852. Study Description: This data file contains measurements of drinking water and toenail levels of arsenic, as well as related covariates, for 21 individuals with private wells in New Hampshire. The object of the study is to relate arsenic levels in drinking water to arsenic levels in the toenails taking into consideration other factors such as age. Number of Cases: 21 Variable Names and Description: Age: Age (years) Gender: Gender of person ( m=Male, f=Female) Drinkuse: Household well used for drinking (1 <= ¼; 2=1/4; 3=1/2; 4=3/4; 5 >= 3/4) COOKUSE: Household well used for cooking(1 <= ¼; 2=1/4; 3=1/2; 4=3/4; 5 >=3/4) Arswater: Arsenic in water (ppm) Arsnails: Arsenic in toenails (ppm)


Source: Therese Stukel, Dartmouth Hitchcock Medical Center, One Medical Center Dr., Lebanon, NH 03756. e-mail: [email protected]

AGE GENDER DRINKUSE COOKUSE ARSWATER ARSNAILS 44 "f" 5 5 0.001 0.119 45 "f" 4 5 0.000 0.118 44 "m" 5 5 0.000 0.099 66 "f" 3 5 0.001 0.118 37 "m" 2 5 0.000 0.277 45 “f" 5 5 0.000 0.358 47 “m" 5 5 0.000 0.080 38 "f" 4 5 0.001 0.158 41 “f" 3 2 0.000 0.310 49 "f" 4 5 0.000 0.105 72 "f" 5 5 0.000 0.073 45 "f" 1 5 0.046 0.832 53 "m" 5 5 0.019 0.517 86 “f" 5 5 0.137 2.252 8 "f" 5 5 0.021 0.851 32 "f" 5 5 0.018 0.269 44 "m" 5 5 0.076 0.433 63 "f" 5 5 0.000 0.141 42 "m" 5 5 0.017 0.275 62 "m" 5 5 0.000 0.135 36 "m" 5 5 0.004 0.175

Residual Statistics for Arsenic Regression EST RES LEV COOK STU SEPRE LCL UCL LPL UPL 0.167 -0.048 0.058 0.001 -0.214 0.055 0.052 0.283 -0.323 0.658 0.158 -0.040 0.059 0.001 -0.175 0.055 0.042 0.274 -0.333 0.648 0.156 -0.057 0.060 0.001 -0.252 0.055 0.040 0.272 -0.335 0.647 0.149 -0.031 0.154 0.001 -0.146 0.089 -0.038 0.336 -0.363 0.661 0.163 0.114 0.073 0.007 0.512 0.061 0.034 0.292 -0.331 0.657 0.155 0.203 0.059 0.018 0.919 0.055 0.039 0.271 -0.336 0.646 0.155 -0.075 0.060 0.002 -0.331 0.056 0.038 0.272 -0.336 0.645 0.171 -0.013 0.069 0.000 -0.057 0.060 0.046 0.296 -0.322 0.664 0.164 0.146 0.062 0.010 0.654 0.057 0.045 0.283 -0.327 0.655 0.151 -0.046 0.063 0.001 -0.204 0.057 0.031 0.271 -0.341 0.642 0.128 -0.055 0.221 0.007 -0.269 0.107 -0.096 0.352 -0.398 0.655 0.760 0.072 0.099 0.004 0.327 0.071 0.610 0.910 0.260 1.260 0.402 0.115 0.053 0.005 0.510 0.052 0.292 0.512 -0.087 0.891 1.916 0.336 0.76810.430 4.338 0.199 1.498 2.334 1.282 2.550 0.473 0.378 0.408 1.075 2.447 0.145 0.168 0.777 -0.093 1.038 0.398 -0.129 0.102 0.014 -0.588 0.073 0.245 0.550 -0.103 0.898 1.160 -0.727 0.248 1.499 -7.317 0.113 0.923 1.398 0.628 1.693 0.137 0.004 0.131 0.000 0.018 0.082 -0.036 0.310 -0.370 0.644


0.375 -0.100 0.055 0.004 -0.442 0.053 0.264 0.486 -0.115 0.864 0.140 -0.005 0.123 0.000 -0.022 0.080 -0.028 0.307 -0.366 0.645 0.218 -0.043 0.074 0.001 -0.190 0.062 0.088 0.348 -0.276 0.712 ========================================================= Title: Protein Content of Ground Wheat by NIR Instrument Data File Name: PROTEIN Reference: Hand, D.J., Daly, F., Lunn, A.D., McConway, K. J., and Ostrowski, E. (1993): A handbook of small data sets. London: Chapman & Hall. Study Description: An experiment was conducted to calibrate a near infrared reflectance instrument for the measurement of protein content of ground wheat samples. The data file gives the protein content measurements of ground wheat samples made by the standard Kjeldahl method, and the six values L1-L6 are measurements of the reflectance of NIR radiation by the wheat samples at six different wavelengths in the range 1680-2310 nm. Number of Cases: 24 Variable Names and Description: Sample: Sample number Protein: Protein (%) L1: Wavelength 1680nm L2: Wavelength 1785nm L3: Wavelength 1890nm L4: Wavelength 1995nm L5: Wavelength 2100nm L6: Wavelength 2205nm =============================================== SAMPLE PRO- L1 L2 L3 L4 L5 L6 NUMBER TEIN 1 9.23 468 123 246 374 386 -11 2 8.01 458 112 236 368 383 -15 3 10.95 457 118 240 359 353 -16 4 11.67 450 115 236 352 340 -15 5 10.41 464 119 243 366 371 -16 6 9.51 499 147 273 404 433 5 7 8.67 463 119 242 370 377 -12 8 7.75 462 115 238 370 353 -13 9 8.05 488 134 258 393 377 -5 10 11.39 483 141 264 384 398 -2 11 9.95 463 120 243 367 378 -13 12 8.25 456 111 233 365 365 -15 13 10.57 512 161 288 415 443 12 14 10.23 518 167 293 421 450 19 15 11.87 552 197 324 448 467 32 16 8.09 497 146 271 407 451 11 17 12.55 592 229 360 484 524 51 18 8.38 501 150 274 406 407 11


19 9.64 483 137 260 385 374 -3 20 11.35 491 147 269 389 391 1 21 9.70 463 121 242 366 353 -13 22 10.75 507 159 285 410 445 13 23 10.75 474 132 255 376 383 -7 24 11.47 496 152 276 396 404 6 =============================================== Title: Hybrid Jowar crop on yield and biometrical characters The following data were collected through a pilot sample survey on Hybrid Jowar crop on yield and biometrical characters. The biometrical characters were average Plant Population (PP), average Plant Height (PH), average Number of Green Leaves (NGL) and Yield (kg/plot). (Data Courtesy: Dr Rajender Prasad, IASRI)

No. PP PH NGL Yield1 142.00 0.5250 8.20 2.4702 143.00 0.6400 9.50 4.7603 107.00 0.6600 9.30 3.3104 78.00 0.6600 7.50 1.9705 100.00 0.4600 5.90 1.3406 86.50 0.3450 6.40 1.1407 103.50 0.8600 6.40 1.5008 155.99 0.3300 7.50 2.0309 80.88 0.2850 8.40 2.540

10 109.77 0.5900 10.60 4.90011 61.77 0.2650 8.30 2.91012 79.11 0.6600 11.60 2.76013 155.99 0.4200 8.10 0.59014 61.81 0.3400 9.40 0.84015 74.50 0.6300 8.40 3.87016 97.00 0.7050 7.20 4.47017 93.14 0.6800 6.40 3.31018 37.43 0.6650 8.40 1.57019 36.44 0.2750 7.40 0.53020 51.00 0.2800 7.40 1.15021 104.00 0.2800 9.80 1.08022 49.00 0.4900 4.80 1.83023 54.66 0.3850 5.50 0.76024 55.55 0.2650 5.00 0.43025 88.44 0.9800 5.00 4.08026 99.55 0.6450 9.60 2.83027 63.99 0.6350 5.60 2.57028 101.77 0.2900 8.20 7.42029 138.66 0.7200 9.90 2.62030 90.22 0.6300 8.40 2.00031 76.92 1.2500 7.30 1.99032 126.22 0.5800 6.90 1.36033 80.36 0.6050 6.80 0.680


34 150.23 1.1900 8.80 5.36035 56.50 0.3550 9.70 2.12036 136.00 0.5900 10.20 4.16037 144.50 0.6100 9.80 3.12038 157.33 0.6050 8.80 2.07039 91.99 0.3800 7.70 1.17040 121.50 0.5500 7.70 3.62041 64.50 0.3200 5.70 0.67042 116.00 0.4550 6.80 3.05043 77.50 0.7200 11.80 1.70044 70.43 0.6250 10.00 1.55045 133.77 0.5350 9.30 3.28046 89.99 0.4900 9.80 2.690

Basic regression output is given below. Participants may like to interpret these statistics.

Eigenvalues of Unit Scaled X'X

          1             2             3             4
3.808243889   0.102799568   0.067920477   0.021036067

Condition Indices

          1             2             3              4
1.000000000   6.086487482   7.487934107   13.454888493

Variance Proportions

                    1             2             3             4
CONSTANT  0.002329878   0.010096602   0.085404696   0.902168824
PP        0.006206273   0.120414466   0.872806486   0.000572775
PH        0.008195857   0.930386837   0.013446312   0.047970994
NGL       0.002648915   0.037686637   0.119164283   0.840500165

Dependent Variable YIELD

N 46

Multiple R 0.488979087

Squared Multiple R 0.239100548

Adjusted Squared Multiple R 0.184750587

Standard Error of Estimate 1.331856334

Regression Coefficients B = (X'X)^(-1) X'Y

Effect     Coefficient     Standard Error   Std. Coefficient   Tolerance         t             p-value
CONSTANT  -0.848019313     1.054337278      0.000000000        .             -0.804315024   0.425744487
PP         0.011995306     0.006284195      0.275081249        0.872326985    1.908805382   0.063136152
PH         1.660605218     0.918813556      0.250622570        0.942140411    1.807336436   0.077876127
NGL        0.151389795     0.119406278      0.178091699        0.918183545    1.267854568   0.211833645

Confidence Interval for Regression Coefficients

                              95.0% Confidence Interval
Effect     Coefficient        Lower            Upper             VIF
CONSTANT  -0.848019313    -2.975758119     1.279719493           .
PP         0.011995306    -0.000686714     0.024677325       1.146359125
PH         1.660605218    -0.193635639     3.514846075       1.061412915
NGL        0.151389795    -0.089581834     0.392361424       1.089106863

Correlation Matrix of Regression Coefficients

            CONSTANT          PP             PH             NGL
CONSTANT   1.000000000
PP        -0.211620874   1.000000000
PH        -0.334472739  -0.224488708   1.000000000
NGL       -0.747915825  -0.273023537  -0.021823668   1.000000000

Analysis of Variance

Source          SS            df   Mean Squares    F-ratio       p-value
Regression   23.410859100      3   7.803619700    4.399277277   0.008850553
Residual     74.501334378     42   1.773841295

*** WARNING *** : Case 28 is an Outlier (Studentized Residual: 5.271341897)

Durbin-Watson D Statistic 1.675344220

First Order Autocorrelation 0.160485437

Information Criteria

AIC               1.627224245E+002
AIC (Corrected)   1.642224245E+002
Schwarz's BIC     1.718656315E+002

Residuals have been saved.
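For participants who want to reproduce these statistics, a sketch of fitting the model and the basic diagnostics in statsmodels (jowar.csv is a hypothetical CSV export of the data listed above, with columns PP, PH, NGL and YIELD):

# Fitting YIELD on PP, PH and NGL and reproducing the basic diagnostics.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

jowar = pd.read_csv("jowar.csv")                   # hypothetical export: PP, PH, NGL, YIELD
fit = smf.ols("YIELD ~ PP + PH + NGL", data=jowar).fit()

print(fit.summary())                               # coefficients, R-squared, ANOVA F, AIC/BIC
print("Durbin-Watson D:", durbin_watson(fit.resid))

student = fit.get_influence().resid_studentized_internal
print("Cases with |studentized residual| > 3:",
      [i + 1 for i, t in enumerate(student) if abs(t) > 3])   # case 28 is flagged in the output above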