2. Linear dependent variables

2.1 The basic idea underlying linear regression
2.2 Single variable OLS
2.3 Correctly interpreting the coefficients
2.4 Examining the residuals
2.5 Multiple regression
2.6 Heteroskedasticity
2.7 Correlated errors
2.8 Multicollinearity
2.9 Outlying observations
2.10 Median regression
2.11 “Looping”

Upload: reynold-davidson

Post on 18-Dec-2015


Page 1

2. Linear dependent variables

2.1 The basic idea underlying linear regression
2.2 Single variable OLS
2.3 Correctly interpreting the coefficients
2.4 Examining the residuals
2.5 Multiple regression
2.6 Heteroskedasticity
2.7 Correlated errors
2.8 Multicollinearity
2.9 Outlying observations
2.10 Median regression
2.11 “Looping”

Page 2

2.1 The basic idea underlying linear regression

A simple linear regression aims to characterize the relation between a dependent variable and one independent variable using a straight line.

You have already seen how to fit a line between two variables using the scatter command.

Linear regression does the same thing but it can be extended to include multiple independent variables.

Page 3

2.1 The basic idea

For example, you predict that larger companies usually pay higher fees. You can formalize the effect of company size on predicted fees using a simple equation:

Predicted fees = a0 + a1 × Size

The parameter a0 represents what fees are expected to be in the case that Size = 0.

The parameter a1 captures the impact of an increase in Size on expected fees.

Page 4

2.1 The basic idea

The parameters a0 and a1 are assumed to be the same for all observations and they are called “regression coefficients”.

You may argue that company size is not the only variable that affects audit fees. For example, the complexity of the audit engagement, or the size of the audit firm may also matter.

If you do not know all the factors that influence fees, the predicted fee that you calculate from the above equation will differ from the actual fee.

Page 5

2.1 The basic idea

The deviation between the predicted fee and the actual fee is called the “residual”. In general, you might represent the relation between actual fees and predicted fees in the following way:

Actual fees = Predicted fees + e

where e represents the residual term (i.e., the difference between actual and predicted fees)

Page 6

2.1 The basic idea

Putting the two together we can express actual fees using the following equation:

Fees = a0 + a1 × Size + e

The goal of regression analysis is to estimate the parameters a0 and a1.

Page 7

2.1 The basic idea

One of the simplest techniques to estimate the coefficients is known as ordinary least squares (OLS).

The objective of OLS is to make the difference between the predicted and actual values as small as possible.

In other words, the goal is to minimize the magnitude of the residuals: OLS chooses a0 and a1 to minimize the sum of the squared residuals.

Page 8

2.1 The basic idea

Go to MySite. Download “ols.dta” to your hard drive and open it in STATA (use "J:\phd\ols.dta", clear). Examine the graphical relation between the two variables: twoway (scatter y x) (lfit y x)

Page 9

2.1 The basic idea

This line is fitted by minimizing the sum of the squared differences between the observed and predicted values of y (known as the residual sum of squares, RSS).

The main assumptions required to obtain these coefficients are that:

The relation between y and x is linear
The x variable is uncorrelated with the residuals (i.e., x is exogenous)
The residuals have a mean value of zero

Page 10

2.2 Single variable OLS (regress)

Instead of using the lfit command with the graph, we can instead use the regress command: regress y x

The first variable (y) is the dependent variable while the second (x) is the independent variable.

Page 11

2.2 Single variable OLS (regress)

This gives the following output:

Page 12

2.2 Single variable OLS (regress)

The coefficient estimates are 3.000 for the a0 parameter and 0.500 for the a1 parameter.

We can use these to predict the values of Y for any given value of X. For example, when X = 5 we predict that Y will be: display 3.000091+0.5000909*5

Page 13

2.2 Single variable OLS (_b[])

Alternatively, we do not need to type the coefficient estimates because STATA will remember them for us. They are stored by STATA using the name _b[varname], where varname is replaced with the name of the independent variable or the constant (_cons): display _b[_cons]+_b[x]*5

Page 14

2.2 Single variable OLS

Note that the predicted value of y when x equals 5 differs from the actual value: list y if x==5

The actual value is 5.68 compared to the predicted value of 5.50. The difference for this observation is the “residual” error that arises because x is not a perfect predictor of y.

Page 15

2.2 Single variable OLS

If we want to compute the predicted value of y for each value of x in our dataset, we can use the saved coefficients: gen y_hat=_b[_cons]+_b[x]*x

The estimated residuals are the difference between the observed y values and the predicted y values:
gen y_res=y-y_hat
list x y_hat y y_res

Page 16

2.2 Single variable OLS (predict)

A quicker way to do this would be to use the predict command after regress:
predict yhat
predict yres, resid

Checking that this gives the same answer: list yhat y_hat yres y_res

You should also note that the values of x, yhat and yres correspond with those found on the scatter graph:
sort x
list x y y_hat y_res

Page 17

2.2 Single variable OLS

Page 18

2.2 Single variable OLS2.2 Single variable OLS Note that by construction, there is zero correlation between the x Note that by construction, there is zero correlation between the x

variable and the residuals variable and the residuals twoway (scatter y_res x) (lfit y_res x)twoway (scatter y_res x) (lfit y_res x)

Page 19

2.2 Single variable OLS

Standard errors

Typically our data comprises a sample that is taken from a larger population.

The coefficients are only estimates of the true a0 and a1 values that describe the entire population.

If we obtained a second random sample from the same population, we would obtain different coefficient estimates for a0 and a1.

Page 20

2.2 Single variable OLS2.2 Single variable OLS We therefore need a way to describe the We therefore need a way to describe the

variability that would obtain if we were to apply variability that would obtain if we were to apply OLS to many different samplesOLS to many different samples

Equivalently, we want a measure of how Equivalently, we want a measure of how “precisely” our coefficients are estimated“precisely” our coefficients are estimated

The solution is to calculate “standard errors”, The solution is to calculate “standard errors”, which are simply the sample standard deviations which are simply the sample standard deviations associated with the estimated coefficientsassociated with the estimated coefficients

Standard errors (SEs) allow us to perform Standard errors (SEs) allow us to perform statistical tests, e.g., is our estimate of astatistical tests, e.g., is our estimate of a11 significantly greater than zero?significantly greater than zero?

Page 21

2.2 Single variable OLS

The techniques for estimating standard errors are based on additional OLS assumptions:

Homoscedasticity (i.e., the residuals have a constant variance)
Non-correlation (i.e., the residuals are not correlated with each other)
Normality (i.e., the residuals are normally distributed)

Page 22

2.2 Single variable OLS

The t-statistic is obtained by dividing the coefficient estimate by the standard error.

Page 23

2.2 Single variable OLS

The p-values are from the t-distribution and they tell you how likely it is that you would have observed the estimated coefficient under the assumption that the “true” coefficient in the population is zero.

The p-value of 0.002 tells you that it is very unlikely (prob = 0.2%) that the true coefficient on x is zero.

The confidence intervals mean you can be 95% confident that the true coefficient of x lies between 0.233 and 0.767.

Page 24

2.2 Single variable OLS

To explain this we need some notation:

Σ(y − ȳ)² captures the variation in y around its mean
Σ(y − ŷ)² captures the variation that is not explained by x
Σ(ŷ − ȳ)² captures the variation that is explained by x

Page 25

2.2 Single variable OLS

The total sum of squares (TSS) = 41.27
The explained sum of squares (ESS) = 27.51
The residual sum of squares (RSS) = 13.76

Note that TSS = ESS + RSS.
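The decomposition can be checked in Stata, either by hand or from the results that regress stores (a small sketch, assuming regress y x was the last command run):

```stata
* ESS + RSS should reproduce the TSS of 41.27:
display 27.51 + 13.76

* -regress- stores the model (explained) and residual sums of squares
* in e(mss) and e(rss), so the same check without retyping numbers is:
display e(mss) + e(rss)
```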

Page 26

2.2 Single variable OLS

The column labeled df contains the number of “degrees of freedom”.

For the ESS, df = k-1 where k = number of regression coefficients (df = 2 - 1)
For the RSS, df = n-k where n = number of observations (= 11 - 2)
For the TSS, df = n-1 (= 11 - 1)

The last column (MS) reports the ESS, RSS and TSS divided by their respective degrees of freedom.

Page 27

2.2 Single variable OLS

The first number simply tells us how many observations are used to estimate the model.

The other statistics here tell you how “well” the model explains the variation in Y.

Page 28

2.2 Single variable OLS

The R-squared = ESS / TSS (= 27.51 / 41.27 = 0.666). So x explains 66.6% of the variation in y.

Unfortunately, many researchers in accounting (and other fields) evaluate the quality of a model by looking only at the R-squared. This is not only invalid, it is also very dangerous (I will explain why later).
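As a quick sanity check, the R-squared arithmetic can be reproduced with display and compared with the stored result (a sketch, assuming regress y x was the last estimation; e(r2) is where regress saves the R-squared):

```stata
* R-squared = ESS / TSS
display 27.51 / 41.27

* the same value is stored by -regress-:
display e(r2)
```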

Page 29

2.2 Single variable OLS

One problem with the R-squared is that it will always increase even when an independent variable is added that has very little explanatory power.

Adding another variable is not always a good idea as you lose one degree of freedom for each additional coefficient that needs to be estimated. Adding insignificant variables can be especially inefficient if you are working with a small sample size.

The adjusted R-squared corrects for this by accounting for the number of model parameters, k, that need to be estimated:

Adj R-squared = 1 - (1-R²)(n-1)/(n-k) = 1 - (1-0.666)(10)/9 = 0.629

In fact the adjusted R-squared can even take on negative values. For example, suppose that y and x are uncorrelated, in which case the unadjusted R-squared is zero:

Adj R-squared = 1 - (n-1)/(n-2) = (n-2-n+1)/(n-2) = -1/(n-2)
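The adjusted R-squared calculation can likewise be verified with display and compared with the value regress stores in e(r2_a) (a sketch, assuming n = 11 observations and k = 2 estimated parameters, as in the example above):

```stata
* Adjusted R-squared = 1 - (1 - R2)(n-1)/(n-k)
display 1 - (1 - 0.666)*(11 - 1)/(11 - 2)

* stored by -regress- after estimation:
display e(r2_a)
```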

Page 30

2.2 Single variable OLS2.2 Single variable OLS You might think that another way to measure the fit of the model is You might think that another way to measure the fit of the model is

to add up the residuals. However, by definition the residuals will to add up the residuals. However, by definition the residuals will sum to zero. sum to zero.

An alternative is to square the residuals, add them up (giving the An alternative is to square the residuals, add them up (giving the RSS) and then take the square root.RSS) and then take the square root.

Root MSE = square root of RSS/n-kRoot MSE = square root of RSS/n-k = [13.76 / (11-2)]= [13.76 / (11-2)]0.50.5 = 1.236 = 1.236 One way to interpret the root MSE is that it shows how far away on One way to interpret the root MSE is that it shows how far away on

average the model is from explaining y average the model is from explaining y The F-statistic = (ESS/k-1)/(RSS/n-k)The F-statistic = (ESS/k-1)/(RSS/n-k)

= (27.51 / 1)/(13.76/9) = 17.99= (27.51 / 1)/(13.76/9) = 17.99 the F statistic is used to test whether the R-squared is significantly the F statistic is used to test whether the R-squared is significantly

greater than zero (i.e., are the independent variables jointly significant?)greater than zero (i.e., are the independent variables jointly significant?) Prob > F gives the probability that the R-squared we calculated will be Prob > F gives the probability that the R-squared we calculated will be

observed if the true R-squared in the population is actually equal to zeroobserved if the true R-squared in the population is actually equal to zero This F test is used to test the This F test is used to test the overalloverall statistical significance of the statistical significance of the

regression modelregression model
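Both statistics can be reproduced from the sums of squares and checked against the stored results e(rmse) and e(F) (a sketch, assuming the regression of y on x above, with n = 11 and k = 2):

```stata
* Root MSE = sqrt(RSS/(n-k))
display sqrt(13.76/(11 - 2))

* F = (ESS/(k-1)) / (RSS/(n-k))
display (27.51/1) / (13.76/(11 - 2))

* the corresponding stored results after -regress-:
display e(rmse)
display e(F)
```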

Page 31

Class exercise 2a

Open your Fees.dta file and run the following two regressions:
audit fees on total assets
the log of audit fees on the log of total assets

What does the output of your regression mean?

Which model appears to have the better “fit”?

Page 32

2.3 Correctly interpreting the coefficients

So far we have considered the case where the independent variable is continuous.

Interpretation of results is even more straightforward when the independent variable is a dummy.

reg auditfees big6
ttest auditfees, by(big6)

Page 33

2.3 Correctly interpreting the coefficients

Suppose we wish to test whether the Big 6 fee premium is significantly different between listed and non-listed companies.

Page 34

2.3 Correctly interpreting the coefficients

gen listed=0
replace listed=1 if companytype==2 | companytype==3 | companytype==5
reg auditfees big6 if listed==0
ttest auditfees if listed==0, by(big6)
reg auditfees big6 if listed==1
ttest auditfees if listed==1, by(big6)
gen listed_big6=listed*big6
reg auditfees big6 listed listed_big6
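In the interacted regression, the coefficient on big6 is the premium for non-listed companies and the coefficient on listed_big6 is the difference in that premium for listed companies. One way to recover the total premium for listed companies, with a standard error, is Stata's lincom command (a sketch, assuming the interacted regression above was just run):

```stata
* total Big 6 premium for listed companies = big6 + listed_big6
lincom big6 + listed_big6

* the t-test on listed_big6 itself tells you whether the premium
* differs significantly between listed and non-listed companies
```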

Page 35

2.3 Correctly interpreting the coefficients

Some studies report the “economic” significance of the estimated coefficients as well as the statistical significance.

Economic significance refers to the magnitude of the impact of x on y.

There is no single way to evaluate “economic significance” but many studies describe the change in the predicted value of y as x increases from the 25th percentile to the 75th (or as x changes by one standard deviation around its mean).

Page 36

2.3 Correctly interpreting the coefficients

For example, we can calculate the expected change in audit fees as company size increases from the 25th to 75th percentiles:

reg auditfees totalassets
sum totalassets if auditfees<., detail
gen fees_low=_b[_cons]+_b[totalassets]*r(p25)
gen fees_high=_b[_cons]+_b[totalassets]*r(p75)
sum fees_low fees_high

Page 37

Class exercise 2b

Estimate the audit fee model in logs rather than in absolute values.

Calculate the expected change in audit fees as company size increases from the 25th to 75th percentiles.

Compare your results for economic significance to those we obtained when the fee model was estimated using the absolute values of fees and assets.

Hint: you will need to take the exponential of the predicted log of fees in order to make this comparison.

Page 38

2.3 Correctly interpreting the coefficients

When evaluating the economic significance of a dummy variable coefficient, we usually do so using the values zero and one rather than percentiles.

For example:
reg lnaf big6
gen fees_nb6=exp(_b[_cons])
gen fees_b6=exp(_b[_cons]+_b[big6])
sum fees_nb6 fees_b6

Page 39

2.3 Correctly interpreting the coefficients

Suppose we believe that the impact of a Big 6 audit on fees depends upon the size of the company.

Usually, we would quantify this impact using a range of values for lnta (e.g., as lnta increases from the 25th to the 75th percentile).


2.3 Correctly interpreting the coefficients

For example:
  gen big6_lnta=big6*lnta
  reg lnaf big6 lnta big6_lnta
  sum lnta if lnaf<. & big6<., detail
  gen big6_low=_b[big6]+_b[big6_lnta]*r(p25)
  gen big6_high=_b[big6]+_b[big6_lnta]*r(p75)
  gen big6_mean=_b[big6]+_b[big6_lnta]*r(mean)
  sum big6_low big6_high big6_mean


It is amazing how many studies give a misleading interpretation of the coefficients when using interaction terms. For example, Blackwell et al.


Class questions:
  Theoretically, how should auditing affect the interest rate that the company has to pay?
  Empirically, how do we measure the impact of auditing on the interest rate using eq. (1)?


Class question: At what values of total assets ($000) is the effect of the Audit Dummy on the interest rate: negative, zero, positive?


Class questions:
  What is the mean value of total assets within their sample?
  How does auditing affect the interest rate for the average company in their sample?


Verify that the above claim is "true".
Suppose Blackwell et al. had reported the impact for a firm with $11m in assets and another firm with $15m in assets.
  How would this have changed the conclusions drawn?
  Do you think the paper would have been published if the authors had made this comparison?


2.4 Examining the residuals
Go to MySite.
Download "anscombe.dta" to your hard drive.
  use "J:\phd\anscombe.dta", clear
Run the following regressions:
  reg y1 x1
  reg y2 x2
  reg y3 x3
  reg y4 x4
Note that the output from these regressions is virtually identical:
  intercept = 3.0 (t-stat = 2.67)
  x coefficient = 0.5 (t-stat = 4.24)
  R-squared = 66%


Class exercise 2c
If you did not know about regression assumptions or regression diagnostics, you would probably stop your analysis at this point, concluding that you have a good fit for all four models.
In fact, only one of these four models is well specified.
  Draw scatter graphs for each of these four associations (e.g., twoway (scatter y1 x1) (lfit y1 x1)).
  Of the four models, which do you think is the well specified one?
  Draw scatter graphs for the residuals against the x variable for each of the four regressions – is there a pattern?
  Which of the OLS assumptions are violated in these four regressions?


2.4 Examining the residuals
Unfortunately, accounting researchers often judge whether a model is "well-specified" solely in terms of its explanatory power (i.e., the R-squared).
Many researchers fail to report other types of diagnostic tests:
  Is there significant heteroscedasticity?
  Is there any pattern to the residuals?
  Are there any problems with outliers?


2.4 Examining the residuals
For example, many audit fee studies claim that their models are well-specified because they have a high R².
Carson et al. (2003):


2.4 Examining the residuals
Gu (2007) points out that:
  econometricians consider R² values to be relatively unimportant (accounting researchers put far too much emphasis on the magnitude of the R²)
  regression R²s should not be compared across different samples
    in contrast, there is a large accounting literature that uses R²s to determine whether the value relevance of accounting information has changed over time


Using either eq. (1) or (2), we will obtain exactly the same coefficient estimates because the economic model is the same.
  If eq. (1) is well-specified, so also is eq. (2).
  If eq. (1) is mis-specified, so also is eq. (2).
However, the R² of eq. (1) will be very different from the R² of eq. (2).
It is easy to show that the same "economic" model can yield very different R² depending on how the variables are transformed:


Example:
  use "J:\phd\Fees.dta", clear
  gen lnaf=ln(auditfees)
  gen lnta=ln(totalassets)
  sort companyid yearend
  by companyid: gen lnaf_lag=lnaf[_n-1]
  egen miss=rmiss(lnaf lnta lnaf_lag)
  gen chlnaf=lnaf-lnaf_lag
  reg lnaf lnta lnaf_lag if miss==0
  reg chlnaf lnta lnaf_lag if miss==0
The lnta coefficients are exactly the same in the two models.
The lnaf_lag coefficient in eq. (2) equals the lnaf_lag coefficient in eq. (1) minus one.
The R² is much higher in eq. (1) than in eq. (2).
The high R² in eq. (1) does not imply that the model is well-specified.
The low R² in eq. (2) does not imply that the model is mis-specified.
Either both equations are well-specified or they are both mis-specified.
The R² tells us nothing about whether our hypothesis about the determinants of Y is correct.


2.4 Examining the residuals
Instead of relying only on the R², an examination of the residuals can help us identify whether the model is well specified. For example, compare the audit fee model that is not logged:
  reg auditfees totalassets
  predict res1, resid
  twoway (scatter res1 totalassets, msize(tiny)) (lfit res1 totalassets)
with the logged audit fee model:
  reg lnaf lnta
  predict res2, resid
  twoway (scatter res2 lnta, msize(tiny)) (lfit res2 lnta)
Notice that the residuals are more "spherical", displaying less of an obvious pattern, in the logged model.


2.4 Examining the residuals
The usual OLS t-statistics rely on an assumption that the residuals are normally distributed.
We can test this using a histogram of the residuals:
  hist res1
This does not give us what we need because there are severe outliers:
  sum res1, detail
  hist res1 if res1>-22 & res1<208, normal xlabel(-25(25)210)
  hist res2
  sum res2, detail
  hist res2 if res2>-2 & res2<1.8, normal xlabel(-2(0.5)2)
The residuals are much closer to the assumed normal distribution when the variables are measured in logs.


Class exercise 2d
Following Pong and Whittington (1994), estimate the raw value of audit fees as a function of raw assets and assets squared.
Examine the residuals.
Do you think this model is better specified than the one in logs?


2.5 Multiple regression
Researchers use "multiple regression" when they believe that Y is affected by multiple independent variables:
  Y = a0 + a1 X1 + a2 X2 + e
Why is it important to control for multiple factors that influence Y?


2.5 Multiple regression
Suppose the "true" model is:
  Y = a0 + a1 X1 + a2 X2 + e
where X1 and X2 are uncorrelated with the error, e.
Suppose the OLS model that we estimate is:
  Y = a0 + a1 X1 + u
where u = a2 X2 + e.
OLS imposes the assumption that X1 is uncorrelated with the residual term, u.
Since X1 is uncorrelated with e, the assumption that X1 is uncorrelated with u is equivalent to assuming that X1 is uncorrelated with X2.
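The omitted variable bias can be seen directly in a small simulation. This is only a sketch: the variables x1, x2, and y are invented for the illustration and are not part of the fees dataset.
  clear
  set seed 12345
  set obs 1000
  gen x2=rnormal()
  gen x1=0.5*x2+rnormal()        // x1 is correlated with the soon-to-be-omitted x2
  gen y=1+2*x1+3*x2+rnormal()    // the "true" model
  reg y x1 x2                    // the x1 coefficient is close to its true value of 2
  reg y x1                       // omitting x2 biases the x1 coefficient upwards
Change the 0.5 to 0 (so that x1 and x2 are uncorrelated) and the bias disappears even though x2 is still omitted.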


2.5 Multiple regression
If X1 is correlated with X2, the OLS estimate of a1 is biased.
The magnitude of the bias depends upon the strength of the correlation between X1 and X2.
Of course, we often do not know whether the model we estimate is the "true" model.
In other words, we are unsure whether there is an omitted variable (X2) that affects Y and that is correlated with our variable of interest (X1).


2.5 Multiple regression
We can judge whether or not there is likely to be a correlated omitted variable problem using:
  theory
  prior empirical studies


2.5 Multiple regression
Previously, when we were using simple regression with one independent variable, we checked whether there was a pattern between the residuals and the independent variable:
  lnaf = a0 + a1 lnta + res1
  twoway (scatter res1 lnta) (lfit res1 lnta)
When we are using multiple regression, we want to test whether there is a pattern between the residuals and the right hand side of the equation as a whole.
The right hand side of the equation "as a whole" is the same thing as the predicted value of the dependent variable.


2.5 Multiple regression
So we should examine whether there is a pattern between the residuals and the predicted values of the dependent variable.
For example, let's estimate a model where audit fees depend on company size, audit firm size, and whether the company is listed on a stock market:
  gen listed=0
  replace listed=1 if companytype==2 | companytype==3 | companytype==5
  reg lnaf lnta big6 listed
  predict lnaf_hat
  predict lnaf_res, resid
  twoway (scatter lnaf_res lnaf_hat) (lfit lnaf_res lnaf_hat)


2.5 Multiple regression (rvfplot)
In fact, those nice guys at STATA have given us a command which enables us to short-cut having to use the predict command for calculating the residuals and the fitted values:
  reg lnaf lnta big6 listed
  rvfplot
rvf stands for residuals versus fitted.


2.6 Heteroscedasticity (hettest)
The OLS techniques for estimating standard errors are based on an assumption that the variance of the errors is the same for all values of the independent variables (homoscedasticity).
In many cases, the homoscedasticity assumption is clearly violated. For example:
  reg auditfees nonauditfees totalassets big6 listed
  rvfplot
The homoscedasticity assumption can be tested using the hettest command after we run the regression:
  reg auditfees nonauditfees totalassets big6 listed
  hettest
Heteroscedasticity does not bias the coefficient estimates but it does bias the standard errors of the coefficients.
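This last point can be checked with a small simulation, where we know the true model and deliberately make the error variance depend on x. Again, this is only a sketch with invented variables, not part of the fees data.
  clear
  set seed 12345
  set obs 1000
  gen x=runiform()
  gen y=1+2*x+rnormal(0,1+3*x)   // the error variance increases with x
  reg y x                        // the slope estimate is still close to its true value of 2
  hettest                        // rejects the null of constant error variance
  reg y x, robust                // same coefficients, but corrected standard errors
Note that the coefficients are identical with and without the robust option; only the standard errors (and hence the t-statistics) change.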


2.6 Heteroscedasticity (robust)
Heteroscedasticity is often caused by using a dependent variable that is not symmetric:
  for example, the auditfees variable is highly skewed due to the fact that it has a lower bound of zero
  much of the heteroscedasticity can often be removed by transforming the dependent variable (e.g., use the log of audit fees instead of the raw values)
When you find that there is heteroscedasticity, you need to adjust the standard errors using the Huber/White/sandwich estimator.
In STATA it is easy to do this adjustment using the robust option:
  reg auditfees nonauditfees totalassets big6 listed, robust
Compare the adjusted and unadjusted results:
  reg auditfees nonauditfees totalassets big6 listed
What is different? What is the same?


Class exercise 2e
Estimate the audit fee model in logs rather than absolute values.
Using rvfplot, assess whether the variance of the residuals appears to be non-constant.
Using hettest, provide a formal test for heteroscedasticity.
Compare the coefficients and t-statistics when you estimate the standard errors with and without adjusting for heteroscedasticity.


2.7 Correlated errors
The OLS techniques for estimating standard errors are based on an assumption that the errors are not correlated.
This assumption is typically violated when we use repeated annual observations on the same companies.
The residuals of a given firm are correlated across years ("time-series dependence").


Time-series dependence
Time-series dependence is nearly always a problem when researchers use "panel data".
Panel data = data that are pooled for the same companies across time.
In panel data, there are likely to be unobserved company-specific characteristics that are relatively constant over time.

  Company   Year
  A         1996
  A         1997
  B         1996
  B         1997


Let's start with a simple regression model where the errors are assumed to be uncorrelated:
  Y_it = a0 + a1 X_it + e_it
We now relax the assumption of independent errors by assuming that the error term has an unobserved company-specific component that does not vary over time and an idiosyncratic component that is unique to each company-year observation:
  e_it = c_i + u_it
Similarly, we can assume that the X variable has a company-specific component that does not vary over time and an idiosyncratic component:
  X_it = d_i + v_it


Time-series dependence
In this case, the OLS standard errors tend to be biased downwards, and the magnitude of this bias is increasing in the number of years within the panel.
To understand the intuition, consider the extreme case where the residuals and independent variables are perfectly correlated across time.
In this case, each additional year provides no additional information and will have no effect on the true standard error.
However, under the incorrect assumption of time-series independence, it is assumed that each additional year provides additional observations, and the estimated standard errors will shrink accordingly and incorrectly.
This problem can be avoided by adjusting the standard errors for the clustering of yearly observations within a given company.


Time-series dependence
To understand all this, it is helpful to review the following example.
First, I estimate the model using just one observation for each company (in the year 1998):
  gen fye=date(yearend, "MDY")
  gen year=year(fye)
  drop if year!=1998
  sort companyid
  drop if companyid==companyid[_n-1]
  reg lnaf lnta big6 listed, robust


Time-series dependence
Now I create a dataset in which each observation is duplicated.
Each duplicated observation provides no additional information and will have no effect on the true standard error, but it will reduce the estimated standard error (i.e., the estimated standard error will be biased downwards):
  save "J:\phd\Fees98.dta", replace
  append using "J:\phd\Fees98.dta"
  reg lnaf lnta big6 listed, robust
What's happened to the coefficients and t-statistics?


Time-series dependence (robust cluster())
We can obtain correct standard errors in the duplicated dataset using the robust cluster() option, which adjusts the standard errors for clustering of observations (here they are duplicated) within each company:
  reg lnaf lnta big6 listed, robust cluster(companyid)
What's happened to the coefficients and t-statistics?


Time-series dependence
In reality the observations of a given company are not exactly the same from one year to the next (i.e., they are not exact duplicates).
However, the observations of a given company often do not change much from one year to the next.
For example, a company's size and the fees that it pays may not change much over time (i.e., there is a strong unobserved company-specific component to the variables).
Failing to account for this in panel data tends to overstate the magnitude of the t-statistics.


Time-series dependence
It is easy to demonstrate that the residuals of a given company tend to be very highly correlated over time
First, start again with the original data
• use "J:\phd\Fees.dta", clear
• gen fye=date(yearend, "MDY")
• gen year=year(fye)
• gen lnaf=ln(auditfees)
• gen lnta=ln(totalassets)
• save "J:\phd\Fees1.dta", replace
Estimate the fee model and obtain the residuals for each company-year observation
• reg lnaf lnta
• predict res, resid


Time-series dependence
Reshape the data so that we have each company as a row and there are separate variables for each yearly set of residuals
• keep companyid year res
• sort companyid year
• drop if companyid==companyid[_n-1] & year==year[_n-1]
• reshape wide res, i(companyid) j(year)
• browse
Examine the correlations between the residuals of a given company
• pwcorr res1998-res2002


Time-series dependence
We can easily control for this problem of time-series dependence using the robust cluster() option
• use "J:\phd\Fees1.dta", clear
• reg lnaf lnta, robust cluster(companyid)
Note that if we do not control for time-series dependence, the t-statistic is biased upwards even though we have controlled for the heteroscedasticity
• reg lnaf lnta, robust
If we do not control for heteroscedasticity, the upward bias would be even worse
• reg lnaf lnta
TOP TIP: Whenever you use panel data you should get into the habit of using the robust cluster() option, otherwise your "significant" results from pooled regressions may be spurious.


2.8 Multicollinearity
Perfect collinearity occurs if there is a perfect linear relation between multiple variables of the regression model.
For example, our dataset covers a sample period of five years (1998-2002). Suppose we create a dummy for each year and include all five year dummies in the fee regression.
• tabulate year, gen(year_)
• reg lnaf year_1 year_2 year_3 year_4 year_5
STATA excludes one of the year dummies when estimating the model – why is that?


2.8 Multicollinearity
The reason is that a linear combination of the year dummies equals the constant in the model
year_1 + year_2 + year_3 + year_4 + year_5 = 1, where 1 is a constant
The model can only be estimated if one of the year dummies or the constant is excluded
• reg lnaf year_1 year_2 year_3 year_4 year_5, nocons
STATA automatically throws away one of the year dummies so that the model can be estimated
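The collinearity can be verified mechanically. This Python sketch (with a hypothetical list of year values) builds one 0/1 dummy column per year and checks that the columns sum, observation by observation, to the constant column of ones:

```python
years = [1998, 1999, 2000, 2001, 2002, 1998, 2000, 2002]

# One 0/1 dummy column per distinct year
dummies = {y: [1 if yr == y else 0 for yr in years] for y in sorted(set(years))}

# The element-wise sum of all dummy columns equals the constant column of
# ones -- exactly the perfect collinearity that forces one term to be dropped
col_sum = [sum(col[i] for col in dummies.values()) for i in range(len(years))]
print(col_sum)
```

Because each observation falls in exactly one year, every row's dummies sum to 1, so the dummies and the constant cannot all be identified together.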


Class exercise 2f
Go to MySite
Download "international.dta" to your hard drive and open in STATA
You are interested in testing whether legal enforcement affects the importance of equity markets to the economy
Create dummy variables for each country in your dataset
Run a regression where importanceofequitymarket is the dependent variable and legalenforcement is the independent variable
How many country dummies can be included in your regression? Explain.
Are your results for the legalenforcement coefficient sensitive to your choice for which country dummies to exclude? Explain.


Golden rule:
You must not include dummies as control variables if their inclusion would "dummy out" all of the variation in your treatment variable.


2.8 Multicollinearity
We have seen that when there is perfect collinearity between independent variables, STATA will have to exclude one of them.
For example, a linear combination of all year dummies equals the constant in the model
year_1 + year_2 + year_3 + year_4 + year_5 = constant
so we cannot estimate a model that includes all the year dummies and the constant term
Even if the independent variables are not perfectly collinear, there can still be a problem if they are highly correlated


2.8 Multicollinearity
Multicollinearity can cause:
• the standard errors of the coefficients to be large (i.e., the coefficients are not estimated precisely)
• the coefficient estimates to be highly unstable
Example:
• use "J:\phd\Fees.dta", clear
• gen lnaf=ln(auditfees)
• gen lnta=ln(totalassets)
• gen lnta1=lnta
• reg lnaf lnta lnta1
Obviously, you must exclude one of these variables because lnta and lnta1 are perfectly correlated


2.8 Multicollinearity
Let's see what happens if we change the value of lnta1 for just one observation
• list lnta if _n==1
• replace lnta1=8 if _n==1
• reg lnaf lnta
• reg lnaf lnta1
• reg lnaf lnta lnta1
Notice that the lnta and lnta1 coefficients are highly significant when included separately but they are insignificant when included together
The reason of course is that, by construction, lnta and lnta1 are very highly correlated
• pwcorr lnta lnta1, sig


2.8 Multicollinearity
As another example, we can see that the coefficients can "flip" signs as a result of high collinearity
• sort lnaf lnta
• replace lnta1=10 if _n<=100
• reg lnaf lnta
• reg lnaf lnta1
• reg lnaf lnta lnta1
• pwcorr lnta lnta1, sig


2.8 Multicollinearity (vif)
Variance-inflation factors (VIF) can be used to assess whether multicollinearity is a problem for a particular independent variable
The VIF takes account of the variable's correlations with all other independent variables on the right hand side
The VIF shows the increase in the variance of the coefficient estimate that is attributable to the variable's correlations with other independent variables in the model
• reg lnaf lnta big6 lnta1
• vif
• reg lnaf lnta big6
• vif
Multicollinearity is generally regarded as high (very high) if the VIF is greater than 10 (20)


2.9 Outlying observations
We have already seen that outlying observations heavily influence the results of OLS models


2.9 Outlying observations
In simple regression (with just one independent variable), it is easy to spot outliers from a scatterplot of Y on X
• For example, a company is an outlier if it is very small in terms of size and it pays an audit fee that is very high
In multiple regression (where there are multiple X variables), some observations may be "outliers" even though they do not show up on the scatterplot
Moreover, observations that show up as outliers on the scatterplot might actually be normal once we control for other factors in the multiple regression
• For example, the small company may pay a high audit fee because other characteristics of that company make it a complex audit.


2.9 Outlying observations (cooksd)
We can calculate the influence of each observation on the estimated coefficients using Cook's D
Values of Cook's D that are higher than 4/N are considered large, where N is the number of observations used in the regression
• reg lnaf lnta big6
• predict cook, cooksd
• sum cook, detail
• gen max=4/e(N)
e(N) is the number of observations in the most recent regression model (the estimation sample size is stored by STATA as an internal result)
• count if cook>max & cook<.
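For a simple regression, Cook's D can be computed by hand from the residuals and leverages. This Python sketch uses synthetic data (not the fees dataset): it plants one gross outlier and confirms that only that observation exceeds the 4/N cutoff:

```python
def cooks_distance(x, y):
    """Cook's D for each observation in a simple OLS of y on x (sketch)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    p = 2                                        # parameters: intercept + slope
    s2 = sum(e ** 2 for e in resid) / (n - p)    # mean squared error
    d = []
    for xi, ei in zip(x, resid):
        h = 1 / n + (xi - xbar) ** 2 / sxx       # leverage of observation i
        d.append(ei ** 2 * h / (p * s2 * (1 - h) ** 2))
    return d

x = list(range(1, 11))
y = [2 * xi for xi in x]
y[-1] = 100                                      # plant one gross outlier
d = cooks_distance(x, y)
flagged = [i for i, di in enumerate(d) if di > 4 / len(x)]
print(flagged)
```

Only the planted outlier crosses the 4/N threshold, which is the same screening logic as `count if cook>max & cook<.` above.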


2.9 Outlying observations (cooksd)
We can discard the observations whose values of Cook's D exceed the cutoff and re-estimate the model
• reg lnaf lnta big6 if cook<=max
For example, Ke and Petroni (2004, p.906) explain that they use Cook's D to exclude outliers and the standard errors are adjusted for heteroscedasticity and time-series dependence (they are using a panel dataset):


2.9 Outlying observations (winsor)
Rather than drop outlying observations, some researchers choose to "winsorize" the data
Winsorizing replaces the extreme values of a variable with the values at certain percentiles (e.g., the top and bottom 1%)
You can winsorize variables in STATA using the winsor command
• winsor lnaf, gen(wlnaf) p(0.01)
• winsor lnta, gen(wlnta) p(0.01)
• sum lnaf wlnaf lnta wlnta, detail
• reg wlnaf wlnta big6
A disadvantage with "winsorizing" is that the researcher is assuming that outliers lie only at the extremes of the variable's distribution.
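The mechanics are easy to sketch in Python. This is a simplified index-based percentile rule for illustration, not the exact algorithm of Stata's winsor command:

```python
def winsorize(values, p):
    """Clamp values below the p-th / above the (1-p)-th percentile to those
    percentile values (simplified index-based percentiles, for illustration)."""
    s = sorted(values)
    lo = s[round(p * (len(s) - 1))]
    hi = s[round((1 - p) * (len(s) - 1))]
    return [min(max(v, lo), hi) for v in values]

data = list(range(100)) + [10_000]   # one extreme value at the top
w = winsorize(data, 0.05)
print(min(w), max(w), len(w))
```

Unlike dropping outliers, the sample size is unchanged: the extreme value is pulled in to the 95th-percentile value rather than deleted.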


2.10 Median regression
Median regression is quite similar to OLS but it can be more reliable, especially when we have problems of outlying observations
Recall that in OLS, the coefficient estimates are chosen to minimize the sum of the squared residuals


2.10 Median regression
In median regression, the coefficient estimates are chosen to minimize the sum of the absolute residuals
Squaring the residuals in OLS means that large residuals are more heavily weighted than small residuals.
Because the residuals are not squared in median regression, the coefficient estimates are less sensitive to outliers
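The intercept-only case makes the contrast concrete: the squared-error loss is minimized by the sample mean and the absolute-error loss by the sample median. In this Python sketch (made-up data), a single outlier drags the OLS-style fit far away while the median fit barely moves:

```python
import statistics

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 500]  # one extreme outlier

ols_fit = statistics.mean(data)    # minimizes the sum of squared residuals
lad_fit = statistics.median(data)  # minimizes the sum of absolute residuals
print(ols_fit, lad_fit)
```

The mean is pulled close to the outlier's order of magnitude, while the median stays with the bulk of the data — the same robustness that carries over to median regression with regressors.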


2.10 Median regression
Median regression takes its name from its predicted values, which are estimates of the conditional median of the dependent variable.
In OLS, the predicted values are estimates of the conditional mean of the dependent variable.
The predicted values of both regression techniques therefore measure the central tendency (i.e., mean or median) of the dependent variable.


2.10 Median regression
STATA treats median regression as a special case of quantile regression.
In quantile regression, the coefficients are estimated so that the sum of the weighted absolute residuals is minimized, where the weights are w_i


2.10 Median regression (qreg)
Weights can be different for positive and negative residuals. If positive and negative residuals are weighted equally, you get a median regression. If positive residuals are weighted by the factor 1.5 and negative residuals are weighted by the factor 0.5, you get a "3rd quartile regression", etc.
In STATA you perform quantile regressions using the qreg command
• qreg lnaf lnta big6
• reg lnaf lnta big6
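This weighting scheme is the "check" (pinball) loss of quantile regression. A Python sketch with hypothetical data, for the intercept-only case: at q = 0.5 the positive and negative residuals get equal weight, so the loss is proportional to the sum of absolute residuals and is minimized at the middle of the sample:

```python
def pinball_loss(c, data, q):
    """Sum of quantile-weighted absolute residuals at candidate fit c:
    positive residuals get weight q, negative residuals weight 1 - q."""
    return sum(q * (y - c) if y >= c else (1 - q) * (c - y) for y in data)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# At q = 0.5 the loss is minimized at the median of the sample
# (here, anywhere between 5 and 6 achieves the minimum)
losses = {c: pinball_loss(c, data, 0.5) for c in data}
best = min(losses, key=losses.get)
print(best, losses[best])
```

Raising q above 0.5 up-weights positive residuals, pushing the minimizer toward higher quantiles — the same idea behind a "3rd quartile regression".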


Class exercise 2g
Open the "anscombe.dta" file
Do a scatterplot of y3 and x3
Do an OLS regression of y3 on x3 for the full sample.
Calculate Cook's D to test for the presence of outliers.
Do an OLS regression of y3 on x3 after dropping any outliers.
Do a median regression of y3 on x3 for the full sample.


2.10 Median regression
Basu and Markov (2004) compare the results of OLS and median regressions to determine whether analysts who issue earnings forecasts attempt to minimize:
• the sum of squared forecast errors (OLS), or
• the sum of absolute forecast errors (median)



The LAD estimator is simply the median regression command that we saw earlier (qreg)


Basu and Markov (2004) conclude that analysts' forecasts may accurately reflect their rational expectations
Their study is a good example of how we can make an important contribution to the literature if we use an estimation technique that is not widely used by accounting researchers


2.11 "Looping"
Looping can be very useful when we want to carry out the same operations many times
Looping significantly reduces the length of our do files because it means we do not have to state the same commands many times
When software designers use the word "programming" they mean they are creating a new command
Usually we do not need new commands because what we need has already been written for us in STATA
However, programming is necessary if we want to use "looping"


2.11 "Looping" (program, forvalues)
Example:

program ten
  forvalues i = 1(1)10 {
    display `i'
  }
end

To run this program we simply type ten


2.11 "Looping" (program, forvalues)
What's happening?
• program ten : we are telling STATA that the name of our program is "ten" and that we are starting to write a program
• end : we are telling STATA that we have finished writing the program
• { } : everything inside these brackets is part of a loop
• forvalues i = : the program will perform the commands inside the brackets for each value of i (i is called a "local macro")
• 1(1)10 : i goes from one to ten, increasing by the value one every time
• display `i' : this is the command inside the brackets and STATA will execute this command for each value of i from one to ten. Note that ` is at the top left of your keyboard whereas ' is next to the Enter key


2.11 "Looping" (program, forvalues)
The macro i has single quotes around it. These quotes tell Stata to replace the macro with the value that it holds before executing the command. So the first time through the loop, i holds the value of 1. Stata first replaces `i' with 1, and then it executes the command
• display 1
The next time through, i holds the value of 2. Stata first replaces `i' with the value 2, and then it executes the command
• display 2
This process continues through the values 3, 4, ..., 10.


2.11 "Looping" (capture)
Suppose we make a mistake in the program or we want to modify the program in some way
We first need to drop this program from STATA's memory
• program drop ten
We can then go on to write a new program called ten
It is good practice to drop any program that might exist with the same name before writing a new program
• capture program drop ten


2.11 "Looping"
Our program is now:

capture program drop ten
program ten
  forvalues i = 1(1)10 {
    display `i'
  }
end

To run this program we simply type ten


Another example
Earnings management studies often estimate "abnormal accruals" using the Jones model:

ACCRUALS_it = α0k (1/ASSET_it-1) + α1k (ΔSALES_it / ASSET_it-1) + α2k (PPE_it / ASSET_it-1) + u_it

ACCRUALS_it = change in non-cash current assets minus change in non-debt current liabilities, scaled by lagged assets.
The k subscripts indicate that the model is estimated separately for each industry.
Industries are identified using Standard Industrial Classification (SIC) codes

Page 114

Another example

The number of industries is:
10 using one-digit codes,
100 using two-digit codes,
1,000 using three-digit codes, etc.

Your do file could be very long if you had separate lines for each industry:
Estimate abnormal accruals for SIC = 1
Estimate abnormal accruals for SIC = 2
Estimate abnormal accruals for SIC = 3
...
Estimate abnormal accruals for SIC = 10, etc.

Page 115

Another example

Your do file will be much shorter if you use the looping technique:

capture program drop ab_acc
program ab_acc
forvalues i = 1(1)10 {
insert commands that you want to execute on each industry SIC code
}
end

Page 116

Another example

Go to My Site. Open “accruals.dta” and generate the variables; the regressions will be estimated at the one-digit level.

use "J:\phd\accruals.dta", clear
gen one_sic = int(sic/1000)
gen ncca = current_assets - cash
gen ndcl = current_liabilities - debt_in_current_liabilities
sort cik year
gen ch_ncca = ncca - ncca[_n-1] if cik==cik[_n-1]
gen ch_ndcl = ndcl - ndcl[_n-1] if cik==cik[_n-1]
gen accruals = (ch_ncca - ch_ndcl)/assets[_n-1] if cik==cik[_n-1]
gen lag_assets = assets[_n-1] if cik==cik[_n-1]
gen ppe_scaled = ppe/assets[_n-1] if cik==cik[_n-1]
gen chsales_scaled = (sales - sales[_n-1])/assets[_n-1] if cik==cik[_n-1]
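The [_n-1] subscripts above reference the previous observation after sorting, and the if cik==cik[_n-1] condition stops a firm's first year from picking up the preceding firm's values. A rough Python sketch of the same within-firm differencing, on made-up data (the firm ids and values here are hypothetical):

```python
# Within-firm first differences, mimicking Stata's
#   sort cik year
#   gen ch_x = x - x[_n-1] if cik == cik[_n-1]
# Rows are (cik, year, x); the data are made up for illustration.
rows = [(1, 2000, 10.0), (1, 2001, 13.0), (2, 2000, 5.0), (2, 2001, 9.0)]
rows.sort()  # sort cik year

ch_x = []
for n, (cik, year, x) in enumerate(rows):
    if n > 0 and rows[n - 1][0] == cik:    # same firm as the previous row
        ch_x.append(x - rows[n - 1][2])
    else:
        ch_x.append(None)                  # first year of each firm: missing
```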

Page 117

Another example

gen ab_acc = .
capture program drop ab_acc
program ab_acc
forvalues i = 0(1)9 {
capture reg accruals lag_assets ppe_scaled chsales_scaled if one_sic==`i'
capture predict ab_acc`i' if one_sic==`i', resid
replace ab_acc = ab_acc`i' if one_sic==`i'
capture drop ab_acc`i'
}
end
ab_acc

Page 118

Explaining this program

forvalues i = 0(1)9 {
The one_sic variable takes values from 0 to 9.

capture reg accruals lag_assets ppe_scaled chsales_scaled if one_sic==`i'
The regressions are run at the one-digit level because some industries have insufficient observations at the two-digit level.

capture predict ab_acc`i' if one_sic==`i', resid
For each industry, I create a separate abnormal accrual variable (ab_acc1 if industry #1, ab_acc2 if industry #2, etc.). If this line had been capture predict ab_acc if one_sic==`i', resid, we would not have been able to go beyond industry #1, as ab_acc was already defined.

replace ab_acc = ab_acc`i' if one_sic==`i'
The overall abnormal accrual variable (ab_acc) equals ab_acc1 if industry #1, equals ab_acc2 if industry #2, etc. Before starting the program I had to gen ab_acc=. in order for this replace command to work.

capture drop ab_acc`i'
I drop ab_acc1, ab_acc2, etc. because I only need the ab_acc variable.
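The general pattern here (fit a regression within each group, then collect the residuals into one combined variable) can be sketched outside Stata as well. A minimal Python version with one regressor and made-up data, not the actual Jones-model estimation:

```python
# Per-group regression residuals, mimicking the loop above: for each group,
# fit y = a + b*x by least squares and store the residuals back into one
# combined list. The (group, x, y) data are made up for illustration.
data = [
    (0, 1.0, 2.1), (0, 2.0, 3.9), (0, 3.0, 6.0),
    (1, 1.0, 1.0), (1, 2.0, 1.0), (1, 3.0, 1.0),
]

residuals = [None] * len(data)          # like `gen ab_acc = .`
for g in {row[0] for row in data}:      # like `forvalues i = 0(1)9`
    idx = [n for n, row in enumerate(data) if row[0] == g]
    xs = [data[n][1] for n in idx]
    ys = [data[n][2] for n in idx]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)  # OLS slope within the group
    a = my - b * mx                     # OLS intercept within the group
    for n in idx:                       # like `replace ab_acc = ab_acc`i' ...`
        residuals[n] = data[n][2] - (a + b * data[n][1])
```

As in the Stata program, the combined residual variable is created up front and then filled in group by group, so no group's fit overwrites another's.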

Page 119

Conclusion

You should now have a good understanding of:
how OLS models work,
how to interpret the results of OLS models,
how to find out whether the assumptions of OLS are violated,
how to correct the standard errors for heteroskedasticity and time-series dependence,
how to handle problems of outliers.

Page 120

Conclusion

So far, we have been discussing the case where our dependent variable is continuous (e.g., lnaf).

When the dependent variable is not continuous, we cannot use OLS (or quantile) regression.

The next topic considers how to estimate models where our dependent variable is not continuous.