ch26 exercises

Upload: amisha2562585

Post on 03-Apr-2018


  • 7/28/2019 Ch26 Exercises

    1/14

    9/27/2006

    26 Exercises

    Mix and Match

    For these matching exercises, refer to the multiple regression equation

    \( y = \beta_0 + \beta_1 x + \beta_2 d + \beta_3 x d + \varepsilon \)

    where y and x are numerical variables and d is a dummy variable. y is the annual salary of an employee (in thousands of dollars) and x denotes the years of experience. d is coded as 1 for college grads and is coded as 0 for those lacking a college degree.

    1. Intercept for high school grad                                   a. \( \beta_1 \)
    2. Intercept for college grad                                       b. \( \beta_3 \)
    3. $M/year, high school grad                                        c. \( \beta_0 + \beta_2 \)
    4. $M/year, college grad                                            d. \( \beta_0 + \beta_1 \cdot 10 \)
    5. Difference in slopes                                             e. \( d \times x \)
    6. Difference in intercepts                                         f. \( \varepsilon \)
    7. Interaction                                                      g. \( \beta_1 + \beta_3 \)
    8. Equal variances                                                  h. \( \beta_0 + \beta_2 + (\beta_1 + \beta_3) \cdot 10 \)
    9. Average salary for high school grad with 10 years of experience  i. \( \beta_0 \)
    10. Average salary for college grad with 10 years of experience     j. \( \beta_2 \)
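A minimal sketch of how an equation of this form can be fit, assuming NumPy is available. The data values are made up for illustration (they are not from the text) and lie exactly on two lines, so the least squares fit recovers the coefficients without noise. The names `line_d0` and `line_d1` are hypothetical helpers, not notation from the exercises.

```python
import numpy as np

# Hypothetical, made-up values (not from the text): x = years of experience,
# d = 1 for college grads, y = salary in thousands of dollars.
x = np.array([1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0])
d = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
y = np.where(d == 0, 40.0 + 2.0 * x, 50.0 + 3.0 * x)  # exact lines, no noise

# Design matrix columns: constant, x, d, and the interaction x*d.
X = np.column_stack([np.ones_like(x), x, d, x * d])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2, b3 = b

# Fitted (intercept, slope) within each group, read off the coefficients.
line_d0 = (b0, b1)            # group with d = 0
line_d1 = (b0 + b2, b1 + b3)  # group with d = 1
print(line_d0, line_d1)
```

Fitting one equation with the interaction gives the same two lines as fitting each group separately; the single fit simply reports the second line as a shift from the first.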

    True/False

    11. The two-sample t-test is possibly confounded if the groups differ in ways other than the labeling that distinguishes the groups.

    12. An analysis of covariance is another name for the use of randomization to avoid confounding.

    13. A dummy variable is a numerical encoding using 0s and 1s that distinguishes the members of two groups.

    14. To build the interaction between x and a dummy variable d, we multiply x times d.

    15. If the multiple regression implies parallel fits, the slope of the dummy variable is the difference between the two fitted lines.


    16. A multiple regression with a numerical predictor and a dummy variable as two predictors implies parallel fits to the two groups.

    17. The purpose of an interaction variable is to force fits in the two groups to be parallel.

    18. Interaction variables typically introduce collinearity into a multiple regression and should be removed from the fit if not statistically significant.

    19. If neither the interaction nor dummy variable is statistically significant in an analysis of covariance, then there's no lurking factor that confounds the results of the related two-sample t-test.

    20. To be a confounding variable, the variable must be related to y and to the dummy variable indicating group membership.

    21. A major assumption of the use of regression with dummy variables is that the size of the two groups be approximately the same in order to increase the variation of the dummy variable.

    22. To check the similar variances condition in models with a dummy variable, use comparison boxplots of y versus the categorical variable.

    Think About It

    23. These comparison boxplots show the revenue generated by individual sales representatives who operate in divisions supervised by two different managers. What's the problem with using a two-sample t-test to judge the statistical significance of the apparent difference?

    [Figure: comparison boxplots of Revenue ($M) for Managers A and B]

    Manager   Number   Mean      Std Dev
    A         24       25.0295   5.33875
    B         37       35.9265   6.89008
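For reference, the two-sample t-statistic can be computed directly from a summary table like the one above. This is a small sketch using the unequal-variances (Welch) form; the function name `welch_t` is hypothetical. It only reproduces the arithmetic of the test and says nothing about whether the comparison is confounded, which is what the question asks.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t-statistic (unequal variances) from summary statistics."""
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)  # std error of the difference
    return (mean_b - mean_a) / se

# Summary statistics for Managers A and B, taken from the table above.
t_stat = welch_t(25.0295, 5.33875, 24, 35.9265, 6.89008, 37)
print(round(t_stat, 2))
```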

    24. An auditor collected a random sample of about 100 invoices paid in the current fiscal year and compared the amounts of these invoices to those of a second random sample of invoices paid in the prior fiscal year. These boxplots summarize the amounts (in dollars) of the two sets of invoices.


    [Figure: comparison boxplots of Invoice Amount ($) by fiscal year, 2005 and 2006]

    Fiscal Year   Number   Mean      Std Dev
    2005          111      22199.3   16185.4
    2006          109      25116.3   17702.3

    Would you suggest that the auditor perform a two-sample t-test to compare the mean values of these invoices, or can you suggest one (or more) lurking factors that should be taken into account prior to the comparison?

    25. When fitting the regression of y on x for two groups, we can estimate the slope and intercept within each group by either fitting two simple regressions or by fitting one multiple regression. If simple regressions are so much easier to interpret, why bother to glue them together into one multiple regression?

    26. What assumption is required when we combine two simple regressions into one multiple regression using a dummy variable and an interaction? The MRM requires an assumption that the combination of the two SRMs does not require. What is it, and what condition of the MRM does it affect?

    27. An industry analyst constructed a model describing the cost of building cars at plants operated by different manufacturers in North America. As a first step, the analyst regressed total production cost (in dollars) on the number of labor hours for a sample of vehicles. The data used came from two plants, one operated by a domestic manufacturer under contract with the UAW, the United Automobile Workers, and the other operating a non-unionized plant. (UAW members cost more than nonunion labor; The Wall Street Journal in May 2006 estimated total costs run $74 per hour if benefits are included.) The analyst included a dummy variable in the regression indicating the plant. Do you think the analyst should also include an interaction (between plant and labor hours)?

    28. Matsushita is well known for the efficiency of its automated factories. Facing pressure from developing Asian producers with lower labor costs, the company reconfigured robots in its factory in Saga, Japan. After the modification, it takes 40 minutes to configure the assembly line and start production. Formerly, it took about 20 hours.¹ Once production begins, the plant runs as previously; the robots are the same, only reconfigured to simplify changing tasks. In order to analyze the association between the time to complete a production run (the response) and the number of units produced, how will this modification change the nature of the fitted equation? Do you expect the slope, intercept, and error variance all to change? Note the interpretation of these parameters in the context of this data.

    ¹ Reported in Business Week (7/10/2006), "No one does lean like the Japanese," 40-41.

    29. A two-sample t-test has a lot in common with regression. This output summarizes the results of fitting a simple regression with only a dummy variable as the explanatory variable. The data is the same salary data used in the text, with salary regressed on Group.

    [Figure: scatterplot of Salary ($M) versus Group, with the fitted line]

    R²   0.019116
    se   12.42868
    n    220

    Term        Estimate    Std Error   t Stat   p-value
    Intercept   140.46667   1.43514     97.88    [...]
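The link between a regression on a dummy variable and a comparison of two means can be checked numerically. The sketch below assumes NumPy and uses made-up salaries (not the salary data from the text): the fitted intercept is the mean of the group coded 0, and the fitted slope is the difference between the two group means.

```python
import numpy as np

# Made-up salaries in $M (thousands of dollars); not the data from the text.
salary = np.array([138.0, 142.0, 140.0, 150.0, 146.0, 145.0, 143.0])
group = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Simple regression of salary on the dummy variable alone.
X = np.column_stack([np.ones_like(group), group])
(intercept, slope), *_ = np.linalg.lstsq(X, salary, rcond=None)

mean0 = salary[group == 0].mean()
mean1 = salary[group == 1].mean()
print(intercept, mean0)       # intercept matches the mean of group 0
print(slope, mean1 - mean0)   # slope matches the difference in means
```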


    Term        Estimate    Std Error   t Ratio   Prob>|t|
    Intercept   29.693333   2.551241    11.64     [...]


    [Figure: scatterplot of Minutes versus Units]

    a) If we fit a separate equation to each group, then what is the interpretation of the intercept in either fit? Include the units as part of your description.

    b) What is the interpretation of the slope in either fit? Include the units as part of your description.

    c) Will an analysis of covariance require an interaction term, or can you skip this step and only fit a dummy variable to distinguish the two groups?

    33. The following output summarizes the fit of an analysis of covariance to the data in Question 31. The variable D denotes a dummy variable, with D=1 for values colored green and 0 otherwise.

    Term        Estimate    Std Error   t Stat   p-value
    Intercept   5.7180783   0.13508     42.33    [...]


    You Do It

    35. Emerald diamonds
    These data are a subset of the diamonds used in Chapter 19. This data table of 144 diamonds includes the price (in dollars), the weight (in carats), and the clarity grade of the diamonds. The diamonds have clarity grade either VS1 or VVS1.

    (a) Would it be appropriate to use a two-sample t-test to compare the average prices of VS1 and VVS1 diamonds, or is this relationship confounded by the weights of the diamonds?

    (b) Perform the two-sample t-test to compare the prices of the two clarity grades. Summarize this analysis, assuming that there are no lurking variables.

    (c) Compare the prices of the two types of diamonds using an analysis of covariance. Summarize the comparison of prices based on this analysis. Use a dummy variable coded as 1 for VVS1 diamonds and 0 otherwise. (Assume for the moment that the model meets the conditions for the MRM.)

    (d) Compare the results from (b) and (c). Do they agree? Explain why they agree or differ. You should take into account the precision of the estimates and your answer to (a).

    (e) What problem bedevils the multiple regression used for the analysis of covariance that is not present in the two-sample t-test?

    36. Convenience shopping
    These data expand the data table introduced in Chapter 19 by introducing data from a second location. For each of two service stations operated by a national petroleum refiner, we have the daily sales in the convenience store located at the service station. The data for each day give the sales at the store (in dollars) and the number of gallons of gasoline sold. For Site 1, the data cover 283 days; for Site 2, the data cover 285 days.

    (a) Would it be appropriate for management of this chain of service stations to rate the operators of the convenience stores based on a two-sample comparison of the sales of the convenience stores during these two periods, or would such a comparison be confounded by different levels of traffic (as measured by the volume of gasoline sold)?

    (b) Perform the two-sample t-test to compare the sales of the two service stations. Summarize this analysis, assuming that there are no lurking variables.

    (c) Compare the sales at the two sites using an analysis of covariance. Summarize the comparison of sales based on this analysis. Use a dummy variable coded as 1 for Site 1 and 0 otherwise. (Assume for the moment that the model meets the conditions for the MRM.)

    (d) Compare the results from (b) and (c). Do they agree? Explain why they agree or differ. You should take into account the precision of the estimates and your answer to (a).
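To see why a two-sample comparison and an analysis of covariance can disagree, here is a small sketch on made-up, noise-free data (assuming NumPy; none of these numbers come from the exercise data sets). The two groups follow parallel lines that are 1 unit apart, but group 1 happens to have much larger x values, so the raw difference in means is far larger than the adjusted difference estimated by the dummy variable.

```python
import numpy as np

# Made-up data: group 1 has larger x values than group 0.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
d = np.array([0.0] * 5 + [1.0] * 5)
y = 2.0 + 3.0 * x + 1.0 * d   # parallel lines, 1 unit apart, no noise

# Two-sample comparison: difference in group means, ignoring x.
raw_gap = y[d == 1].mean() - y[d == 0].mean()

# Analysis of covariance: regress y on x and the dummy variable.
X = np.column_stack([np.ones_like(x), x, d])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted_gap = b[2]

print(raw_gap, adjusted_gap)
```

With real data the estimates also carry standard errors, so a comparison like the one in part (d) has to weigh precision as well as the size of the two gaps.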


    (b) Perform the two-sample t-test to compare the average cost per unit at the two plants. Summarize this analysis, assuming that there are no lurking variables.

    (c) Compare the average cost per unit at the two plants using an analysis of covariance. Summarize the comparison based on this analysis. Represent these categories using a dummy variable coded as 1 if the plant is new. (Assume for the moment that the model meets the conditions for the MRM.)

    (d) Compare the results from (b) and (c). Do they agree? Explain why they agree or differ. You should take into account the precision of the estimates and your answer to (a).

    (e) Does the estimated multiple regression used in the analysis of covariance meet the similar variances condition?

    39. Home prices
    This data table expands the data introduced in Chapter 19 on the prices of homes in the Seattle area. One realtor operating in Seattle listed these 28 homes. This table includes prices and sizes of 8 more homes listed by a different realtor in Seattle. As previously, we'll look at the price per square foot, using the reciprocal of the number of square feet as the explanatory variable. In this model, the intercept estimates the variable cost per square foot and the slope of 1/SqFt estimates the fixed costs present regardless of the size of the home.

    (a) Scatterplot the cost per square foot of the homes on the reciprocal of the size of the homes. Do you see a difference in the relationship between cost per square foot and 1/SqFt for the two realtors? Use color-coding or different symbols to distinguish the data of the two realtors.

    (b) Based on your visual impression formed in (a), fit an appropriate regression model that describes the fixed and variable costs for these realtors. Use a dummy variable coded as 1 for Realtor B to represent the different realtors in the regression.

    (c) Does the estimated multiple regression fit in (b) meet the conditions for the MRM?

    (d) Interpret the estimated coefficients from the equation fit in (b), if it is OK to do so. If not, indicate why not.

    (e) Would it be appropriate to use the estimated standard errors shown in the output of your regression estimated in (b) to set confidence intervals for the estimated intercept and slopes? Explain.

    40. Leases (Introduced in Chapter 19)
    This data table includes the annual prices of 223 commercial leases. All of these leases provide office space in a Midwestern city in the US. In previous exercises, we estimated the variable costs (costs that increase with the size of the lease) and fixed costs (those present regardless of the size of the property) using a regression of the cost per square foot on the reciprocal of the number of square feet. The intercept estimates the variable costs and the slope estimates the fixed costs. Some of these leases cover space in the downtown area, whereas others are located in the suburbs. The variable Location identifies these two categories.

    (a) Scatterplot the cost per square foot of the leases on the reciprocal of the square feet of the lease. Do you see a difference in the relationship between cost per square foot and 1/SqFt for the two locations? Use color-coding or different symbols to distinguish the data of the two locations.

    (b) Based on your visual impression formed in (a), fit an appropriate regression model that describes the fixed and variable costs for these leases. Use a dummy variable coded as 1 for leases in the city and 0 for the suburban leases.

    (c) Does the estimated multiple regression fit in (b) meet the conditions for the MRM?

    (d) Interpret the estimated coefficients from the equation fit in (b), if it is OK to do so. If not, indicate why not.

    (e) Would it be appropriate to use the estimated standard errors shown in the output of your regression estimated in (b) to set confidence intervals for the estimated intercept and slopes? Explain.
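As a reminder of why the intercept and slope carry these interpretations in Questions 39 and 40, here is a short derivation under an assumed cost model (the symbols F, V, and D are labels introduced here for illustration, not notation from the exercises). Suppose the total price is a fixed cost F plus a variable cost V per square foot. Dividing by the size gives the equation that is fit, and adding a dummy variable D with its interaction lets both costs differ between the two groups:

```latex
\begin{align*}
\text{Price} &= F + V \cdot \text{SqFt} \\
\frac{\text{Price}}{\text{SqFt}} &= V + F \cdot \frac{1}{\text{SqFt}} \\
\frac{\text{Price}}{\text{SqFt}} &= \beta_0 + \beta_1 \frac{1}{\text{SqFt}}
   + \beta_2 D + \beta_3 \, D \cdot \frac{1}{\text{SqFt}} + \varepsilon
\end{align*}
```

Under this sketch, the group with D = 0 has variable cost \( \beta_0 \) and fixed cost \( \beta_1 \), while the group with D = 1 has variable cost \( \beta_0 + \beta_2 \) and fixed cost \( \beta_1 + \beta_3 \).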

    41. R&D expenses
    This data file contains a variety of accounting and financial values that describe companies operating in technology industries: software, systems design, and semiconductor manufacturing. One column gives the expenses on research and development (R&D), and another gives the total assets of the companies. Both of these columns are reported in millions of dollars. This data table expands previous versions (introduced in Chapter 19) by adding data for 2003 to the data for 2004. To estimate regression models, we need to transform both expenses and assets to a log scale.

    (a) Plot the log of R&D expenses on the log of assets for 2003 and 2004 together in one scatterplot. Use color-coding or distinct symbols to distinguish the groups. Does it appear that the relationship is different in these two years, or can you capture the association with a single simple regression?

    A common question asked when fitting models to subsets is "Do the equations for the two groups differ from each other?" For example, does the equation for 2003 differ from the equation for 2004? We've been answering this question informally, using the t-statistics for the slopes of the dummy variable and interaction. There's just one small problem: we're using two tests to answer one question. What's the chance for a false positive error? If you've got one question, better to use one test.

    To see if there's any difference, we can use a variation on the F test for R². The idea is to test both slopes at once rather than separately. The method uses the change in the size of R². If the R² of the model increases by a statistically significant amount when we add both the dummy variable and interaction to the model, then something changed and the model is different. The form of this incremental, or partial, F test is

    \[ F = \frac{(\text{Change in } R^2) / (\text{number of added slopes})}{(1 - R^2_{\text{full}}) / (n - 1 - q_{\text{full}})} \]


    In this formula, \( q_{\text{full}} \) denotes the number of variables in the model with all the bells and whistles, including dummy variables and interactions. \( R^2_{\text{full}} \) is the R² for that model. As usual, a big value for this F-statistic is 4 or more.
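The incremental F-statistic described above is simple arithmetic on two R² values, sketched here in plain Python. The function name `incremental_f` and the R² values and sample size in the example call are hypothetical, chosen only to illustrate the calculation; they are not results from any data set in these exercises.

```python
def incremental_f(r2_full, r2_reduced, n, q_full, added):
    """Partial F-statistic for adding `added` slopes to a regression.

    r2_full:    R-squared of the model with the added variables
    r2_reduced: R-squared of the model without them
    n:          number of cases
    q_full:     number of explanatory variables in the full model
    """
    numerator = (r2_full - r2_reduced) / added
    denominator = (1.0 - r2_full) / (n - 1 - q_full)
    return numerator / denominator

# Hypothetical values: simple regression R2 = 0.50, full model R2 = 0.60,
# n = 104 cases, 3 variables in the full model, 2 slopes added.
f_stat = incremental_f(r2_full=0.60, r2_reduced=0.50, n=104, q_full=3, added=2)
print(round(f_stat, 2))  # 12.5
```

In this made-up case the statistic is well above 4, so adding the dummy variable and interaction would count as a statistically significant change.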

    (b) Add a dummy variable (coded as 0 for 2004 and 1 for 2003) and its interaction with log assets to the model. Does the fit of this model meet the conditions for the MRM? Comment on the consequences of any problem that you identify.

    (c) Assuming that the model meets the conditions for the MRM, use the incremental F-test to assess the size of the change in R². Does the test agree with your visual impression? (The value of \( q_{\text{full}} \) for the model with dummy and interaction is 3, with 2 slopes added. You will need to fit the simple regression of log R&D expenses on log assets to get the R² from this model.)

    (d) Summarize the fit of the model that best captures what is happening in these two years.

    42. Cars
    The cases that make up this data set are cars. For each of 223 types of cars sold in the US during the 2003 and 2004 model years, we have the base price and the horsepower of the engine (HP). In previous exercises, we found that a model for the association of price and horsepower required taking logs of both variables. (We used base 10 logs.) The column Location denotes the continent of the home country of the manufacturer. (This is a bit loose, since Ford owns Jaguar and GM owns Saab. We coded these as European anyhow. Similarly, we labeled Chrysler as US even though it was absorbed by Daimler, a.k.a. Mercedes.) Alas, we have three groups. To simplify the analysis, we'll compare domestic cars to imports from Europe. The data set for this exercise hence excludes cars from Asian manufacturers.

    (a) Plot the log10 of price on the log10 of horsepower for cars from both groups of manufacturers in one scatterplot. Use color-coding or distinct symbols to distinguish the groups. Does it appear that the relationship is different in these two groups, or can you capture the association with a single simple regression?

    (b) Add a dummy variable (coded as 0 for US and 1 for European designs) and its interaction with log10 HP to the model. Does the fit of this model meet the conditions for the MRM? Comment on the consequences of any problem that you identify.

    (c) Assuming that the model meets the conditions for the MRM, use the incremental F-test to assess the size of the change in R². (See the discussion of this test in Question 41.) Does the test agree with your visual impression? (The value of \( q_{\text{full}} \) for the model with dummy and interaction is 3, with 2 slopes added. You will need to fit the simple regression of log10 price on log10 HP to get the R² from this model.)

    (d) Compare the conclusion of the incremental F-test to the tests of the coefficients of the dummy variable and interaction separately. Do these agree? Explain the similarity or difference.


    43. Movies
    These data (used also in Chapter 20) describe the box-office success of 224 movies released during the years 1998 through 2001. For this analysis, we're interested in the relationship between initial success at the movie theatre and subsequent sales for pay-per-view services, such as those offered by cable television. All of these movies are rated either G or PG, with Audience set to Family, or rated R, with Audience set to Adult. We dropped movies rated PG-13.

    (a) Plot the log10 of subsequent sales on the log10 of the box-office gross for movies from both groups in one scatterplot. Use color-coding or distinct symbols to distinguish the groups. Does it appear that the relationship between box office success and subsequent video sales differs for the two categories, or can you capture the association with a single simple regression?

    (b) Add a dummy variable (coded as 1 for adult audiences and 0 for family audiences) and its interaction with log10 Gross to the model. Does the fit of this model meet the conditions for the MRM? Comment on the consequences of any problem that you identify.

    (c) Assuming that the model meets the conditions for the MRM, use the incremental F-test to assess the size of the change in R². (See the discussion of this test in Question 41.) Does the test agree with your visual impression? (The value of \( q_{\text{full}} \) for the model with dummy and interaction is 3, with 2 slopes added. You will need to fit the simple regression to get its R² for comparison to the multiple regression.)

    (d) Compare the conclusion of the incremental F-test to the tests of the coefficients of the dummy variable and interaction separately. Do these agree? Explain the similarity or difference.

    (e) What's your take on the subsequent success of movies? Does the box-office gross tell you something different about movies intended for adults versus those for the family?

    44. Hiring (Introduced in Chapter 19)
    A firm that operates a large, direct-to-consumer sales force would like to be able to put in place a system to monitor the progress of new agents. A key task for agents is to open new accounts; an account is a new customer to the business. The goal is to identify "superstar" agents as rapidly as possible, offer them incentives, and keep them with the company. To build such a system, the firm has been monitoring sales of new agents over the past two years. The response of interest is the profit to the firm (in dollars) of contracts sold by agents over their first year. Among the possible predictors of this performance is the number of new accounts developed by the agent during the first 3 months of work. Some of these agents were located in new offices, whereas others joined an existing office (see the column labeled Office).

    (a) Plot the log of profit on the log of the number of accounts opened for both groups in one scatterplot. Use color-coding or distinct symbols to distinguish the groups. Does the coloring explain an unusual aspect of the black and white scatterplot? Does a simple regression that ignores the groups provide a reasonable summary?


    (b) Add a dummy variable (coded as 1 for new offices and 0 for existing offices) and its interaction with log Accounts to the model. Does the fit of this model meet the conditions for the MRM? Comment on the consequences of any problem that you identify.

    (c) Assuming that the model meets the conditions for the MRM, use the incremental F-test to assess the size of the change in R². (See the discussion of this test in Question 41.) Does the test agree with your visual impression? (The value of \( q_{\text{full}} \) for the model with dummy and interaction is 3, with 2 slopes added. You will need to fit the simple regression to get its R² for comparison to the multiple regression.)

    (d) Compare the conclusion of the incremental F-test to the tests of the coefficients of the dummy variable and interaction separately. Do these agree? Explain the similarity or difference.

    (e) What's your take on locating new hires in new or existing offices? Would you recommend locating them in one or the other (assuming it could be done without disrupting the current placement procedures)?

    45. Promotion
    These data describe spending by a pharmaceutical company to promote a cholesterol-lowering drug. The data cover 39 consecutive weeks and isolate the metropolitan areas around Boston, Massachusetts, and Portland, Oregon. A subset of this data was introduced in Chapter 19.

    The variables in this collection are shares. Marketing research often describes the level of promotion in terms of voice. In place of the level of spending, voice is the share of advertising devoted to a specific product. Voice puts spending in context; $10 million might seem like a lot for advertising unless everyone else is spending $200 million. The column Market Share is the ratio of sales of this product divided by total sales for such drugs in the Boston area. The column Detail Voice is the ratio of detailing for this drug to the amount of detailing for all cholesterol-lowering drugs in Boston. Detailing counts the number of promotional visits made by representatives of a pharmaceutical company to doctors' offices.

    (a) A hasty analyst fit the regression of Market Share on Detail Voice with the data from both locations combined. The analyst found a very statistically significant slope for Detail Voice, estimated larger than 1 (implying that 1% more share of detailing would get, on average, 1% more of the market). What mistake has the analyst made?

    (b) Propose an alternative model and evaluate whether your alternative model meets the conditions of the MRM so that you can do confidence intervals.

    (c) What's your interpretation of the relationship between detailing and market share? If you can, offer your impression as a range.

    46. iTunes
    The music that you keep on an Apple iPod can be stored digitally in several formats. A popular format for Apple is known as AIFF, short for Audio Interchange File Format. Another format is known as AAC, short for Advanced Audio Coding. Files on an iPod can be in either of these formats, or both. The 596 songs in this data set use a mixture of these two formats.

    (a) Based on the scatterplot of the amount of space needed on the length of the songs, propose a model for how much space (in megabytes, MB) is needed to store a song of a given number of seconds.

    (b) Evaluate whether your model meets the conditions of the MRM so that you can do confidence intervals.

    (c) Interpret the estimated slopes in your model.

    (d) Construct, if appropriate, a prediction interval for the amount of disk space required to store a song that is 240 seconds long using AAC and then AIFF format. How can you get intervals? (Be imaginative: the obvious approach has some problems.)