Total Payroll vs. Winning Percentage in Major League Baseball
Bayesian Statistics, Fall 2014
Lingwen He, Zijian Su, Xiangyu Li, Padraic O’Shea
Introduction
Major League Baseball (MLB) is the last major professional sport in America that has not adopted a salary cap. The lack of a salary cap has led to large differences in total payroll between big-market and small-market teams. This glaring difference has fed the ongoing discussion of whether teams can “buy” wins by spending more money. To investigate whether teams that spend more have higher winning percentages, we explore whether a linear relationship exists between average total payroll and average winning percentage for MLB teams from 2004 to 2012.
Methods
Data
From Baseball-Reference.com we acquired data on regular-season winning percentage by team. These data can be accessed at http://www.baseball-reference.com/leagues/MLB/. Data on total payroll by team were acquired from USA Today and are available at http://content.usatoday.com/sportsdata/baseball/mlb/salaries/team/2004.
To explore the linear relationship between total payroll and winning percentage over time, one data point per team was needed. To calculate these data points, winning percentage and total payroll were collected for the 2004 to 2012 seasons and averaged by team. The predictor variable was rescaled by dividing by one million, so that payroll is expressed in millions of dollars and the regression coefficient is larger and easier to read. Initial diagnostic checks on the ‘averaged’ dataset did not indicate a severe violation of the normality assumption, although the normal QQ plot was not perfectly linear. Three potential outliers were also identified during these checks.
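The averaging and rescaling step can be sketched as follows. This is a minimal illustration in Python, not the code used for the project; the team labels and dollar figures in `seasons` are made-up placeholders, and the helper name `average_by_team` is hypothetical.

```python
from collections import defaultdict


def average_by_team(records):
    """Average payroll (in millions of dollars) and winning percentage per team.

    records: iterable of (team, payroll_in_dollars, win_pct), one per team-season.
    """
    payroll = defaultdict(list)
    pct = defaultdict(list)
    for team, dollars, win_pct in records:
        payroll[team].append(dollars / 1_000_000)  # rescale dollars to millions
        pct[team].append(win_pct)
    return {
        team: (sum(payroll[team]) / len(payroll[team]),
               sum(pct[team]) / len(pct[team]))
        for team in payroll
    }


# Placeholder values for illustration only (not real payroll data)
seasons = [
    ("Team A", 60_000_000, 0.450),
    ("Team A", 70_000_000, 0.550),
    ("Team B", 180_000_000, 0.600),
    ("Team B", 200_000_000, 0.580),
]

averaged = average_by_team(seasons)
for team, (avg_payroll, avg_pct) in averaged.items():
    print(team, round(avg_payroll, 3), round(avg_pct, 3))
```

Each team contributes exactly one (payroll, winning percentage) pair to the regression, which is why the final dataset has 30 rows.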
The possible outliers were removed one at a time, and the analysis was repeated using residual plots and QQ plots. Overall, removing the points did not improve the model or the fit of the distribution, so the dataset containing all points was used for the analysis. The plots used for these checks can be found in Appendix C.
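The one-at-a-time removal can be sketched as a set of least-squares refits. The Python below is our illustration rather than the R session in Appendix C; it assumes the values listed in Appendix D are the ones that were analysed, and the helper name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Averaged dataset from Appendix D (payroll in millions, winning percentage)
x = [63.69154422, 89.76840667, 74.92461167, 140.7180136, 109.4106621,
     99.92325911, 68.43453544, 60.32229233, 67.06203433, 100.8169706,
     83.123759, 55.12364089, 114.6389837, 99.61515889, 48.75051056,
     69.04419133, 74.57996711, 56.90573011, 116.4527793, 199.368707,
     60.03360822, 118.5733706, 44.04681044, 55.889759, 94.06393778,
     92.80824244, 94.66015589, 44.78459711, 72.37762889, 69.80194444]
y = [0.47, 0.54, 0.44, 0.56, 0.48, 0.53, 0.49, 0.49, 0.48, 0.52,
     0.46, 0.40, 0.55, 0.52, 0.50, 0.50, 0.50, 0.47, 0.50, 0.58,
     0.52, 0.53, 0.42, 0.50, 0.50, 0.46, 0.57, 0.49, 0.54, 0.50]

full = fit_line(x, y)
print("all points:", full)

# Refit with each flagged row (1-based row numbers 3, 12 and 24) left out
refits = {}
for row in (3, 12, 24):
    keep = [i for i in range(len(x)) if i != row - 1]
    refits[row] = fit_line([x[i] for i in keep], [y[i] for i in keep])
    print("without row", row, ":", refits[row])
```

If the slope stays positive and of similar size in every refit, no single team is driving the association, which is consistent with the decision to keep all 30 points.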
To begin understanding the data, descriptive statistics may be considered. The mean, median and standard deviation for each variable under consideration are given in Table 1 below. Predictably, average total salary appears skewed to the right, as its mean is greater than its median. Average winning percentage appears to have a relatively normal distribution. Additionally, the standard deviation is fairly large, especially for total salary.
Data (Avg. 2004 to 2012)   Mean     Standard Deviation   Minimum   Median   Maximum
Total Salary               84.657   32.693               44.046    74.752   199.368
Winning %                  0.5003   0.0414               0.40      0.50     0.58
Table 1. Descriptive Statistics for Study Variables (Total Salary in Millions)
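The summary statistics in Table 1 can be recomputed from the averaged dataset in Appendix D. The sketch below is ours, written in Python with the standard `statistics` module; it assumes the Appendix D values are the ones summarised in the table and uses the sample (n - 1) standard deviation, which is an assumption about how the table was produced.

```python
import statistics

# Averaged dataset from Appendix D (payroll in millions, winning percentage)
payroll = [63.69154422, 89.76840667, 74.92461167, 140.7180136, 109.4106621,
           99.92325911, 68.43453544, 60.32229233, 67.06203433, 100.8169706,
           83.123759, 55.12364089, 114.6389837, 99.61515889, 48.75051056,
           69.04419133, 74.57996711, 56.90573011, 116.4527793, 199.368707,
           60.03360822, 118.5733706, 44.04681044, 55.889759, 94.06393778,
           92.80824244, 94.66015589, 44.78459711, 72.37762889, 69.80194444]
pct = [0.47, 0.54, 0.44, 0.56, 0.48, 0.53, 0.49, 0.49, 0.48, 0.52,
       0.46, 0.40, 0.55, 0.52, 0.50, 0.50, 0.50, 0.47, 0.50, 0.58,
       0.52, 0.53, 0.42, 0.50, 0.50, 0.46, 0.57, 0.49, 0.54, 0.50]


def describe(values):
    """Return the five summaries reported in Table 1."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


summary = {"Total Salary": describe(payroll), "Winning %": describe(pct)}
for name, stats in summary.items():
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

A mean above the median for payroll is the numerical signature of the right skew noted in the text.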
Statistical Method
We hypothesize that average total payroll and average winning percentage are linearly associated. To assess this relationship, Bayesian simple linear regression is used with average winning percentage as the response. Two models are fitted. The first uses a non-informative prior, reflecting a lack of prior knowledge about the effect of salary on winning percentage. The second uses an informative prior based on our prior beliefs. The predictive outputs of the two models are then compared.
For the informative prior, N(0.5, 0.05) is used for beta0, since we expect the winning percentage to be 50% with small variance. For beta1, N(0.1, 100) is used, reflecting our lack of knowledge and an expectation that this rate will be positive but not overly large; we expect the variance of beta1 to be large. These priors are written as they appear in the OpenBUGS code in Appendix B.
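One point worth checking when reading these priors: in OpenBUGS, dnorm(mean, tau) is parameterized by the precision tau = 1/variance, not by the variance. The Python sketch below is our illustration and not part of the original analysis; it converts the second arguments of the two priors, as coded in Appendix B, into variances and standard deviations. The function name `precision_to_sd` is hypothetical.

```python
import math


def precision_to_sd(tau):
    """Convert a BUGS-style precision (1 / variance) into a standard deviation."""
    return 1.0 / math.sqrt(tau)


# Priors as coded in Appendix B: dnorm(mean, precision)
priors = {"beta0": (0.5, 0.05), "beta1": (0.1, 100.0)}

implied = {}
for name, (mean, tau) in priors.items():
    implied[name] = {"variance": 1.0 / tau, "sd": precision_to_sd(tau)}
    print(f"{name}: mean={mean}, precision={tau}, "
          f"variance={implied[name]['variance']:.4g}, sd={implied[name]['sd']:.4g}")
```

Read as precisions, the beta0 prior has variance 20 and the beta1 prior has variance 0.01 (standard deviation 0.1), so the verbal description of "small" and "large" variance holds only if the second argument is read as a variance.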
Convergence was assessed from the OpenBUGS output using history plots, auto-correlation plots and MC_error values. Because convergence was rapid, only one chain was used for the MCMC integration; however, this meant that BGR plots, which require multiple chains, could not be used to assess burn-in. Because the initial values were based on intuition and were likely not representative of the posterior distribution, a burn-in of 3000 samples was used to exclude them.
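As a rough guide to reading MC_error, the arithmetic below uses the beta0 values reported in Figure 1 of the Results. It is a Python sketch under stated assumptions: the "naive" standard error treats the 12000 retained draws as independent, and the 5% threshold is the common rule of thumb that MC error should be below about 5% of the posterior standard deviation. OpenBUGS computes MC_error by its own method, so the comparison is indicative only.

```python
import math

# Reported for beta0 under the non-informative prior (Figure 1)
post_sd = 0.006425     # posterior standard deviation
mc_error = 6.46e-5     # Monte Carlo error reported by OpenBUGS
n_samples = 12000      # retained draws after burn-in

# Monte Carlo standard error if the draws were independent
naive_se = post_sd / math.sqrt(n_samples)

# Rule of thumb: MC error below about 5% of the posterior sd
ratio = mc_error / post_sd

# Rough factor by which autocorrelation inflates the Monte Carlo variance
inflation = (mc_error / naive_se) ** 2

print(f"naive SE  = {naive_se:.3e}")
print(f"ratio     = {ratio:.4f}")
print(f"inflation = {inflation:.2f}")
```

A ratio near 1% and an inflation factor close to 1 are what one would expect from a chain with little autocorrelation, in line with the rapid convergence described above.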
Results
The results of the Bayesian simple linear regression models performed using R are given below. Figure 1 contains the node statistics for the non-informative prior. The history plots and auto-correlation plots used for assessing convergence in the non-informative prior model can be found in Appendix A.
node      mean       sd         MC_error   val2.5pc   median     val97.5pc  start  sample
beta0     0.5002     0.006425   6.46E-5    0.4876     0.5002     0.5131     3001   12000
beta1     7.962E-4   2.01E-4    1.935E-6   4.009E-4   7.949E-4   0.001184   3001   12000
mu[1]     0.4835     0.007636   7.081E-5   0.4684     0.4835     0.4989     3001   12000
mu[2]     0.5043     0.006521   6.684E-5   0.4914     0.5043     0.5171     3001   12000
mu[3]     0.4925     0.006691   6.445E-5   0.4792     0.4924     0.5059     3001   12000
mu[4]     0.5449     0.01305    1.345E-4   0.5189     0.5449     0.5704     3001   12000
mu[5]     0.5199     0.008179   8.614E-5   0.5033     0.52       0.536      3001   12000
mu[6]     0.5124     0.007157   7.505E-5   0.498      0.5125     0.5266     3001   12000
mu[7]     0.4873     0.007166   6.734E-5   0.4732     0.4872     0.5018     3001   12000
mu[8]     0.4809     0.008022   7.386E-5   0.465      0.4808     0.497      3001   12000
mu[9]     0.4862     0.007292   6.824E-5   0.4718     0.4862     0.501      3001   12000
mu[10]    0.5131     0.007238   7.598E-5   0.4986     0.5132     0.5275     3001   12000
mu[11]    0.499      0.006429   6.422E-5   0.4864     0.499      0.512      3001   12000
mu[12]    0.4767     0.008688   7.94E-5    0.4594     0.4767     0.4943     3001   12000
mu[13]    0.5241     0.008868   9.323E-5   0.5062     0.5242     0.5415     3001   12000
mu[14]    0.5121     0.00713    7.474E-5   0.4978     0.5122     0.5262     3001   12000
mu[15]    0.4716     0.009598   8.729E-5   0.4525     0.4716     0.491      3001   12000
mu[16]    0.4878     0.007113   6.698E-5   0.4738     0.4877     0.5022     3001   12000
mu[17]    0.4922     0.006711   6.455E-5   0.4789     0.4922     0.5057     3001   12000
mu[18]    0.4781     0.008451   7.74E-5    0.4614     0.4781     0.4951     3001   12000
mu[19]    0.5256     0.009123   9.582E-5   0.5072     0.5257     0.5434     3001   12000
mu[20]    0.5916     0.02402    2.405E-4   0.5445     0.5915     0.6389     3001   12000
mu[21]    0.4806     0.008057   7.414E-5   0.4647     0.4806     0.4968     3001   12000
mu[22]    0.5272     0.00943    9.892E-5   0.5083     0.5274     0.5457     3001   12000
mu[23]    0.4679     0.01032    9.374E-5   0.4475     0.4678     0.4886     3001   12000
mu[24]    0.4773     0.008586   7.852E-5   0.4602     0.4773     0.4947     3001   12000
mu[25]    0.5077     0.006722   6.976E-5   0.4943     0.5078     0.5211     3001   12000
mu[26]    0.5067     0.006652   6.882E-5   0.4935     0.5068     0.5199     3001   12000
mu[27]    0.5082     0.006758   7.023E-5   0.4947     0.5082     0.5216     3001   12000
mu[28]    0.4685     0.0102     9.27E-5    0.4483     0.4684     0.4889     3001   12000
mu[29]    0.4905     0.006852   6.532E-5   0.4769     0.4904     0.5042     3001   12000
mu[30]    0.4884     0.007049   6.655E-5   0.4745     0.4883     0.5027     3001   12000
postprob  0.9998     0.01581    1.431E-4   1.0        1.0        1.0        3001   12000
sigma     0.03472    0.004829   4.703E-5   0.02672    0.03418    0.04542    3001   12000
tausq     876.2      235.3      2.254      484.9      856.2      1401.0     3001   12000
Figure 1. Node Statistics for Non-Informative Prior
Figure 2 contains the node statistics for the informative prior. The history plots and auto-correlation plots used for assessing convergence in the informative prior model can be found in Appendix A.
node      mean       sd         MC_error   val2.5pc   median     val97.5pc  start  sample
beta0     0.5002     0.006425   6.46E-5    0.4876     0.5002     0.5131     3001   12000
beta1     7.966E-4   2.01E-4    1.935E-6   4.013E-4   7.952E-4   0.001184   3001   12000
mu[1]     0.4835     0.007636   7.081E-5   0.4684     0.4835     0.4989     3001   12000
mu[2]     0.5043     0.006521   6.684E-5   0.4914     0.5043     0.5171     3001   12000
mu[3]     0.4925     0.006691   6.445E-5   0.4792     0.4924     0.5059     3001   12000
mu[4]     0.5449     0.01305    1.345E-4   0.5189     0.5449     0.5705     3001   12000
mu[5]     0.52       0.008179   8.614E-5   0.5033     0.52       0.536      3001   12000
mu[6]     0.5124     0.007157   7.506E-5   0.498      0.5125     0.5266     3001   12000
mu[7]     0.4873     0.007166   6.734E-5   0.4732     0.4872     0.5018     3001   12000
mu[8]     0.4808     0.008022   7.386E-5   0.465      0.4808     0.497      3001   12000
mu[9]     0.4862     0.007292   6.824E-5   0.4718     0.4862     0.501      3001   12000
mu[10]    0.5131     0.007238   7.598E-5   0.4986     0.5132     0.5275     3001   12000
mu[11]    0.499      0.006429   6.421E-5   0.4864     0.499      0.512      3001   12000
mu[12]    0.4767     0.008688   7.94E-5    0.4593     0.4767     0.4943     3001   12000
mu[13]    0.5241     0.008868   9.323E-5   0.5062     0.5242     0.5415     3001   12000
mu[14]    0.5122     0.00713    7.474E-5   0.4978     0.5122     0.5262     3001   12000
mu[15]    0.4716     0.009598   8.73E-5    0.4525     0.4716     0.4909     3001   12000
mu[16]    0.4878     0.007113   6.698E-5   0.4738     0.4877     0.5022     3001   12000
mu[17]    0.4922     0.006711   6.455E-5   0.4789     0.4922     0.5057     3001   12000
mu[18]    0.4781     0.008451   7.74E-5    0.4614     0.4781     0.4951     3001   12000
mu[19]    0.5256     0.009123   9.582E-5   0.5072     0.5257     0.5434     3001   12000
mu[20]    0.5916     0.02402    2.405E-4   0.5445     0.5915     0.639      3001   12000
mu[21]    0.4806     0.008057   7.414E-5   0.4646     0.4806     0.4968     3001   12000
mu[22]    0.5273     0.00943    9.892E-5   0.5083     0.5274     0.5457     3001   12000
mu[23]    0.4679     0.01032    9.374E-5   0.4475     0.4678     0.4885     3001   12000
mu[24]    0.4773     0.008585   7.853E-5   0.4602     0.4773     0.4947     3001   12000
mu[25]    0.5077     0.006722   6.976E-5   0.4943     0.5078     0.5211     3001   12000
mu[26]    0.5067     0.006652   6.882E-5   0.4935     0.5068     0.5199     3001   12000
mu[27]    0.5082     0.006758   7.024E-5   0.4947     0.5082     0.5216     3001   12000
mu[28]    0.4685     0.0102     9.27E-5    0.4483     0.4684     0.4889     3001   12000
mu[29]    0.4905     0.006852   6.532E-5   0.4769     0.4904     0.5042     3001   12000
mu[30]    0.4884     0.007049   6.655E-5   0.4745     0.4883     0.5027     3001   12000
postprob  0.9998     0.01581    1.431E-4   1.0        1.0        1.0        3001   12000
sigma     0.03472    0.004829   4.704E-5   0.02672    0.03418    0.04542    3001   12000
tausq     876.2      235.3      2.254      484.9      856.2      1401.0     3001   12000
Figure 2. Node Statistics for Informative Prior
Discussion
Based on our analyses, we found a positive relationship between average total payroll and average winning percentage in Major League Baseball for the years 2004 to 2012. For both the non-informative and informative models, the statistics for postprob indicate that Pr(β1≥0|{y}) is about 0.9998; in other words, there is a greater than 99% posterior probability that beta1 > 0. These findings are supported by the posterior means of beta1 and by its 95% credible sets, which lie entirely above zero. The slope is small in absolute terms: a posterior mean of about 0.0008 per million dollars corresponds to roughly 0.008 in winning percentage for each additional $10 million of average payroll. Therefore, there does appear to be a linear association between average total payroll and average winning percentage for MLB teams.
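The postprob node is the posterior mean of an indicator, step(beta1), which equals 1 whenever a draw of beta1 is non-negative. The Python sketch below mimics that calculation. It is not the OpenBUGS run itself: it assumes a normal approximation to the posterior of beta1, using the mean and standard deviation reported in Figure 1, and the seed and variable names are ours.

```python
import random
from statistics import NormalDist

# Posterior summary for beta1 under the non-informative prior (Figure 1)
post_mean, post_sd = 7.962e-4, 2.01e-4

# Closed form under the normal approximation (an assumption)
approx_prob = 1.0 - NormalDist(post_mean, post_sd).cdf(0.0)

# Monte Carlo analogue of postprob <- step(beta1):
# average the indicator "beta1 >= 0" over simulated posterior draws
rng = random.Random(2014)
draws = [rng.gauss(post_mean, post_sd) for _ in range(12000)]
mc_prob = sum(1 for b in draws if b >= 0) / len(draws)

print(f"normal approximation: {approx_prob:.5f}")
print(f"Monte Carlo estimate: {mc_prob:.5f}")
```

Both numbers should sit very close to 1, as the reported postprob does; the actual posterior has somewhat heavier tails than a normal, so the OpenBUGS value can be marginally smaller.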
There was very little difference between the results under the non-informative and informative priors. Our belief is that this is due to the informative prior being very consistent with the data. The posterior mean and median of beta1 under the informative prior are slightly larger than those under the non-informative prior (for example, 7.966E-4 vs. 7.962E-4 for the mean). This may be an indication that our non-informative prior fits the data better, but the difference is very small.
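A precision-weighted average gives a quick check on why the two sets of results are so close. The Python sketch below is an approximation of ours, not part of the original analysis: it treats the flat-prior posterior of beta1 (Figure 1) as a normal summary of the data and combines it with the dnorm(0.1, 100) prior from Appendix B, reading 100 as a precision as OpenBUGS does.

```python
# Informative prior for beta1 as coded in Appendix B: dnorm(mean, precision)
prior_mean, prior_prec = 0.1, 100.0

# Flat-prior posterior of beta1 (Figure 1), used here as a summary of the data
flat_mean, flat_sd = 7.962e-4, 2.01e-4
data_prec = 1.0 / flat_sd ** 2

# Normal-normal update: precisions add, means are precision-weighted
post_prec = prior_prec + data_prec
post_mean = (prior_prec * prior_mean + data_prec * flat_mean) / post_prec
post_sd = post_prec ** -0.5

print(f"data precision  = {data_prec:.4g}")
print(f"posterior mean  = {post_mean:.4e}")
print(f"posterior sd    = {post_sd:.4e}")
```

The data precision is several orders of magnitude larger than the prior precision, so the prior can move the posterior mean only in the fourth significant digit, in the direction of the prior mean of 0.1. This is consistent with the slightly larger beta1 reported in Figure 2.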
If the linear relationship between average total payroll and average winning percentage for MLB teams were explored further, more prior information about the parameters would help improve the analysis. Additionally, if an ‘averaged’ dataset were used in a follow-up study, including more years would be advisable. Finally, although this analysis suggests that a positive relationship exists between total payroll and winning percentage, it would be important to account for ongoing changes in the league. Most notable is the debate over using statistics such as on-base percentage, rather than traditional baseball measurements of success, to project wins. This ongoing development is affecting the perceived value of many players and may drastically affect a team’s salary and winning percentage.
References
Our dataset was constructed by combining historical Major League Baseball team salaries and winning percentages. Data for both variables were drawn from the same time period, 2004 to 2012. The MLB data sources are listed below:
Baseball-Reference.com. (2014). Team Wins. Retrieved from http://www.baseball-reference.com/leagues/MLB/
USA Today. (2014). USATODAY Salaries Database, MLB salaries by team for various years (2004 to 2014). Retrieved from http://content.usatoday.com/sportsdata/baseball/mlb/salaries/team/2004
Appendix A: MCMC Integration Convergence
Figures 3 and 4 below are the history plots and auto-correlation plots, respectively, for the non-informative prior. From these plots it was assessed that convergence occurred quickly for every variable. Convergence of postprob was assessed using the MC_error reported in the Results section of this paper.
Figure 3. History Plots for Non-Informative Prior
Figure 4. Auto-Correlation Plots for Non-Informative Prior
Figures 5 and 6 below are the history plots and auto-correlation plots, respectively, for the informative prior. From these plots it was assessed that convergence occurred quickly for every variable. Convergence of postprob was assessed using the MC_error reported in the Results section of this paper.
Figure 5. History Plots for Informative Prior
Figure 6. Auto-Correlation Plots for Informative Prior
Appendix B: OpenBUGS Code
Non-informative Prior
model {
  # center the predictor so that beta0 is the expected response at the mean payroll
  for (i in 1:N){
    xcent[i] <- x[i] - mean(x[])
  }
  for (i in 1:N){
    mu[i] <- beta0 + beta1*xcent[i]
    y[i] ~ dnorm(mu[i], tausq)    # second argument of dnorm is a precision
  }
  postprob <- step(beta1)         # 1 when beta1 >= 0; its mean estimates Pr(beta1 >= 0 | y)
  beta0 ~ dflat()
  beta1 ~ dflat()
  tausq ~ dgamma(0.001, 0.001)
  sigma <- 1/sqrt(tausq)
}
#data
list(x=c(63.69154422, 89.76840667, 74.92461167, 140.7180136, 109.4106621,
  99.92325911, 68.43453544, 60.32229233, 67.06203433, 100.8169706,
  83.123759, 55.12364089, 114.6389837, 99.61515889, 48.75051056,
  69.04419133, 74.57996711, 56.90573011, 116.4527793, 199.368707,
  60.03360822, 118.5733706, 44.04681044, 55.889759, 94.06393778,
  92.80824244, 94.66015589, 44.78459711, 72.37762889, 69.80194444),
  y=c(0.47, 0.54, 0.44, 0.56, 0.48, 0.53, 0.49, 0.49, 0.48, 0.52,
  0.46, 0.40, 0.55, 0.52, 0.50, 0.50, 0.50, 0.47, 0.50, 0.58,
  0.52, 0.53, 0.42, 0.50, 0.50, 0.46, 0.57, 0.49, 0.54, 0.50),
  N=30)
#inits
list(beta0=0, beta1=0, tausq=1)

Informative Prior
model {
  for (i in 1:N){
    xcent[i] <- x[i] - mean(x[])
  }
  for (i in 1:N){
    mu[i] <- beta0 + beta1*xcent[i]
    y[i] ~ dnorm(mu[i], tausq)
  }
  postprob <- step(beta1)
  beta0 ~ dnorm(0.5, 0.05)        # dnorm(mean, precision)
  beta1 ~ dnorm(0.1, 100)         # dnorm(mean, precision)
  tausq ~ dgamma(0.001, 0.001)
  sigma <- 1/sqrt(tausq)
}
#data
list(x=c(63.69154422, 89.76840667, 74.92461167, 140.7180136, 109.4106621,
  99.92325911, 68.43453544, 60.32229233, 67.06203433, 100.8169706,
  83.123759, 55.12364089, 114.6389837, 99.61515889, 48.75051056,
  69.04419133, 74.57996711, 56.90573011, 116.4527793, 199.368707,
  60.03360822, 118.5733706, 44.04681044, 55.889759, 94.06393778,
  92.80824244, 94.66015589, 44.78459711, 72.37762889, 69.80194444),
  y=c(0.47, 0.54, 0.44, 0.56, 0.48, 0.53, 0.49, 0.49, 0.48, 0.52,
  0.46, 0.40, 0.55, 0.52, 0.50, 0.50, 0.50, 0.47, 0.50, 0.58,
  0.52, 0.53, 0.42, 0.50, 0.50, 0.46, 0.57, 0.49, 0.54, 0.50),
  N=30)
#inits
list(beta0=0, beta1=0, tausq=1)
Appendix C: Inference (R code)
Complete Dataset
> data.bb=read.table('C://Users/xli63/Desktop/Baseball.txt', header=TRUE)
> attach(data.bb)
> head(data.bb)
> x <- data.bb$AverageTotalPayroll
> pct <- data.bb$AveragePCT
> lm.out=lm(pct~x)
> summary(lm.out)

Call:
lm(formula = pct ~ x)

Residuals:
      Min        1Q    Median        3Q       Max
-0.076797 -0.013157  0.007243  0.020457  0.061695

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4328659  0.0168386  25.707  < 2e-16 ***
x           0.0007969  0.0001860   4.286 0.000194 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03274 on 28 degrees of freedom
Multiple R-squared: 0.3961, Adjusted R-squared: 0.3746
F-statistic: 18.37 on 1 and 28 DF, p-value: 0.0001944
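The intercept reported by lm (about 0.433) differs from the beta0 reported by OpenBUGS (about 0.500) because the OpenBUGS models center the predictor, while this lm fit does not. The Python sketch below, which is ours and assumes the Appendix D values are the ones analysed, fits the same least-squares line with and without centering to show that the slope is unchanged and only the intercept moves. The helper name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Averaged dataset from Appendix D (payroll in millions, winning percentage)
x = [63.69154422, 89.76840667, 74.92461167, 140.7180136, 109.4106621,
     99.92325911, 68.43453544, 60.32229233, 67.06203433, 100.8169706,
     83.123759, 55.12364089, 114.6389837, 99.61515889, 48.75051056,
     69.04419133, 74.57996711, 56.90573011, 116.4527793, 199.368707,
     60.03360822, 118.5733706, 44.04681044, 55.889759, 94.06393778,
     92.80824244, 94.66015589, 44.78459711, 72.37762889, 69.80194444]
y = [0.47, 0.54, 0.44, 0.56, 0.48, 0.53, 0.49, 0.49, 0.48, 0.52,
     0.46, 0.40, 0.55, 0.52, 0.50, 0.50, 0.50, 0.47, 0.50, 0.58,
     0.52, 0.53, 0.42, 0.50, 0.50, 0.46, 0.57, 0.49, 0.54, 0.50]

xbar = sum(x) / len(x)
x_centered = [xi - xbar for xi in x]

b0_raw, b1_raw = fit_line(x, y)                # as in the lm fit above
b0_cent, b1_cent = fit_line(x_centered, y)     # as in the OpenBUGS models

print("uncentered:", b0_raw, b1_raw)
print("centered:  ", b0_cent, b1_cent)
print("check:     ", b0_cent - b1_cent * xbar)  # recovers the uncentered intercept
```

With a centered predictor the intercept is simply the mean winning percentage, which is why beta0 sits at about 0.50 in Figures 1 and 2.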
Reduced Model (#3 Removed)
> data_up2 <- data.bb[-c(3),]
> xnew2 <- data_up2$AverageTotalPayroll
> pct2 <- data_up2$AveragePCT
> red_residual_line2 <- lm(pct2~xnew2)
> summary(red_residual_line2)

Call:
lm(formula = pct2 ~ xnew2)

Residuals:
      Min        1Q    Median        3Q       Max
-0.079121 -0.011606  0.005706  0.018941  0.060047

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4361350  0.0164219  26.558  < 2e-16 ***
xnew2       0.0007798  0.0001804   4.323 0.000187 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03171 on 27 degrees of freedom
Multiple R-squared: 0.4091, Adjusted R-squared: 0.3872
F-statistic: 18.69 on 1 and 27 DF, p-value: 0.0001872
Reduced Model (#12 Removed)
> data_new <- data.bb[-c(12),]
> xnew <- data_new$AverageTotalPayroll
> pct_new <- data_new$AveragePCT
> remove_residual_line <- lm(pct_new~xnew)
> summary(remove_residual_line)

Call:
lm(formula = pct_new ~ xnew)

Residuals:
      Min        1Q    Median        3Q       Max
-0.056063 -0.013108  0.004436  0.016632  0.059747

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4421938  0.0156408  28.272  < 2e-16 ***
xnew        0.0007190  0.0001709   4.208 0.000255 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02964 on 27 degrees of freedom
Multiple R-squared: 0.396, Adjusted R-squared: 0.3737
F-statistic: 17.7 on 1 and 27 DF, p-value: 0.0002551
Reduced Model (#24 Removed)
> data_up1 <- data.bb[-c(24),]
> xnew1 <- data_up1$AverageTotalPayroll
> pct1 <- data_up1$AveragePCT
> red_residual_line <- lm(pct1~xnew1)
> plot(red_residual_line)
> summary(red_residual_line)

Call:
lm(formula = pct1 ~ xnew1)

Residuals:
      Min        1Q    Median        3Q       Max
-0.075337 -0.013510  0.007229  0.017961  0.062273

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4301762  0.0174143  24.702  < 2e-16 ***
xnew1       0.0008193  0.0001903   4.305 0.000196 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03304 on 27 degrees of freedom
Multiple R-squared: 0.4071, Adjusted R-squared: 0.3851
F-statistic: 18.54 on 1 and 27 DF, p-value: 0.0001965
Residual & QQ plots Based on Complete Dataset
Residual & QQ plots Based on Reduced Model (Data point 3 Removed)
[Plot: y against x]
Residual & QQ plots Based on Reduced Model (Data point 12 Removed)
Residual & QQ plots Based on Reduced Model (Data point 24 Removed)
Appendix D: ‘Averaged’ Dataset
AverageTotalPayroll   AveragePCT
63.69154422           0.47
89.76840667           0.54
74.92461167           0.44
140.7180136           0.56
109.4106621           0.48
99.92325911           0.53
68.43453544           0.49
60.32229233           0.49
67.06203433           0.48
100.8169706           0.52
83.123759             0.46
55.12364089           0.40
114.6389837           0.55
99.61515889           0.52
48.75051056           0.50
69.04419133           0.50
74.57996711           0.50
56.90573011           0.47
116.4527793           0.50
199.368707            0.58
60.03360822           0.52
118.5733706           0.53
44.04681044           0.42
55.889759             0.50
94.06393778           0.50
92.80824244           0.46
94.66015589           0.57
44.78459711           0.49
72.37762889           0.54
69.80194444           0.50
*For Average Total Payroll 10.1 = 10,100,000
Contributions
Project proposal: All Members
OpenBUGS/R Computing:
- Non-informative prior: Lingwen He
- Informative prior: Zijian Su
- Inference: Xiangyu Li
- Additional Computing: Lingwen He, Zijian Su, Xiangyu Li
Interim report: All Members
Final report writing and formatting: Padraic O’Shea