I have tried to improve the model I used last week. It was an OLS (linear) regression model (next week I will present a logit or binary model, which is non-linear and simply looks at whether a player has made the Hall or not; it seems very accurate). This week I converted some of last week's variables into non-linear variables and added two new variables. At the end of this post are links to some other research and discussion on this issue.
Here is the regression equation:
PCT = -.041 + .054*MVP + .432*3000H + .172*500HR + .004*ASSQ10 + .001*GGSQ7 + .074*500SB + .00001*WSIMPSQ50 + .102*10000PA
I will explain what the variables mean below. The adjusted r-squared was .901 (last week it was .839), so 90.1% of the variation across players is explained by the equation. The standard error was .081, down from .102. There were 181 players: everyone who came up for the first time from 1990-2009, except Pete Rose (see last week). MVP is the number of MVP awards won. 3000H is a dummy variable (1 if a player reached 3,000 hits, 0 otherwise), as are 500HR, 500SB, and 10000PA (1 if you reached 10,000 career plate appearances, 0 otherwise). I used all the voting data from 1990-2009.
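The fitted equation can be written out as a simple function. Here is a minimal sketch in Python using the coefficients above; the function and argument names are mine, not from the post, and the capped-square transforms match the variable definitions below.

```python
# A sketch of the fitted OLS equation. Coefficients are taken from the
# post; names for the function and its arguments are my own.

def capped_square(x, cap):
    """Cap a raw count at `cap`, then square it."""
    return min(x, cap) ** 2

def predict_pct(mvp, h3000, hr500, as_games, gold_gloves,
                sb500, ws_imp, pa10000):
    """Predicted share of the Hall of Fame vote (can fall outside 0-1)."""
    return (-0.041
            + 0.054 * mvp
            + 0.432 * h3000                          # dummy: 3,000 hits
            + 0.172 * hr500                          # dummy: 500 home runs
            + 0.004 * capped_square(as_games, 10)    # ASSQ10
            + 0.001 * capped_square(gold_gloves, 7)  # GGSQ7
            + 0.074 * sb500                          # dummy: 500 stolen bases
            + 0.00001 * capped_square(ws_imp, 50)    # WSIMPSQ50
            + 0.102 * pa10000)                       # dummy: 10,000 PAs
```

Note that a player with none of these credentials predicts at the intercept, -.041, i.e. effectively 0% of the vote, and that All-Star games beyond the cap of 10 add nothing to the prediction.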
What is ASSQ10? It is the number of All-Star games played, squared, with games played capped at 10. The assumption here is that being an All-Star has a positive exponential effect, but only up to a point beyond which additional games don't help (I have a graph below to help explain this). GGSQ7 is the same thing for Gold Gloves, capped at 7.
WSIMPSQ50 involves World Series play. First, WSIMP is World Series PAs times OPS. The idea here is that the more you play in the World Series, the more votes you get, but multiplying by OPS also captures how well you played (or at least hit). WSIMP gets capped at 50 and squared, for the same reason as All-Star games (yes, Reggie Jackson is first here and way ahead of everyone else at 141, with Dave Justice and Lonnie Smith tied for 2nd at 101).
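The WSIMPSQ50 construction can be sketched in a few lines. The sample numbers below are hypothetical, not any real player's line:

```python
# A sketch of the WSIMPSQ50 variable: World Series plate appearances
# times OPS, capped at 50, then squared.

def wsimp_sq50(ws_pa, ws_ops, cap=50):
    raw = ws_pa * ws_ops      # playing time weighted by how well you hit
    return min(raw, cap) ** 2

# A hypothetical 60-PA, .900-OPS World Series career: raw WSIMP = 54,
# which exceeds the cap, so the regressor becomes 50 squared.
print(wsimp_sq50(60, 0.900))  # 2500
```

The cap means Reggie Jackson's league-leading raw WSIMP of 141 enters the regression at the same value as anyone else who reached 50.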
All of the variables were significant at the 10% level except for GGSQ7, which came close with a p-value of .13. The other variables all had p-values of under .02 with 6 under .01.
This is the best linear model I have come up with so far, after trying many different variables. 154 of the 181 players (85.1%) were predicted to within 10 percentage points, and 126 (69.6%) to within 5.
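The within-k accuracy figures above are just the share of players whose predicted vote share lands within k points of the actual result. A quick sketch, with made-up actual/predicted pairs for illustration:

```python
# Share of players predicted to within k of their actual vote share.
# The sample actual/predicted pairs below are illustrative, not the
# post's full 181-player data set.

def share_within(actual, predicted, k):
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= k)
    return hits / len(actual)

actual    = [0.055, 0.852, 0.298, 0.017]
predicted = [0.358, 0.757, 0.218, 0.027]
print(share_within(actual, predicted, 0.10))  # 0.75 (3 of 4 within 10 points)
```

Applied to the full sample, 154 of 181 within 10 points works out to the 85.1% quoted above.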
The graph below illustrates what I mentioned above about squaring and capping variables. Notice that the line increases exponentially, then flattens.
Below are the players who had the biggest negative prediction differentials (actual vote share first, then actual minus predicted). Lynn, for example, was predicted to get 35.8% (.358) of the vote but got only 5.5% (.055).
0.055 *** -0.303 Lynn, Fred
0.235 *** -0.230 McGwire, Mark
0.017 *** -0.180 Bell, Buddy
0.083 *** -0.154 Nettles, Graig
0.053 *** -0.152 Baines, Harold
0.017 *** -0.151 Parrish, Lance
0.005 *** -0.144 Lopes, Davey
0.068 *** -0.140 Concepcion, Dave
0.019 *** -0.131 Cey, Ron
0.051 *** -0.130 Hernandez, Keith
Now the players who had the biggest positive prediction differentials.
0.298 *** 0.080 Rice, Jim
0.852 *** 0.095 Molitor, Paul
0.157 *** 0.109 Trammell, Alan
0.965 *** 0.111 Schmidt, Mike
0.775 *** 0.126 Yount, Robin
0.818 *** 0.164 Morgan, Joe
0.500 *** 0.189 Perez, Tony
0.664 *** 0.296 Fisk, Carlton
0.917 *** 0.319 Smith, Ozzie
0.821 *** 0.394 Puckett, Kirby
I also ran a regression which had the following variables. None of them were squared or made non-linear. 2B, SS and C are positional dummies. All of these variables were significant at the 10% level; 9 were significant at the 5% level and 7 at the 1% level. But the adjusted r-squared was .763 and the standard error was .126, so it does not work nearly as well as the model mentioned above.
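Since both models are judged on adjusted r-squared, it may help to show the standard adjustment, which penalizes extra regressors. The raw r-squared value below is hypothetical; n matches the post's 181 players and k the main model's 8 regressors:

```python
# Standard adjusted r-squared: 1 - (1 - r2) * (n - 1) / (n - k - 1),
# where n is the number of observations and k the number of regressors.
# The raw r2 of .905 is a hypothetical value for illustration.

def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.905, 181, 8), 3))  # 0.901
```

With n = 181 and only 8 regressors the penalty is small, so the raw and adjusted values are close; the adjustment matters more when comparing against models with many more variables.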
Now some links to other research:
Baseball Hall of Fame Voting: A Test of the Customer Discrimination Hypothesis by Arna Desser, James Monks and Michael Robinson
Social Science Quarterly Sept 1999 v80 i3 p591(13)
Discussion of above article
A neural-net Hall of Fame prediction method
Teaching Statistical Thinking Using the Baseball Hall of Fame by Steven Wang
Who's missing from the Hall of Fame by JC Bradbury
Modeling Election to the Major League Baseball Hall of Fame through the use of Genetic Algorithms By David Cohen