## Monday, April 20, 2009

### What Determines Vote Percentage In The First Year Of Hall Of Fame Eligibility?

I tried several models, and I may update this post in the next several days to explain what some of them were and why I am posting this one first. Using OLS regression, here is the equation I came up with:

Pct = -.08 + .00011*SB + .00741*GG + .071*MVP + .032*AS + .512*3000H + .29*500HR

These are all career totals. GG is the number of Gold Gloves won, MVP is the number of MVP awards won, and 3000H is a dummy variable (1 if a player reached it, 0 otherwise). 500HR is also a dummy variable. I used all the voting data from 1990-2009. Any player who had received votes before 1990 was not counted. Pete Rose was not included, since leaving him out improved the results and he got nowhere near what the model predicts (usually about 70-80 percentage points lower; in 1992 he got just 9.5% of the vote). So the scandals and controversies probably played a role. Mark McGwire also got a lower % than predicted, but the gap was nowhere near as large as it was for Rose.
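To make the equation concrete, here is a minimal Python sketch that plugs career totals into it. The coefficients are copied from the equation above; the function name, the argument names, and the sample player are hypothetical and only for illustration. AS is passed through as a plain count because the post does not spell out how it was measured.

```python
# Coefficients copied from the regression equation in the post.
COEFFICIENTS = {
    "intercept": -0.08,
    "SB": 0.00011,
    "GG": 0.00741,
    "MVP": 0.071,
    "AS": 0.032,
    "3000H": 0.512,
    "500HR": 0.29,
}


def predict_first_year_pct(sb, gg, mvp, all_star, hits, home_runs):
    """Return the predicted first-year vote share as a fraction (0.35 = 35%)."""
    # The two milestone terms are dummies: 1 if reached, 0 otherwise.
    reached_3000_hits = 1 if hits >= 3000 else 0
    reached_500_hr = 1 if home_runs >= 500 else 0
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["SB"] * sb
        + COEFFICIENTS["GG"] * gg
        + COEFFICIENTS["MVP"] * mvp
        + COEFFICIENTS["AS"] * all_star
        + COEFFICIENTS["3000H"] * reached_3000_hits
        + COEFFICIENTS["500HR"] * reached_500_hr
    )


# Hypothetical player (not a real career line): neither milestone reached.
example = predict_first_year_pct(
    sb=100, gg=5, mvp=1, all_star=10, hits=2500, home_runs=300
)
print(f"{example:.3f}")
```

Because the model is linear, nothing keeps a prediction between 0 and 1, so a very decorated player can come out above 100%.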

There were 182 players. The adjusted r-squared was .839 and the standard error was .104 (that is the lowest I got, and I looked at many models with different combinations of many variables). Perhaps this kind of fishing expedition, just looking for the most accurate regression, is not legitimate. But maybe it is the only way to figure out what the voters care about. All of the variables were significant at the 5% level (the highest p-value was about .04 and the next highest was .012).
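For readers who want to see how these two fit statistics are defined, here is a small sketch using the textbook formulas. The 182 players and the six predictors come from the post; the function names are my own, and I am assuming the reported standard error is the usual standard error of the regression (square root of the residual sum of squares over the degrees of freedom).

```python
import math


def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 penalizes R^2 for the number of predictors used."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


def r_squared_from_adjusted(adj_r_squared, n_obs, n_predictors):
    """Invert the adjustment to recover the plain R^2."""
    return 1 - (1 - adj_r_squared) * (n_obs - n_predictors - 1) / (n_obs - 1)


def regression_standard_error(residuals, n_predictors):
    """sqrt(SSR / (n - k - 1)), where k excludes the intercept."""
    degrees_of_freedom = len(residuals) - n_predictors - 1
    ssr = sum(e * e for e in residuals)
    return math.sqrt(ssr / degrees_of_freedom)


N_PLAYERS = 182    # from the post
N_PREDICTORS = 6   # SB, GG, MVP, AS, 3000H, 500HR

implied_r2 = r_squared_from_adjusted(0.839, N_PLAYERS, N_PREDICTORS)
print(round(implied_r2, 3))
```

With 182 observations and only six predictors the adjustment is small, so the plain r-squared implied by these numbers is only slightly above .839.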

McGwire did have the biggest negative difference between what he actually got and his predicted %. The equation predicts about 50.8% while he only got 23.5%, for a differential (actual minus predicted) of -.273. But Fred Lynn came in .262 below his predicted value of .317. Here are the bottom ten in differential:

| Differential | Player |
| --- | --- |
| -0.27344 | McGwire, Mark |
| -0.26287 | Lynn, Fred |
| -0.19284 | Hernandez, Keith |
| -0.1864 | Ripken, Cal |
| -0.15364 | Parrish, Lance |
| -0.14969 | Concepcion, Dave |
| -0.14829 | Murphy, Dale |
| -0.14131 | Cedeno, Cesar |
| -0.13065 | Fernandez, Tony |
| -0.13031 | McGee, Willie |

Now the top ten:

| Differential | Player |
| --- | --- |
| 0.12798 | Brett, George |
| 0.12806 | Schmidt, Mike |
| 0.15458 | Carter, Gary |
| 0.17142 | Molitor, Paul |
| 0.24404 | Jackson, Reggie |
| 0.34928 | Perez, Tony |
| 0.35425 | Morgan, Joe |
| 0.38621 | Smith, Ozzie |
| 0.40061 | Fisk, Carlton |
| 0.5199 | Puckett, Kirby |
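The differentials in both lists are just the regression residuals: actual vote share minus predicted share. Here is a minimal sketch of that calculation and the sort. Only McGwire's actual and predicted shares are both quoted directly in the post; the actual shares for Lynn and Puckett are backed out from the rounded predictions and differentials given in the text, so treat them as approximate. The variable names are hypothetical.

```python
# (name, actual share, predicted share)
players = [
    ("McGwire, Mark", 0.235, 0.508),   # both figures quoted in the post
    ("Lynn, Fred", 0.055, 0.317),      # actual derived: .317 - .262
    ("Puckett, Kirby", 0.821, 0.301),  # actual derived: .301 + .520
]


def differential(actual, predicted):
    """Residual: positive means the voters gave more than the model predicts."""
    return actual - predicted


# Sort from most under-supported to most over-supported.
ranked = sorted(
    (differential(actual, predicted), name)
    for name, actual, predicted in players
)

for diff, name in ranked:
    print(f"{diff:+.3f}  {name}")
```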

So the question is why did these guys get so many more votes than predicted (and why did those other guys get so many fewer)? I will have to think about that some more and maybe I can improve the model if I come up with something.

Brett and Schmidt would both still have made it in on the first ballot. But Puckett was only predicted to get .301. Many of the players here in the top ten also got a lot more than predicted in at least one other model. That model had the following variables:

- HR
- AVG
- SB
- GG
- MVP
- 3000 HIT
- 2B
- SS
- C

The last 3 are positional dummies. But the standard error on this model was .132, much higher than the .104 of the first model.
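For anyone unfamiliar with positional dummies, here is a short sketch of how such variables are typically coded. It assumes each player is assigned a single primary position, which the post does not specify; the function name and the position strings are hypothetical.

```python
# The three positions that get their own dummy in the alternative model.
POSITION_DUMMIES = ("2B", "SS", "C")


def position_dummies(primary_position):
    """Return 0/1 indicators; every other position is the baseline (all zeros)."""
    return {pos: int(primary_position == pos) for pos in POSITION_DUMMIES}


print(position_dummies("SS"))  # shortstop: only the SS dummy is 1
print(position_dummies("RF"))  # any other position: all dummies are 0
```

Coded this way, each dummy's coefficient measures how much more (or less) of the vote a second baseman, shortstop, or catcher gets compared with players at all other positions, holding the other variables fixed.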