Over the last two weeks I presented regression results on first-year Hall of Fame vote percentage, using a linear regression. This week I use a logit model where 1 means the player has made it in and 0 means he has not. The probability that a player was voted in is
(1) P = 1/(1 + exp(-Z))
where exp is the exponential function, whose base is Euler's number (approximately 2.718). Z is the value of the following equation for each player, and -Z is that value times -1. This is the estimated regression equation:
(2) Z = -46.1955 + 67.81*CAVG + .5658*100RBI + 1.386*ALLSTAR + .0013*PA + .3645*MVP + .0001*WSIMP + 14.93*3000HIT + 3.586*C
CAVG is a player's career batting average, 100RBI is the number of seasons with 100+ RBIs, ALLSTAR is the number of all-star games played in, PA is career plate appearances, MVP is the number of MVP awards won, WSIMP is world series PAs times world series OPS (this is world series impact, a combination of quantity and quality), 3000HIT is a dummy variable (1 or 0) for reaching that milestone and C is a similar dummy variable (1 or 0) for whether the player was a catcher. So all the data gets plugged in for each player and "Z" is calculated. Then the negative of that is plugged into equation (1) to get the probability of each player being elected into the Hall.
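Here is a minimal Python sketch of equations (1) and (2), in case it helps to see the arithmetic spelled out. The coefficients are the ones given above. The function names and the example player are my own illustration (the stat line is made up and does not belong to any real player).

```python
import math

# Coefficients from equation (2)
INTERCEPT = -46.1955
COEF = {
    "CAVG": 67.81,     # career batting average
    "100RBI": 0.5658,  # seasons with 100+ RBIs
    "ALLSTAR": 1.386,  # all-star games played in
    "PA": 0.0013,      # career plate appearances
    "MVP": 0.3645,     # MVP awards won
    "WSIMP": 0.0001,   # world series PAs times world series OPS
    "3000HIT": 14.93,  # 1 if the player reached 3000 hits, else 0
    "C": 3.586,        # 1 if the player was a catcher, else 0
}

def z_score(player):
    """Equation (2): the intercept plus each coefficient times the player's value."""
    return INTERCEPT + sum(COEF[name] * player[name] for name in COEF)

def hof_probability(player):
    """Equation (1): P = 1/(1 + exp(-Z))."""
    return 1.0 / (1.0 + math.exp(-z_score(player)))

# Hypothetical stat line, invented purely for illustration
example_player = {
    "CAVG": 0.300, "100RBI": 3, "ALLSTAR": 8, "PA": 9000,
    "MVP": 1, "WSIMP": 0, "3000HIT": 0, "C": 0,
}

print(round(z_score(example_player), 4))
print(round(hof_probability(example_player), 3))
```

Setting "3000HIT" to 1 for the same made-up player adds 14.93 to Z, which pushes the probability to nearly 1.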
The statistical results can be viewed at logit results. The data for each player and their calculated probability can be seen at logit probabilities.
The results page shows something called a "classification table." It treats a probability of .5 or greater as a prediction that the player will be in the Hall. My data includes all players whose first year of eligibility was 1990 or later (except for Pete Rose). The classification table gets all 20 players who actually have made it in right (all 20 have a probability of .5 or higher). Of the other 161 players it predicts that 159 would not make it. The two who are predicted to make it are Steve Garvey and Andre Dawson. Garvey's 15 years of eligibility to be voted in by the writers have ended. But Dawson could still make it. Garvey had a probability of 95.7% according to the model. Dawson had 60.8%. Overall the model classifies 179 of the 181 players correctly, or 98.9%.
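The 98.9% figure follows directly from the counts in the classification table. A short Python check, using only the numbers reported above (the variable names are mine):

```python
# Counts from the classification table, using a cutoff of P >= .5
in_hall_predicted_in = 20     # actual Hall of Famers the model gets right
in_hall_predicted_out = 0     # actual Hall of Famers the model misses
not_in_predicted_out = 159    # non-members the model gets right
not_in_predicted_in = 2       # Garvey and Dawson

total = (in_hall_predicted_in + in_hall_predicted_out
         + not_in_predicted_out + not_in_predicted_in)
correct = in_hall_predicted_in + not_in_predicted_out
accuracy = correct / total

print(total, correct, round(100 * accuracy, 1))
```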
I tried lots of variables and this model works the best in terms of getting all the actual Hall of Famers right and the overall percentage correct. Also, some models might have done a bit better, but a variable that should not be negative (like career HRs) would come out negative. Some models had many more variables. My Acastat programs sometimes could not complete a regression no matter how long I waited. That may have happened when I had more than 12 variables. So another model might be a bit better. I just don't know.
Now it is hard to see what exactly the impact of each variable is since this is a non-linear model. So I will just mention a few players and how their probability (P) would change if their data changed.
Yount-If he does not have 3000 hits, his P falls to about 1% from 99%! Molitor would fall from 99% to 45%.
Ozzie Smith-If he falls from 14 all-star games to 10, his P falls from 99% to 36%.
Fisk-If he is not a catcher, his P falls from 96% to 46%. Gary Carter falls from 65% to 36%.
Carew-If he does not have 3000 hits, his P falls only about 1% from 100% to 99%! (Boggs, Gwynn and Brett have the same thing)
Eddie Murray-If he does not have 3000 hits, his P falls to about 83% from 99%.
Puckett-If his CAVG falls from .318 to .301 his P falls from 75% to 49% (a NO). If you don't change his CAVG but take his All-Star games from 10 to 9, his P falls to 43%. Incidentally, he finished his career with 280 Win Shares and if he could have played 5 more years he could have easily gotten to 360 Win Shares, what Bill James says is almost a lock for the Hall (but Tim Raines has 392 and might not make it-if he went from 7 to 10 all-star games his P would be about 75%).
Tony Perez-If you take his All-Star games from 7 to 6, his P falls to 29% from 62%.
McGwire-Maybe his not getting in is why 500 HRs was not working very well in the model. The only others were Murray, Jackson and Schmidt. But if he had 9200 PAs, his P jumps to .5. Of course, he has the steroid scandal. It would be nice to quantify scandals, but that may be impossible and I have already taken Rose out of the model. Back to McGwire, if he had 11 all-star games instead of 9, his P would go up to 69%.
Ryne Sandberg-Take away his MVP award and his P falls from 63% to 54% (do that and take away 1 all-star game, he falls to 23%). For some other guys, it makes almost no difference. Take all 3 of Schmidt's awards away and he still has a 99% P. Take away 1 of Morgan's awards and he falls from 66% to 58%. Take away the other and he falls to 48%.
Al Oliver would go from a P of 10% to over 50% if he had six 100-RBI seasons instead of two. Same for Ted Simmons if he jumped from three to seven 100-RBI seasons. Same for Harold Baines.
So all-star games and 3000 hits matter a lot (and 100 RBI seasons, too). If Dave Parker had 8 all-star games instead of 6, he goes from a P of 8% to over 60%. Harold Baines would have a P of 99% if he had 3000 hits. The three strike-shortened seasons of 1981, 1994 and 1995 might have cost him 3000 hits. Will Clark and Keith Hernandez would have P's of 99% if they had 3000 hits.
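Several of the what-if numbers above can be checked without the full data set. Since Z is a straight sum, changing one variable just adds (coefficient times change) to Z, and Z can be recovered from a player's current P as ln(P/(1-P)). The Python sketch below does that. The function name is my own, and because the starting P's quoted here are rounded, the results only match to about a percentage point (and it is unreliable for players whose P rounds to 99% or 100%).

```python
import math

def what_if(p, coefficient, change):
    """New probability after one variable moves by `change`.

    p must be strictly between 0 and 1. Recover Z from p, shift it by
    coefficient * change, then apply equation (1) again.
    """
    z = math.log(p / (1.0 - p))      # invert equation (1)
    z_new = z + coefficient * change
    return 1.0 / (1.0 + math.exp(-z_new))

ALLSTAR = 1.386   # coefficient on all-star games from equation (2)
MVP = 0.3645      # coefficient on MVP awards from equation (2)

# Tony Perez: 62% with 7 all-star games, one fewer game
print(round(100 * what_if(0.62, ALLSTAR, -1)))   # about 29
# Puckett: 75% with 10 all-star games, one fewer game
print(round(100 * what_if(0.75, ALLSTAR, -1)))   # about 43
# Sandberg: 63% with his MVP award, take it away
print(round(100 * what_if(0.63, MVP, -1)))       # about 54
```

The same shift in Z moves P a lot when P is near 50% and hardly at all when P is near 0% or 100%, which is why one all-star game matters so much for Perez and losing three MVP awards barely touches Schmidt.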
I can email my spread sheet to anyone if you want to play around with these kinds of possibilities.