Sunday, May 3, 2009

Predicting Who Makes The Hall Of Fame Using A Logit Model

The last two weeks I presented regression results on first year Hall of Fame vote percentage. I used a linear regression. This week I use a logit model where 1 means the player has made it in and 0 means not. The probability that a player was voted in is

(1) P = 1/(1 + exp(-Z))

where exp is the exponential function, whose base e (Euler's number) is approximately 2.718. Z is the value of the following equation for each player, and -Z is that value times -1. This is the estimated regression equation:

(2) Z = -46.1955 + 67.81*CAVG + .5658*100RBI + 1.386*ALLSTAR + .0013*PA + .3645*MVP + .0001*WSIMP + 14.93*3000HIT + 3.586*C

CAVG is a player's career batting average, 100RBI is the number of seasons with 100+ RBIs, ALLSTAR is the number of all-star games played in, PA is career plate appearances, MVP is the number of MVP awards won, WSIMP is world series PAs times world series OPS (this is world series impact, a combination of quantity and quality), 3000HIT is a dummy variable (1 or 0) for reaching that milestone and C is a similar dummy variable for whether the player was a catcher. So all the data gets plugged in for each player and "Z" is calculated. Then the negative of that is plugged into equation (1) to get the probability of each player being elected into the Hall.
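For anyone who wants to try the calculation themselves, here is a minimal Python sketch of equations (1) and (2). The coefficients are copied from equation (2); the function names and the sample player are made up for illustration and do not come from my spreadsheet or from any real player's record.

```python
import math

# Coefficients copied from equation (2) in the post.
INTERCEPT = -46.1955
COEFFS = {
    "CAVG": 67.81,     # career batting average
    "100RBI": 0.5658,  # seasons with 100+ RBIs
    "ALLSTAR": 1.386,  # all-star games played in
    "PA": 0.0013,      # career plate appearances
    "MVP": 0.3645,     # MVP awards won
    "WSIMP": 0.0001,   # world series PAs times world series OPS
    "3000HIT": 14.93,  # dummy: 1 if the player reached 3000 hits
    "C": 3.586,        # dummy: 1 if the player was a catcher
}


def z_value(player):
    """Equation (2): the intercept plus each coefficient times the player's value."""
    return INTERCEPT + sum(COEFFS[name] * player[name] for name in COEFFS)


def hall_probability(player):
    """Equation (1): P = 1 / (1 + exp(-Z))."""
    return 1.0 / (1.0 + math.exp(-z_value(player)))


# Hypothetical player, invented purely to show the arithmetic.
example = {
    "CAVG": 0.300, "100RBI": 4, "ALLSTAR": 8, "PA": 9000,
    "MVP": 1, "WSIMP": 50, "3000HIT": 0, "C": 0,
}

print(f"Z = {z_value(example):.4f}")
print(f"P = {hall_probability(example):.3f}")
```

A Z of exactly 0 gives P = .5, so positive Z values are predicted "in" and negative values are predicted "out."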

The statistical results can be viewed at logit results. The data for each player and their calculated probability can be seen at logit probabilities.

The results page shows something called a "classification table." It says that if a probability is .5 or greater, the player will be in the Hall. My data includes all players whose first year of eligibility was 1990 or later (except for Pete Rose). The classification table says that all 20 players who actually have made it in should be (all 20 have a probability of .5 or higher). Of the other 161 players it predicts that 159 would not make it. The two who are predicted to make it are Steve Garvey and Andre Dawson. Garvey's 15 years of eligibility to be voted in by the writers has ended. But Dawson could still make it. Garvey had a probability of 95.7% according to the model. Dawson had 60.8%. The model is 98.9% correct.
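The 98.9% figure can be checked by hand from the counts in the classification table. This short sketch just redoes that arithmetic; the variable names are mine and are not taken from the statistical software's output.

```python
# Counts taken from the classification table described above.
predicted_in_and_in = 20     # actual Hall of Famers with P >= .5
predicted_out_but_in = 0     # actual Hall of Famers with P < .5
predicted_out_and_out = 159  # non-members with P < .5
predicted_in_but_out = 2     # non-members with P >= .5 (Garvey and Dawson)

total_players = (predicted_in_and_in + predicted_out_but_in
                 + predicted_out_and_out + predicted_in_but_out)
correct = predicted_in_and_in + predicted_out_and_out
accuracy = correct / total_players

print(f"{correct} of {total_players} correct = {accuracy:.1%}")
```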

I tried lots of variables and this model works the best in terms of getting all the actual Hall of Famers right and the overall correct%. Also, some models might have done a bit better but a variable would be negative (like career HRs) that should not be. Some models had many more variables. My Acastat programs sometimes could not complete a regression no matter how long I waited. That might have been if I had more than 12 variables. So another model might be a bit better. I just don't know.

Now it is hard to see what exactly the impact of each variable is since this is a non-linear model. So I will just mention a few players and how their probability (P) would change if their data changed.

Yount-If he does not have 3000 hits, his P falls to about 1% from 99%! Molitor would fall from 99% to 45%.

Ozzie Smith-If he falls from 14 all-star games to 10, his P falls from 99% to 36%

Fisk-If he is not a catcher, his P falls from 96% to 46%. Gary Carter falls from 65% to 36%.

Carew-If he does not have 3000 hits, his P falls only about 1% from 100% to 99%! (Boggs, Gwynn and Brett have the same thing)

Eddie Murray-If he does not have 3000 hits, his P falls to about 83% from 99%.

Puckett-If his CAVG falls from .318 to .301 his P falls from 75% to 49% (a NO). If you don't change his CAVG but take his All-Star games from 10 to 9, his P falls to 43%. Incidentally, he finished his career with 280 Win Shares and if he could have played 5 more years he could have easily gotten to 360 Win Shares, what Bill James says is almost a lock for the Hall (but Tim Raines has 392 and might not make it-if he went from 7 to 10 all-star games his P would be about 75%).

Tony Perez-If you take his All-Star games from 7 to 6, his P falls to 29% from 62%.

McGwire-Maybe his not getting in is why 500 HRs was not working very well in the model. The only others were Murray, Jackson and Schmidt. But if he had 9200 PAs, his P value jumps to .5. Of course, he has the steroid scandal. It would be nice to quantify scandals, but that may be impossible and I have already taken Rose out of the model. Back to McGwire, if he had 11 all-star games instead of 9, his P would go up to 69%.

Ryne Sandberg-Take away his MVP award and his P falls from 63% to 54% (do that and take away 1 all-star game, he falls to 23%). For some other guys, it makes almost no difference. Take all 3 of Schmidt's awards away and he still has a 99% P. Take away 1 of Morgan's awards and he falls from 66% to 58%. Take away the other and he falls to 48%.

Al Oliver would go from a P of 10% to over 50% if he had 6 100 RBI seasons instead of 2. Same for Ted Simmons if he jumped from 3 to 7 100 RBI seasons. Same for Harold Baines.

So all-star games and 3000 hits matter a lot (and 100 RBI seasons, too). If Dave Parker had 8 all-star games instead of 6, he goes from a P of 8% to over 60%. Harold Baines would have a P of 99% if he had 3000 hits. The 3 strike-shortened seasons of 1981, 1994 and 1995 might have cost him 3000 hits. Will Clark and Keith Hernandez would have P's of 99% if they made 3000 hits.
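One way to see why the same change moves different players by such different amounts: in a logit model a change in a variable adds its coefficient to Z, which multiplies the player's odds P/(1-P) by exp(coefficient). The sketch below, with a helper function name I made up, applies that to the ALLSTAR coefficient from equation (2). It starts from the rounded probabilities quoted above for Puckett and Perez, so it only approximately matches the spreadsheet.

```python
import math


def shifted_probability(p, coefficient, delta):
    """Return the new probability after a variable changes by `delta`.

    Converts p to log-odds (which is Z), adds coefficient * delta,
    and converts back with equation (1).
    """
    log_odds = math.log(p / (1.0 - p))
    new_log_odds = log_odds + coefficient * delta
    return 1.0 / (1.0 + math.exp(-new_log_odds))


ALLSTAR_COEF = 1.386  # from equation (2)

# Each extra all-star game multiplies the odds by roughly 4.
odds_ratio = math.exp(ALLSTAR_COEF)

# Rounded starting probabilities quoted in the post.
puckett = shifted_probability(0.75, ALLSTAR_COEF, -1)  # 10 games -> 9
perez = shifted_probability(0.62, ALLSTAR_COEF, -1)    # 7 games -> 6

print(f"odds ratio per all-star game: {odds_ratio:.2f}")
print(f"Puckett with one fewer game: {puckett:.0%}")
print(f"Perez with one fewer game: {perez:.0%}")
```

A player who is already near 100% or near 0% barely moves, since multiplying very large or very small odds by 4 leaves P almost unchanged. Players near 50% move the most.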

I can email my spread sheet to anyone if you want to play around with these kinds of possibilities.

3 comments:

Matt Mitchell said...

I was thinking of running the same kind of model, but just haven't had the time to build it. How did you do your variable selection?

Cyril Morong said...

Matt

Thanks for dropping by and reading my blog. I started with things I thought would matter and, as I mentioned in the first post on this issue, I have been trying lots of combinations of variables to see which one works the best (in terms of r-squared, standard error, correct predictions). I know it is fishing but I was just looking to try to find out what matters to the voters.

Some of the variables that have been put in are leading the league in HRs and AVG, seasons with a .300 AVG, Gold Glove awards, having reached 500 HRs, 10,000 PAs, career PAs, career HRs and career hits (not including HRs), career RBIs, dummy variables for positions, interaction variables between the positional dummies and other variables (this checks to see if the slope for something like career AVG matters more or less depending upon your position), SBs, a dummy for 500 SB. I think I tried MVP award shares instead of total MVP awards won.

Sometimes you really can't include certain variables together because of collinearity. Like AVG and seasons with .300 AVG or HRs and RBIs. Sometimes I would add in a variable that would make sense but the coefficient would be negative even though it should not be (in a few cases gold gloves were negative, maybe because they were correlated with all-star games played in). I think also, that in general, using all-star games is better than dummy variables for positions (like SS or 2B). But having one for catcher seems to make things better. I am thinking of putting Rose back in the model and using a dummy variable for scandal. I guess only Rose and McGwire would get it.

Any thoughts on why the voters did not elect Garvey? My model gives him a very high probability and he has all kinds of accomplishments that could go on his plaque like seasons with a .300 AVG, 100 RBIs, Gold Gloves, MVP award, great post season and all-star game stats. Someone said that maybe being a low power 1B man hurts him. But he did very well in his career in MVP award shares.

Matt Mitchell said...

My joking thought on Garvey is that he didn't have the arguments over him like Jim Rice did.

My serious thought relates to a combination of who was on the ballot and artificial voting limits. I'm sure there are members of the BBWAA that will only vote for a certain number of players that is below the actual limit. Looking at Garvey's votes, he was always placing 5th or 6th in the total votes up to 1999, and ironically he was consistently getting more votes than the aforementioned Mr. Rice up to that point too. My guess is that those writers who stubbornly voted for up to a maximum of 5 players would have left Garvey off. Of course, this is a hard point to prove, as not all writers make their voting methodologies known.