I have tried to improve the model I used last week. It was an OLS (linear) regression model (next week I will present a logit, or binary, model, which is non-linear and simply looks at whether a player has made the Hall or not; it seems very accurate). This week I have converted some of last week's variables into non-linear variables and I have added two new variables. At the end of this post I have links to some other research and discussion on this issue.
Here is the regression equation:
PCT = -.041 + .054*MVP + .432*3000H + .172*500HR + .004*ASSQ10 + .001*GGSQ7 + .074*500SB + .00001*WSIMPSQ50 + .102*10000PA
I will explain what the variables mean below. The adjusted r-squared was .901 (last week it was .839). So 90.1% of the variation in vote percentage across players is explained by the equation. The standard error was .081, down from .102. There were 181 players, all of those who came up on the ballot for the first time from 1990-2009, except for Pete Rose (see last week). MVP is the number of MVP awards won; 3000H is a dummy variable (1 if a player reached 3,000 hits, 0 otherwise). 500HR is also a dummy variable, as are 500SB and 10000PA (if you made it to 10,000 career plate appearances, you get a 1, 0 otherwise). I used all the voting data from 1990-2009.
What is ASSQ10? It is the square of the number of All-Star games played, with AS games played capped at 10. The assumption here is that being an All-Star has a positive, accelerating effect, but only up to a point, after which more games do not help (I have a graph below to help explain this). GGSQ7 is the same thing for Gold Gloves.
WSIMPSQ50 involves World Series play. First, WSIMP is World Series PAs times OPS. The idea here is that the more you play in the World Series the more votes you would get, but by multiplying it by OPS, it also includes how well you played (or at least how well you hit). This gets capped at 50 and is squared, for the same reason as All-Star games (yes, Reggie Jackson is first here and way ahead of everyone else at 141, with Dave Justice and Lonnie Smith tied for 2nd at 101).
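The equation and the capped-and-squared variables described above can be sketched in a few lines of Python. This is only an illustration, not the code used to fit the model: the function and argument names are made up, the cap of 7 for Gold Gloves is inferred from the name GGSQ7, and the coefficients are the rounded ones printed above, so results may differ slightly from the predictions reported in this post.

```python
def capped_square(value, cap):
    """Square a count after capping it, e.g. 12 All-Star games with cap 10 -> 100."""
    return min(value, cap) ** 2


def predict_pct(mvp, h3000, hr500, as_games, gold_gloves, sb500,
                ws_pa, ws_ops, pa10000):
    """Predicted first-ballot vote share from the rounded coefficients above.

    h3000, hr500, sb500 and pa10000 are 0/1 dummies; the rest are career totals.
    """
    assq10 = capped_square(as_games, 10)
    ggsq7 = capped_square(gold_gloves, 7)          # cap of 7 is an assumption
    wsimpsq50 = capped_square(ws_pa * ws_ops, 50)  # WSIMP = World Series PA * OPS
    return (-0.041 + 0.054 * mvp + 0.432 * h3000 + 0.172 * hr500
            + 0.004 * assq10 + 0.001 * ggsq7 + 0.074 * sb500
            + 0.00001 * wsimpsq50 + 0.102 * pa10000)


# Hypothetical player: 12 All-Star games and nothing else.
# The 11th and 12th games add nothing because of the cap at 10.
print(predict_pct(0, 0, 0, 12, 0, 0, 0, 0.0, 0))
```

Note how the cap works: a player with 10 All-Star games and a player with 15 get the same contribution from ASSQ10.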
All of the variables were significant at the 10% level except for GGSQ7, which came close with a p-value of .13. The other variables all had p-values of under .02 with 6 under .01.
This has been the best linear model I can come up with so far. Many different variables have been tried. 154 of the 181 players were predicted to within 10 percentage points. 126 or 69.6% were within 5 points.
The graph below illustrates what I mentioned above about squaring and capping variables. Notice that the line is increasing exponentially then flatlines.
Below are the players who had the biggest negative prediction differentials. Lynn, for example, was predicted to get 35.8% (.358) of the vote but got only 5.5% (.055).
Actual   Differential   Player
0.055    -0.303         Lynn, Fred
0.235    -0.230         McGwire, Mark
0.017    -0.180         Bell, Buddy
0.083    -0.154         Nettles, Graig
0.053    -0.152         Baines, Harold
0.017    -0.151         Parrish, Lance
0.005    -0.144         Lopes, Davey
0.068    -0.140         Concepcion, Dave
0.019    -0.131         Cey, Ron
0.051    -0.130         Hernandez, Keith
Now the players who had the biggest positive prediction differentials:
Actual   Differential   Player
0.298    0.080          Rice, Jim
0.852    0.095          Molitor, Paul
0.157    0.109          Trammell, Alan
0.965    0.111          Schmidt, Mike
0.775    0.126          Yount, Robin
0.818    0.164          Morgan, Joe
0.500    0.189          Perez, Tony
0.664    0.296          Fisk, Carlton
0.917    0.319          Smith, Ozzie
0.821    0.394          Puckett, Kirby
I also ran a regression which had the following variables. None of them were squared or made non-linear. 2B, SS and C are positional dummies. All of these variables were significant at the 10% level; 9 were significant at the 5% level and 7 at the 1% level. But the adjusted r-squared was .763 and the standard error was .126. So it does not work nearly as well as the model mentioned above.
AVG
MVP
3000 HIT
2B
SS
C
SB
WSIMP
10000PA
HR
Now some links to other research:
Baseball Hall of Fame voting: a test of the customer discrimination hypothesis by Arna Desser, James Monks and Michael Robinson
Social Science Quarterly Sept 1999 v80 i3 p591(13)
Discussion of above article
A neural-net Hall of Fame prediction method
Teaching Statistical Thinking Using the Baseball Hall of Fame by Steven Wang
Who's missing from the hall of fame by JC Bradbury
Modeling Election to the Major League Baseball Hall of Fame through the use of Genetic Algorithms By David Cohen
Sunday, April 26, 2009
Monday, April 20, 2009
What Determines Vote Percentage In The First Year Of Hall Of Fame Eligibility?
I tried several models and maybe I will update this in the next several days, explaining what some of them were and why I am posting this one first. But using OLS regression, here is the equation I came up with:
Pct = -.08 + .00011*SB + .00741*GG + .071*MVP + .032*AS + .512*3000H + .29*500HR
These are all career totals. GG is the number of Gold Gloves won, MVP is the number of MVP awards won, and 3000H is a dummy variable (1 if a player reached 3,000 hits, 0 otherwise). 500HR is also a dummy variable. I used all the voting data from 1990-2009. Any player who had received votes before 1990 was not counted. Pete Rose was not included since leaving him out improved the results and he got nowhere near what the model predicts (usually about 70-80 points lower; in 1992 he got just 9.5% of the vote). So the scandals and controversies probably played a role. Mark McGwire got a lower % than predicted, but it was not anywhere near as bad as it was for Rose.
There were 182 players. The adjusted r-squared was .839 and the standard error was .104 (that is the lowest I got and I looked at many models with lots of different combinations of many variables). Perhaps this kind of fishing expedition, just looking for the most accurate regression, is not legitimate. But maybe it is the only way to figure out what the voters care about. All of the variables were significant at the 5% level (the highest p-value was about .04 and the next highest was .012).
McGwire did have the biggest negative difference between his predicted % and what he actually got. The equation predicts about 50.8% while he only got 23.5%, for a differential of -.273. But Fred Lynn came in .262 below his predicted value of .317. Here are the bottom ten in differential:
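As a rough sketch, the equation above can be turned into a small Python function. The function and argument names are hypothetical, and because the coefficients printed above are rounded, plugging in a real player's totals will not necessarily reproduce the exact predictions reported in this post.

```python
def predict_pct_linear(sb, gg, mvp, all_star, h3000, hr500):
    """Predicted first-year vote share from the rounded OLS coefficients.

    sb, gg, mvp and all_star are career totals; h3000 and hr500 are 0/1 dummies.
    """
    return (-0.08 + 0.00011 * sb + 0.00741 * gg + 0.071 * mvp
            + 0.032 * all_star + 0.512 * h3000 + 0.29 * hr500)


# Hypothetical player: 100 steals, 2 Gold Gloves, 1 MVP, 8 All-Star games,
# no 3,000 hits and no 500 home runs.
print(round(predict_pct_linear(100, 2, 1, 8, 0, 0), 3))
```

Because the model is linear, each extra All-Star game adds the same .032 regardless of how many the player already has.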
-0.27344 McGwire, Mark
-0.26287 Lynn, Fred
-0.19284 Hernandez, Keith
-0.1864 Ripken, Cal
-0.15364 Parrish, Lance
-0.14969 Concepcion, Dave
-0.14829 Murphy, Dale
-0.14131 Cedeno, Cesar
-0.13065 Fernandez, Tony
-0.13031 McGee, Willie
Now the top ten:
0.12798 Brett, George
0.12806 Schmidt, Mike
0.15458 Carter, Gary
0.17142 Molitor, Paul
0.24404 Jackson, Reggie
0.34928 Perez, Tony
0.35425 Morgan, Joe
0.38621 Smith, Ozzie
0.40061 Fisk, Carlton
0.5199 Puckett, Kirby
So the question is why did these guys get so many more votes than predicted (and why did those other guys get so many fewer)? I will have to think about that some more and maybe I can improve the model if I come up with something.
Brett and Schmidt would both still have made it in on the first ballot. But Puckett was only predicted to get .301. Many of the players here in the top ten also got a lot more than predicted in at least one other model. That model had the following variables:
HR
AVG
SB
GG
MVP
3000 HIT
2B
SS
C
The last 3 are positional dummies. But the standard error on this model was .132, much higher than the other model.
Sunday, April 12, 2009
The Incredible Dominance of The 1936-39 Yankees
They won 4 straight World Series. So you might be wondering why I don't write about the 1949-53 Yankees, who won 5 in a row. Those later Yankees seem pretty dominating. But I don't think any team in any 4-year period has done what those earlier Yankees did. The 1936-39 team had winning percentages of .667, .662, .651 and .702. Their only losing month in the 4 years was Sept. 1938 when they went 13-14.
First, they led their league in ERA, HRs, Runs, fewest runs allowed and SLG in all 4 years. And, if I recall correctly, their run differential of 411 in 1939 is the highest of all time (967-556). The following table only amplifies their regular season dominance (the lines are in order of the years starting with 1936).
They spent 530 days in first place. That was about 80% of the time. The Indians were a half game ahead of the Yankees on July 12, 1938. But neither team had played 77 games yet. So no team besides the Yankees was in first place during the 2nd half of any of these 4 seasons. Where I have "Games left to play on clinch date," I simply added wins and losses and then subtracted from 154. I did not try to take ties or actual games left into account. So it is an approximation. But the lowest total was 12. That was the closest "race."
UPDATE: (April 16) I also figured out what the closest any team got to the Yankees was in each year between Sept. 1 and the date they clinched. Here are those games behind numbers in order of the years: 16-9-13-11.5. So between Sept. 1, 1936 and the date they clinched, the fewest games behind for the 2nd place team was 16. And in all 4 years, the closest any team came to them between Sept. 1 and the clinch date was 9 games, in 1937.
Then notice that no team ever finished closer than 9.5 games, with the average games ahead being 14.75. Also notice that going into Sept., the closest anyone got was 11 games. In 1938, they went from being a half game behind on July 12 to being up 14 games by Aug 31. In about 7 weeks, they gained 14.5 games.
You might be wondering if they kept up this dominance in the World Series. After all, they were playing the best the National League could throw at them. The table below summarizes what happened.
Notice that their edges in SLG and OBP are much larger than their edge in AVG (maybe they had a sabermetrician working for them back then). OBP here is calculated using just walks, hits and at-bats. They had 23 more walks while having huge leads in HRs and TB. Their winning pct against the NL champs was .842 while never losing more than 2 games in any one series. Their Pythagorean winning pct was .825 (runs scored squared divided by the sum of runs scored squared + runs allowed squared).
From other research, I have found that winning pct is about 1.21*(OPS differential) + .500. OPS is OBP + SLG. The Yankees had a .760 OPS while their opponents had .590. That would give the Yankees a pct of .706, not as dominant as what actually happened but still pretty impressive considering the competition.
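For anyone who wants to check the Pythagorean calculation, here is a minimal Python sketch of the formula in parentheses above. The function name is made up, and since the World Series run totals are not quoted in the text, the example uses the 1939 regular-season runs mentioned earlier in this post (967 scored, 556 allowed).

```python
def pythagorean_pct(runs_scored, runs_allowed):
    """Runs scored squared divided by (runs scored squared + runs allowed squared)."""
    rs2 = runs_scored ** 2
    ra2 = runs_allowed ** 2
    return rs2 / (rs2 + ra2)


# 1939 regular season: 967 runs scored, 556 allowed.
print(round(pythagorean_pct(967, 556), 3))
```

A team that scores exactly as many runs as it allows comes out at .500, which is a quick sanity check on the formula.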
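The OPS-based estimate is simple enough to verify directly. A small sketch, with a made-up function name and the 1.21 slope taken from the paragraph above:

```python
def ops_win_pct(team_ops, opp_ops, slope=1.21):
    """Estimated winning pct: slope * (OPS differential) + .500."""
    return slope * (team_ops - opp_ops) + 0.500


# Yankees .760 OPS vs. opponents .590 OPS in the four World Series.
print(round(ops_win_pct(0.760, 0.590), 3))  # 0.706
```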
Monday, April 6, 2009
How Many Home Runs Would Ruth Have Hit If Baseball Had Been Integrated In His Era?
(Note: This is a slightly revised version of an article that was published in 2007 in the now defunct print periodical called "The Chicago Sports Weekly." I also had posted something like this at "Beyond the Boxscore" called How Would Integration Have Affected Ruth and Cobb?)
Maybe you have seen the images on TV of fans around the country holding up asterisk signs when Barry Bonds comes to the plate, hinting that his HR record is tainted, due to his alleged steroid use. But others counter that Babe Ruth might deserve an asterisk since he never faced blacks or dark-skinned Hispanics (there were a few players with Hispanic names before 1947 whose skin was generally pretty light).
But this raises the question of how many HRs Ruth would have hit had there not been a color barrier. I know the answer because Clio, the Greek muse of history, whispered it in my ear. You see, my Ph. D. thesis was in the field of economic history and its application of statistics is called “cliometrics.” What I am about to attempt here is something dangerous called a “counterfactual” in this field. So don’t try it at home. Leave it to the trained professionals.
Robert Fogel, economic historian at the University of Chicago, won a Nobel Prize, partly for using counterfactuals. He asked what would have happened if railroads had not been built. What other kind of transportation system (like canals) would have emerged? How would this have affected economic growth? He concluded that GDP in 1890 would have been about 5% lower than it actually was.
Not everyone was thrilled with this approach. The historian Fritz Redlich referred to counterfactuals as figments, probably of an imagination gone wild. So maybe you will think this analysis is a figment of my imagination. So. Maybe you’re a figment of my imagination. In any case, here it is.
First, we need an estimate of how many non-white pitchers there might have been. Since 1947, about 15% of all the IP by pitchers with 1,000+ IP in their careers have been by non-whites. All of the 1,000+ IP pitchers made up about 58% of all the IP since 1947, so it is a good sample. Therefore, I assume that in Ruth’s day 15% of the IP were by non-whites.
How good would those pitchers have been? Good enough to replace some white guys, who would be the worst pitchers in the league. You don’t add Satchel Paige to your team and then get rid of Lefty Grove. You dump Grover Lowdermilk (who really was not a bad pitcher but his name sounds funny, unlike mine). The non-whites with 1,000+ IP since 1947 actually had a collective ERA just about the same as the whites. So pre-1947, you dump the worst 15% of the pitchers by ERA and re-calculate the league HR rate using the remaining pitchers, or the top 85%.
After getting rid of the bottom 15% of the IP in each season from 1920 to 1934 (when Ruth played with the Yankees and had all of his great seasons) in the AL, I recalculated the HRs allowed per IP and found how much lower than the league average the new figures were. The average fall in HRs per IP for the years 1920-1934 was about 5%. That is, the best 85% of the pitchers had a HR per IP rate that was 5% lower than the league average (which includes all pitchers). So if you improve the pitching quality in a way that is consistent with integration, Ruth would hit 5% fewer HRs or hit about 678. Even if we cut him 10%, he still hits 643.
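Here is a rough Python sketch of that procedure: sort the pitchers by ERA, keep the best ones until 85% of the league's innings are accounted for, and compare their HR per IP rate with the league rate. The function name and the sample data are invented for illustration, and prorating the pitcher who straddles the 85% line is my assumption about how to handle the boundary, not necessarily what was done in the original calculation.

```python
def hr_rate_drop(pitchers, keep_share=0.85):
    """Fractional fall in HR per IP when only the best pitchers (by ERA) are kept.

    pitchers: list of (era, ip, hr_allowed) tuples, each with ip > 0.
    keep_share: share of total league IP to keep, starting from the lowest ERA.
    """
    total_ip = sum(ip for _, ip, _ in pitchers)
    total_hr = sum(hr for _, _, hr in pitchers)
    league_rate = total_hr / total_ip

    ip_budget = keep_share * total_ip
    kept_ip = 0.0
    kept_hr = 0.0
    for era, ip, hr in sorted(pitchers, key=lambda p: p[0]):
        if kept_ip >= ip_budget:
            break
        # Prorate the last pitcher so the kept innings hit the budget exactly.
        share = min(1.0, (ip_budget - kept_ip) / ip)
        kept_ip += share * ip
        kept_hr += share * hr

    return 1 - (kept_hr / kept_ip) / league_rate


# Made-up four-pitcher "league": (ERA, IP, HR allowed).
league = [(2.50, 300, 10), (3.50, 300, 15), (4.50, 250, 20), (6.00, 150, 15)]
print(round(hr_rate_drop(league), 3))
```

Running the same function with keep_share set to 0.75 or 0.50 gives the kind of alternative scenarios discussed in the next paragraphs.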
Some things I have not considered: when Aaron and Mays were hitting HRs in the 1950s, there still were not that many non-whites pitching. So their totals might need to be reduced. We also don’t know if all batters would be affected in the same way. The best HR hitters might have had their totals reduced more than the average hitter. Also, we don’t know what percentage of pitchers would have been non-white. Probably it is more than 15% today. Suppose it is 25%. I looked at the 1927 AL and if you only count the best 75% of the pitchers, the HR rate falls about 9%.
If we only looked at the best 50% of the pitchers from 1927, HRs would fall about 18.3%. If that happened to Ruth over his whole career, he still hits 583 HRs. It is about 17% for 1921. For 1934, it would be 20%. Given that I am only counting the best 50% of the pitchers, we can safely say that integration would have reduced his HRs by no more than 20% (the top 50% of pitchers in 2008 gave up about 20% fewer HRs than average as well). So he ends up with 571 HRs. That would have stood as a record for quite a while. And remember that we would have to reduce Aaron and Mays since they played a good part of their careers when there were not as many non-white pitchers as today.
Here is the link that shows the white and non-white pitchers since 1947:
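The adjusted totals quoted in this post all come from applying a percentage cut to Ruth's career total of 714 HRs (that figure is his actual career total; it is not stated explicitly above). A short sketch, with a made-up function name:

```python
RUTH_CAREER_HR = 714  # Ruth's actual career total, supplied here as a known fact


def adjusted_hr(career_hr, pct_reduction):
    """Career HRs after cutting them by the given fraction, rounded to a whole HR."""
    return round(career_hr * (1 - pct_reduction))


# The four scenarios discussed: 5%, 10%, 18.3% and 20% fewer HRs.
for cut in (0.05, 0.10, 0.183, 0.20):
    print(f"{cut:.1%} fewer HRs -> {adjusted_hr(RUTH_CAREER_HR, cut)}")
```

These reproduce the 678, 643, 583 and 571 figures given in the text.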
http://cyrilmorong.com/RuthAsterisk/Pitchers.htm