Wednesday, January 27, 2010

How Might Integration Have Affected The Lefty Grove/Randy Johnson Debate?


I wrote a post comparing these two along with Sandy Koufax a few days ago in response to a piece by Joe Sheehan. A commenter at Baseball Think Factory mentioned that Lefty Grove did not have to face blacks and Hispanics. So here I try to estimate how his value could be affected by pitching before integration. The method I use is similar to the one I used in this article: How Would Integration Have Affected Ruth and Cobb?

I assume that integration increases the talent pool available to baseball. So the relevant questions seem to be how much better would the hitters have been in Grove's time (AL, 1925-41) if there was no segregation and how many more runs would he have given up by having to face better hitters. But I also tried to take into account how much better the fielders and pitchers would have been. I assumed that the new, non-white talent would be about as good as they are now (relative to whites) and would replace the worst players. Then I tried to calculate how the league OPS and league ERA would change based on how good I assumed the new talent to be. Grove's ERA was also adjusted to account for the better pitchers as well as the better fielders behind him. My rough estimate is that his ERA relative to the league average would fall from 144 (or 44% better than average) to 130. Randy Johnson had 133. I think the difference between the two is still small enough to keep Grove in the debate as the greatest lefty ever.


Since the comparison is to Randy Johnson, I first looked to see what percentage of the players were non-white from 1988-2009 and how well they hit compared to the white players. I did this position by position (I try to explain why at the end in "Technical Notes"). I found the top 100 players in PAs at each position (but used the top 300 for OFers) from 1988-2009 using the Lee Sinins Complete Baseball Encyclopedia. This group of players combined to make up about 72% of all PAs during this time.

If I did not feel sure about a player being white or non-white, I found a picture of his baseball card on eBay. For the most part, anyone with a Hispanic name was considered non-white. There were a few players with Hispanic names before 1947, like Lefty Gomez, but I don't think there were very many. I put someone like Mike Gallego in the white category since he had what is considered an "Anglo" first name and he was born in the U.S.

The table below shows the weighted-average OPS of whites and non-whites along with the percentage of PAs at that position by non-whites from 1988-2009.

I assumed that these would be the percentages in the AL from 1925-41. For example, I assumed that 34.8% of the 1B men would be non-white. So I took all the 1B men who had 100 or more PAs and ranked them from highest to lowest in OPS. Then I removed the bottom 34.8% (approximately) of the PAs. The idea is that teams would get rid of their worst players when adding in the new, better players from the newly used talent pool.

The players who comprised the top 65.2% of PAs were designated as white players and they had an OPS of .926. The remaining 34.8% of the PAs were assigned to players who were assumed to be non-white, and I assumed they would have a collective OPS of .914 (.012 lower than the whites, which is the gap I found in the first table). Combining both groups gives an OPS of .922. Before this adjustment, the OPS of all the 1B men was .857. So with the added talent, the new cumulative OPS for 1B men would be about .065 higher than before.
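The first-base blend can be checked with a few lines of Python. This is only a sketch of the arithmetic described above. The function name blended_ops is mine, and it treats OPS as if it averaged linearly across PAs, which is a simplification. The figures are the ones quoted in the text.

```python
def blended_ops(white_share, white_ops, nonwhite_ops):
    """PA-weighted OPS of the two groups (assumes OPS averages linearly by PA share)."""
    return white_share * white_ops + (1.0 - white_share) * nonwhite_ops

# Figures quoted in the post for AL first basemen, 1925-41
new_1b_ops = blended_ops(0.652, 0.926, 0.914)
gain_1b = new_1b_ops - 0.857  # .857 was the OPS of all 1B men before the adjustment

print(round(new_1b_ops, 3), round(gain_1b, 3))
```

The same function would apply to the other positions, given each position's non-white share and OPS gap.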

Here are all of the increases in OPS at each position

1B 0.065
2B 0.080
SS 0.094
3B 0.034
OF 0.095
C 0.045

The weighted average of all these gains is .076. But since I did not include pitchers, I am just going to assume no change for them. So that would bring the change down to .070 (the weighting I used for pitchers was 8.5% based on Retrosheet data).
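As a quick check of that last step, here is a minimal sketch. The function name dilute_for_pitchers is hypothetical. It simply assumes the 8.5% of PAs taken by pitchers see no OPS change while everyone else gains .076.

```python
def dilute_for_pitchers(position_player_gain, pitcher_pa_share):
    """League-wide OPS gain when pitchers (a share of all PAs) are assumed not to improve."""
    return position_player_gain * (1.0 - pitcher_pa_share)

league_gain = dilute_for_pitchers(0.076, 0.085)
print(round(league_gain, 3))
```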

But before I try to recalculate how many runs Lefty Grove might have given up, I think better fielding has to be taken into account. In my article on Cobb and Ruth linked above, I estimated how much better the fielding would have been since integration began (in a manner similar to how I adjust the hitting stats; the details are explained at the link).

I found that OFers would have been better in putouts per game by about 15%, and SS and 2B men would have been better in assists per game by about 5% (I think those stats made sense for their respective positions). Of course, all fielders cannot raise their putouts and/or assists per game since there are only 27 outs per game. So in the adjustment I made, I instead assume that the number of hits on balls in play drops by a certain percentage.

I went with 7.5%, which is in the range of the numbers I found for the OFers, SS and 2B men. Those infielders, of course, would also have more putouts. But I think their big contribution would be in throwing more batters out at first. There might be some improvement at 3B and 1B, but those are still mostly white positions, so any change will be slight.

So I assumed that there would have been a reduction in hits on balls in play of 7.5%. That means 7.5% fewer singles, doubles, and triples. At-bats would fall 7.5%, too. Once those changes were made, I recalculated the AL SLG and OBP from 1925-41. The new OBP would be .339 (down from .350) and the new SLG would be .386 (down from .404). The new OPS would be .725, down from .750. So a decline of .025.

But the improved hitting increased OPS by about .070. So subtracting the .025 leaves a .045 increase in OPS. How much would Grove's ERA increase if the OPS he allowed went up .045? To estimate that, I ran a regression on all the AL teams in his era with runs per game as the dependent variable and team OPS as the independent variable. Here is the equation:

R/G = 14.23*OPS - 5.62

Since 14.23*.045 = .64, I assumed that every pitcher would see his ERA rise by that much (that may not be the case; some might go up more and some might go up less. I am just not sure how to figure that out, but I also tried raising each pitcher's OPS by the percentage increase, as explained below).
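Here is a small sketch of that step in Python. The slope and intercept are the regression coefficients given above. The function name runs_per_game is mine. Because the equation is linear, the change in runs depends only on the change in OPS, not on the starting level.

```python
SLOPE = 14.23       # regression coefficient on team OPS (from the equation above)
INTERCEPT = -5.62   # regression intercept

def runs_per_game(ops):
    """Predicted runs per game for a given OPS, using the AL 1925-41 regression."""
    return SLOPE * ops + INTERCEPT

net_ops_change = 0.070 - 0.025  # hitting gain minus fielding offset
era_increase = runs_per_game(0.750 + net_ops_change) - runs_per_game(0.750)

print(round(era_increase, 2))
```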

So let's say that Grove's ERA goes up by 0.64. His career ERA is now 3.70 (it was actually 3.06). If the league ERA went up the same amount, it would be 5.08. But that 5.08 is too high because it would be brought down by the fact that the new, non-white pitchers would raise the overall quality of the league pitching. How much might that be?

First, I assumed that about 22% of the innings would have been pitched by non-whites. So the worst 22% of the white pitchers would be removed. How good would the non-white pitchers be? Overall, as a group, about as good as the white pitchers who remained.

I figured that out by finding the top 288 pitchers in IP from 1988-2009. They collectively had about half the major league IP in this time. After separating them into white and non-white groups, I found that both had about the same ERA relative to the league average. The whites were 6% better and the non-whites were 7% better. So that is why I assumed in the previous paragraph that the incoming pitchers would be about as good as those that remained.

I ranked every pitcher in the AL in that era from lowest to highest ERA. Then I kept only the guys who combined to have about 78% of the IP (adding from lowest ERA to highest). This group of pitchers was about 5.7% better than the entire group. So I assumed that adding the non-white pitchers, who would be as good as the remaining pitchers, would lower the league ERA by 5.7%.
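The trimming procedure can be sketched as follows. The five pitchers in the sample are made up purely for illustration (they are not real AL data), and the function names are mine. The idea is to sort by ERA, keep pitchers until they cover about 78% of the innings, and compare the IP-weighted ERA before and after.

```python
def trim_worst_by_ip(pitchers, keep_share):
    """pitchers: list of (name, innings, era). Keep the lowest-ERA pitchers
    until they account for at least keep_share of all innings."""
    total_ip = sum(ip for _, ip, _ in pitchers)
    kept, kept_ip = [], 0.0
    for name, ip, era in sorted(pitchers, key=lambda p: p[2]):
        if kept_ip >= keep_share * total_ip:
            break
        kept.append((name, ip, era))
        kept_ip += ip
    return kept

def ip_weighted_era(pitchers):
    """ERA of the group as a whole (innings-weighted average of individual ERAs)."""
    total_ip = sum(ip for _, ip, _ in pitchers)
    return sum(ip * era for _, ip, era in pitchers) / total_ip

# Hypothetical sample, NOT real data
sample = [("A", 300, 3.00), ("B", 260, 3.80), ("C", 240, 4.40),
          ("D", 110, 5.20), ("E", 90, 6.00)]

kept = trim_worst_by_ip(sample, 0.78)
improvement = ip_weighted_era(sample) / ip_weighted_era(kept) - 1.0
print([name for name, _, _ in kept], round(ip_weighted_era(kept), 2))
```

With real data, improvement would correspond to the 5.7% figure above. The made-up sample gives a much larger number because its ERAs are spread so widely.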

Earlier I said the hitting and fielding would combine to raise the league ERA to 5.08. If it is lowered 5.7% (5.08/1.057), then it would be 4.81. How does this affect Grove? Before any adjustments, his relative ERA or ERA+ was 144 since 3.06/4.42 = .692 and 1/.692 = 1.44 (and then it is multiplied by 100). But we had his ERA rising to 3.70 and the league ERA rising to 4.81. Then 3.70/4.81 = .769 and 1/.769 = 1.30. That would give him a relative ERA of 130.
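The before-and-after relative ERA can be reproduced with a short sketch. The function name era_plus is mine, and it is the simple league ERA over pitcher ERA ratio used in this post, with no park adjustment. I am assuming the 5.7% improvement is applied by dividing the league ERA by 1.057, since that is what gives 4.81.

```python
def era_plus(pitcher_era, league_era):
    """Relative ERA as used in the post: 100 * league ERA / pitcher ERA (no park adjustment)."""
    return 100.0 * league_era / pitcher_era

actual = era_plus(3.06, 4.42)

# Assumption: "lowered 5.7%" means dividing by 1.057
adjusted_league_era = round(5.08 / 1.057, 2)
adjusted = era_plus(3.06 + 0.64, adjusted_league_era)

print(round(actual), adjusted_league_era, round(adjusted))
```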

What did Randy Johnson have? 133. This still puts the two pitchers very close. What I have done is a very rough estimate. I think a more complete and thorough adjustment could leave Grove a little ahead or a little behind. But either way, Grove still deserves consideration as the greatest lefty ever.

If fielding improved by some other percentage, here is what Grove's ERA+ would be:

5% 128
10% 131
15% 133

I also tried to raise Grove's OPS allowed by the same percentage by which the league OPS increased instead of the absolute increase. Since I had the league OPS rising by .045 and it was actually .750, that is a 6% increase. Grove gave up 3.64 runs per 9 IP. To score that many runs per game, a team would have an OPS of .651 (so I assumed that was Grove's OPS allowed). If his OPS allowed went up 6%, it would be .690, or a .039 increase. In that case, his runs allowed would go up by 14.23*.039 = .56. That is a bit lower than the .64 mentioned earlier. In this case, his ERA goes up to 3.62. Then his ERA relative to the league average would be 133 (3.62/4.81 = .753 and 1/.753 = 1.33).
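The percentage version can be sketched the same way. The function name ops_for_runs is mine; it just inverts the regression equation to back out the OPS that would produce a given number of runs. The other numbers come from the text.

```python
SLOPE, INTERCEPT = 14.23, -5.62  # regression coefficients from the post

def ops_for_runs(runs_per_nine):
    """Invert R/G = 14.23*OPS - 5.62 to get the OPS implied by a run rate."""
    return (runs_per_nine - INTERCEPT) / SLOPE

grove_ops = ops_for_runs(3.64)      # Grove allowed 3.64 runs per 9 IP
pct_rise = 0.045 / 0.750            # league OPS rise as a share of .750
ops_rise = grove_ops * pct_rise     # same percentage applied to Grove
era_rise = SLOPE * ops_rise
new_era = 3.06 + era_rise
new_era_plus = 100.0 * 4.81 / new_era

print(round(grove_ops, 3), round(era_rise, 2), round(new_era, 2), round(new_era_plus))
```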

I am not sure if Grove's OPS allowed should be increased by an absolute amount or a percentage. But his adjusted ERA ends up being about the same in each case.

Technical Notes

The reason I adjusted the league OPS by position instead of just looking at all players is that, when I looked at all players together, I ended up eliminating mostly catchers and infielders. Of course, the players eliminated have to be proportionate at each position.

Here is what happened. I initially found that from 1988-2009 the whites and non-whites both had about the same OPS (.771). The non-whites were about 51% of the players in the top 900 in PAs from 1988-2009. So I then found all of the players in the AL from 1925-41 with 100 or more career PAs. Then I ranked them from highest to lowest OPS and dropped enough players from the lower ranks to make up about half the PAs. Then I found that the OPS for the remaining players was about 100 points higher than what it was before. I was about to use that as my increase for the league OPS, but then I noticed that all the players left at the top of the OPS ranking were OFers and 1B men. I was going to end up eliminating a lot more than half of the 2B, SS and catchers. So then I had to break things down by position in both eras.

In the AL from 1925-41, the players at each position who had 100 or more PAs made up about 90% of all the PAs in the era. This is higher than the 72% share I had from the 1988-2009 era. It would have been nice to have a higher share in the latter period, but that would have added a lot of time to figuring out who was white or not.

I also used both leagues from 1988-2009 since Randy Johnson pitched in both while Grove only pitched in the AL.

I also used the improvement in fielding since 1947 that I had calculated a few years ago, not just 1988-2009. This saved a lot of work. It could be that the fielding improvement is greater from 1988-2009 than over the entire period of integration. But I do list how Grove's ERA+ would end up under different fielding scenarios and the differences are not great.

On the fielding adjustments, I know that it should involve more than just putouts by OFers and assists by SS and 2B men, but my guess is that this will cover the bulk of any improvement. Maybe some day I can incorporate DPs, errors, etc.

I will try to post a list of all the players I used from 1988-2009 and whether they were designated as white or non-white. Check back to see when I do that.

Click here to see that list.


bobm said...

Very interesting analysis.

I conclude something slightly different from your analysis.

From your data, the absence of segregation in baseball could arguably have produced an era equivalent to a higher run scoring environment due to the imbalance from added hitting talent outweighing added pitching and fielding talent. So, the league ERA would have increased.

However, the league-wide increases in hitting talent (and fielding talent) should only affect Grove's ERA, but not his ERA+. Otherwise, it's "double counting" the impact of hitting on Grove's ERA+. Only the influx of pitching talent should affect his ERA+, his standing relative to other pitchers. Everyone faces the same distribution of hitters (roughly).

(I can believe that replacing the bottom 20% as you describe would raise the average by about 6%, as it did for qualifying pitchers from 2000-2009.)

Counting only the impact of the pitching talent influx--the 5.7% improvement in league average ERA independent of changes in hitting--Grove's new ERA+ would then be (4.42/1.057) divided by 3.06 = 137. This is still above Randy Johnson's ERA+.

As an aside, if you look at (actual, gradual) integration in time series, the opposite seems to have occurred. Run scoring in baseball was high but flat and then decreasing through Grove's career. I find it interesting that, if you take out the WWII years, I think the run scoring trend is generally down, not up, for the next 20 years.

Cyril Morong said...


Thanks for dropping by and commenting.

I disagree on the double counting issue. You would be right if each pitcher's ERA went up by the same %. Certainly Grove's differential is not going to change. But if each pitcher goes up by the same absolute amount, then Grove's relative ERA falls since 3/4.5 is less than 3.5/5.

To get around this issue, I tried increasing his OPS by both the same absolute amount as I thought the league average would rise as well as the same percentage increase in OPS. My result was similar in both cases. Of course, that all assumed a linear relationship between OPS and runs, which may not be true. But my guess is that even if the relationship is non-linear, it is not very much so and the result would not be affected very much.

Think about a league where Grove pitches a shutout every game and the league ERA is 1.5. His relative ERA is 0 (or infinite if you take the reciprocal). Now we add in a bunch of new, good hitters. If each pitcher's ERA goes up by .5, then Grove's relative ERA has to fall.

I wish I knew why run scoring fell in the beginning years of integration. It seems like it should have been the opposite since the non-whites have played a bigger role as hitters than pitchers. There might be some other factors at work.


bobm said...


I see your point about adjusting ERA, and ERA+, for the change in the run scoring environment. BB-reference can adjust a player's stats for different environments, assuming that each pitcher faces a constant number of batters (before fatigue sets in or pitch counts come into play).

I think ERA+ understates how much more dominant Grove was in his time than Johnson was. The ERA+ calculation is apparently sensitive enough for even a small rise in league scoring to wipe out Grove's advantage in ERA+.

That said, it's hard to argue that Grove's dominance over the AL, as I have looked at in terms of standard deviation in ERA, would have deteriorated to the level which Johnson had over his leagues.

Also, in 17 seasons, Grove led the 8-team league in ERA 9 times, and was top 5 3 other times. In 22 seasons, Johnson led 14/16 team leagues 4 times, and was top 5 4 other times. Maybe Grove's (and Ruth's) apparent dominance has as much to do with league size as with segregation.

Both Johnson and Grove pitched in high run-scoring environments. Is it easier to dominate in high scoring environments than low? (Maybe I'll look at the standard score for the ERA league leader and regress against changes with league R/G and number of teams.)

Cyril Morong said...

I found that Grove had 12 top 5 finishes in SO/BB while Johnson had 11. Grove beats him in top 10 finishes 13-11. I tend to agree that Grove was more dominant and still would be after adjustments.

And here is something I posted to the SABR List in early 2006:

"I thought it would be interesting to use a point system to see how well pitchers have done in RSAA (runs saved above average and it is park adjusted). A first place finish would be 10 points, second 9, and so on. Ties would split points. A tie for first would get 9.5. Then I called up the annual top tens for the AL, NL and AA using the Lee Sinins Sabermetric Encyclopedia. Each pitcher got his points then a career total was found for each guy. Here are the top 10

Cy Young-134
W. Johnson-111.5
R. Johnson-83

Here is something I did with FIP ERA and standard deviations

The two are close.

Here is something else I did with standard deviations. You just have to look for each guy's name

Cyril Morong said...

here is that last link