I first posted this to the "Beyond the Boxscore" site back in 2006. That site might not exist anymore. So here is the first post I did on this. I did others later on and I will probably add them here soon.
One question that often comes up is "what is the relative value of
on-base percentage (OBP) and slugging percentage (SLG)?" Is OBP 50% more
important than SLG? Or 60%? Or something else? A stat called OPS simply
adds the two, giving them equal weight. But maybe the weight should not
be equal. For example, here is the regression equation of team runs per
game for the years 2001-03:
R/G = 17.11*OBP + 11.13*SLG - 5.66
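A regression like this can be sketched in a few lines. This is not the author's actual code, and the team-season numbers below are made up for illustration; the real analysis used all team seasons from 2001-03.

```python
import numpy as np

# Hypothetical team-season data (the real study used 2001-03 team totals).
obp = np.array([0.345, 0.320, 0.331, 0.358, 0.327, 0.340])
slg = np.array([0.442, 0.401, 0.415, 0.455, 0.398, 0.430])
rpg = np.array([5.10, 4.20, 4.55, 5.45, 4.30, 4.90])  # runs per game

# Design matrix with an intercept column: R/G = b1*OBP + b2*SLG + b0
X = np.column_stack([obp, slg, np.ones_like(obp)])
coefs, *_ = np.linalg.lstsq(X, rpg, rcond=None)
b_obp, b_slg, intercept = coefs

# The relative importance of OBP vs. SLG is the ratio of the coefficients
ratio = b_obp / b_slg
```

With the full 2001-03 data this is where the 17.11 and 11.13 coefficients come from, and the ratio works out to roughly 1.5.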
This makes OBP about 53% more important than SLG, a fairly typical
result. But it is possible that OBP might be more important for certain
positions in the lineup, like the leadoff batter. And for SLG, it might
be more important for the cleanup hitter. To check this out, I ran a
regression in which team runs per game was the dependent variable (DV)
and the OBP and SLG of each lineup slot as the independent variables
(IVs). OBP1 means the OBP of the leadoff batter, SLG3 means the SLG of
the third place hitter, etc. I used data from Retrosheet for the
1989-2002 seasons. Retrosheet shows the stats for each team by lineup
position. Below are the coefficient values for the IVs.
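Structurally, the lineup-slot regression just widens the design matrix from 2 IVs to 18. Here is a sketch of that setup with randomly generated stand-in data (the real IVs came from Retrosheet's by-lineup-position team stats for 1989-2002):

```python
import numpy as np

rng = np.random.default_rng(0)
n_teams = 200  # stand-in for the number of team-seasons in the sample

# Hypothetical stand-ins for the Retrosheet data: one OBP and one SLG
# column per lineup slot, plus runs per game as the dependent variable.
obp = rng.uniform(0.300, 0.370, size=(n_teams, 9))  # OBP1..OBP9
slg = rng.uniform(0.360, 0.470, size=(n_teams, 9))  # SLG1..SLG9
rpg = rng.uniform(3.5, 5.8, size=n_teams)

# 18 IVs plus an intercept column -> 19 columns total
X = np.column_stack([obp, slg, np.ones(n_teams)])
coefs, *_ = np.linalg.lstsq(X, rpg, rcond=None)

obp_coefs = coefs[:9]    # one coefficient per lineup slot's OBP
slg_coefs = coefs[9:18]  # one coefficient per lineup slot's SLG
```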
There is quite a bit of variance. A point (.001) of OBP is worth about .003 runs per game from the leadoff man: a .021 increase in the leadoff OBP would mean about .063 more runs per game, or about 10 over a whole season, which usually means about 1 win. The value of OBP is much less for the number 8 man.
For the leadoff man, OBP is three times as important as SLG. For the
cleanup hitter, they are almost the same. So this analysis shows that
the relative values of OBP and SLG could be different depending on the
lineup position of the batter in question.
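The runs-to-wins arithmetic for the leadoff example works out like this (the 10-runs-per-win conversion is the usual sabermetric rule of thumb):

```python
# A one-point (.001) rise in leadoff OBP is worth about .003 runs per game.
runs_per_point = 0.003
points = 21                          # a .021 OBP increase

per_game = runs_per_point * points   # about .063 runs per game
per_season = per_game * 162          # about 10 runs over a full season
wins = per_season / 10               # ~10 runs per win -> about 1 win
```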
There could be multicollinearity in my analysis, meaning that the
coefficient estimates are not as reliable as they could be because IVs
are highly correlated with each other. I discuss what I did to detect
multicollinearity below. But if this were a problem, I tried a
different, but similar model where the IVs would likely be less
correlated with each other.
Each lineup slot had 3 variables: walk percentage, hit percentage and
extra-base percentage. For walks, hits, and extra-bases, the denominator
was plate appearances (PAs). This is a little different than comparing
OBP and SLG since OBP has PAs as the denominator and SLG has ABs. Also,
by using extra bases, it is a little like isolated power. SLG is not always as good a measure of power because a guy who hits a single drives up his SLG. Isolated power is SLG - AVG, or extra bases divided by ABs.
Of course, here, I am using PAs. H1 is the hit% of the leadoff man, W1
is the walk% of the leadoff man, XB1 is the extra-base% of the leadoff
man, etc. Here are the coefficient estimates:
Again, there are some big differences. The value of a walk to the
leadoff man is twice what it is for the number 6 man. The cleanup hitter
has the highest extra-base value.
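The three rates in the second model are easy to build from counting stats. A small sketch, with made-up numbers for a single lineup slot; note that all three rates use plate appearances as the denominator, as described above:

```python
# Hypothetical season counting stats for one lineup slot.
pa = 700   # plate appearances
h = 170    # hits
bb = 70    # walks
tb = 270   # total bases

walk_pct = bb / pa        # W: walks per PA
hit_pct = h / pa          # H: hits per PA
xb_pct = (tb - h) / pa    # XB: extra bases per PA (total bases minus hits)
```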
I did try some other variables. I added SBs and CS per game to the first model with OBP and SLG. Things were generally fine there, except that in a couple of cases the value of a CS was positive, and in one case the value of a SB was negative. Why some lineup slots would have negative
values for SBs or positive values for CS is not clear. I tried one
regression with just the AL since they have the DH and a regular player
bats ninth. The results seemed about the same. Email me if you want
those.
Multicollinearity. In the first model with OBP and SLG, most of the
correlations between the IVs were under .5. But some were higher and
they were all the OBP and SLG for corresponding lineup positions. The
correlation between OBP1 and SLG1 was .596. Those correlations ranged
from .596 to .739, except for OBP9 and SLG9, which was very high, at
.897. But in the second model, only one correlation between IVs was over
.5 and that was H9 and XB9 at .648. The vast majority of the others
were under .2.
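The pairwise-correlation check is straightforward to run on any IV matrix. A sketch with random stand-in data (with real data, the interesting output is the list of pairs whose correlation exceeds .5):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical IV matrix: rows = team-seasons, columns = the 18 IVs.
X = rng.normal(size=(200, 18))

# Pairwise correlations between the IVs.
corr = np.corrcoef(X, rowvar=False)

# Flag any pair of distinct IVs with |correlation| over .5.
iu = np.triu_indices_from(corr, k=1)
high_pairs = [(i, j, corr[i, j]) for i, j in zip(*iu) if abs(corr[i, j]) > 0.5]
```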
Another way to check for multicollinearity is to run regressions in
which one IV is a function of all of the other IVs. In the first model
with OBP and SLG, the r-squared was generally in the .5-.6 range (that
was 18 regressions). R-squared tells us what percentage of the variation in the DV is explained by the model. There is a stat called
the "variance inflation factor" or VIF. It is 1/(1 - r-squared). So if
r-squared was .5, then 1 - .5 = .5 and 1/.5 = 2. A couple of sources I
looked at suggested that if the VIF is under 10, multicollinearity is
not a problem. Most of these were about 2. One got close to 6 (that was
SLG9). I did come across one source that said there is no rule about the
value of VIF and multicollinearity.
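The auxiliary-regression procedure described above (regress each IV on all the others, then compute 1/(1 - r-squared)) can be sketched like this; the data here is random stand-in data, not the Retrosheet sample:

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress that column on all the other
    columns (plus an intercept) and compute 1 / (1 - r_squared)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))  # hypothetical, nearly uncorrelated IVs
vifs = vif(X)                  # each should be close to 1 here
```

With an r-squared of .5 in the auxiliary regression, this gives a VIF of 2, matching the arithmetic in the text.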
For the second model, I only ran a couple of these regressions where one
IV depended on all the others. The first one was W1 and the r-squared
was only about .2. I tried XB9 (which corresponds a little to SLG9, the
one that was closest to being a problem in the other model) and the
r-squared was only about .4, which would mean a very low VIF of about
1.7.
Also, multicollinearity is supposed to show up as high standard errors on the coefficient estimates, which makes it hard for the estimates to be statistically significant. But that was generally not the case here.
One thing I don't know about is whether there is some kind of joint criterion for the VIF. Maybe when you have a large number of IVs, it only takes a certain number of them having a VIF over 2, or something like that, for there to be a problem.