Tuesday, June 10, 2008

Have Second Basemen Been Underpaid?

It seems like 2nd basemen get paid less than they should. In trying to explain player salaries when taking their hitting performance, position and free agent/arbitration status into account, I found that being a 2nd baseman had a negative impact. I was doing a study involving salaries looking for something else and I thought it would be a good idea to have dummy variables for the "skilled" positions. What I found is discussed below. If you are interested in this issue, you might want to read a couple of papers by Jahn K. Hakes and Raymond D. Sauer (one was published in the Journal of Economic Perspectives). References to those papers are at the end.

I looked at 5 years: 1985, 1990, 1995, 2000 and 2005. I used regression analysis to predict each player's salary. The data set included all the players with 400+ plate appearances in that year. Here is the basic model or equation:

SAL = Constant + b1*FA + b2*ARB + b3*2B + b4*SS + b5*3B + b6*CF + b7*C + b8*HITS + b9*XB + b10*BB

FA means the player had played long enough to be a free agent or had been granted free agency (in some cases I found that out from Retrosheet). The salary data came from the SABR "Business of Baseball" site. Players with 3 years of service (and some with 2) can be eligible for arbitration. So ARB is for those guys. Both FA and ARB are dummy variables, 1 or 0. The same is true for the "skill" positions. 2B is for second basemen, SS for shortstops and so on.

I broke down hitting performance into three variables: hits (HITS), extra bases (XB) and walks (BB). This measures 3 different types of abilities (as in the work of Hakes and Sauer). XB means 1 for a double, 2 for a triple and 3 for a HR, or all bases over 1 on a hit. So there are three abilities: to get a hit, to hit for power and to get walks.
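A minimal sketch of the XB calculation described above. The function names and the sample batting line are made up for illustration; only the 1/2/3 weights come from the definition in the text.

```python
def extra_bases(doubles, triples, home_runs):
    """XB as defined in the post: 1 per double, 2 per triple, 3 per home run."""
    return 1 * doubles + 2 * triples + 3 * home_runs


def extra_bases_from_totals(total_bases, hits):
    """Equivalent form: all bases over 1 on a hit, i.e. total bases minus hits."""
    return total_bases - hits


# Hypothetical batting line (not a real player): 150 hits, of which
# 30 doubles, 5 triples and 20 home runs, so 95 singles.
singles, doubles, triples, home_runs = 95, 30, 5, 20
hits = singles + doubles + triples + home_runs
total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs

xb = extra_bases(doubles, triples, home_runs)
# Both routes give the same number for the same batting line.
same = xb == extra_bases_from_totals(total_bases, hits)
```

The second form is handy when a data source lists total bases and hits but not the breakdown by type of hit.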

I actually ran two versions of the model. One was a linear regression and the other was non-linear, where I took the natural log of salary (called LOGSAL). The results are summarized in the tables below. The first table shows the linear results and the second one shows the non-linear results (LOGSAL). The values for each variable are the coefficient estimates. * means it was significant at the 10% level, ** the 5% level and *** the 1% level. It is probably not a big surprise that FA, ARB, HITS, XB and BB are all very significant in both the linear and non-linear models. With higher r-squared values and F values, the non-linear model looks like a better fit.
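A rough sketch of how the two versions of the model could be set up, using only the Python standard library. Everything here is an assumption for illustration: the players are synthetic, only three of the regressors (FA, 2B, HITS) are included, and the "true" coefficients used to generate salaries are invented. It is not the data or the estimates behind the tables.

```python
import math
import random


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(X, y, b):
    """Share of the variance in y explained by the fitted values."""
    fitted = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic players (made up). Columns: constant, FA, 2B, HITS.
rng = random.Random(0)
X, sal = [], []
for _ in range(400):
    fa = 1 if rng.random() < 0.5 else 0
    second = 1 if rng.random() < 0.15 else 0
    hits = rng.randint(90, 200)
    X.append([1.0, fa, second, hits])
    # Invented data-generating process on the log scale.
    sal.append(math.exp(12.5 + 1.5 * fa - 0.3 * second
                        + 0.008 * hits + rng.gauss(0, 0.3)))

logsal = [math.log(s) for s in sal]

b_lin = ols(X, sal)      # SAL version: coefficients are in dollars
b_log = ols(X, logsal)   # LOGSAL version: coefficients are in log points
r2_lin = r_squared(X, sal, b_lin)
r2_log = r_squared(X, logsal, b_log)
```

Because the synthetic salaries were generated on a log scale, the LOGSAL fit recovers the invented coefficients here by construction. With real data, a statistics package would also report the standard errors and p-values behind the significance stars.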

Being a FA in 2005 meant about an extra $5.7 million in salary, everything else being equal, in the linear model. It is hard to see an exact value in the non-linear model for being a FA. But I simply changed the 1 to 0 for a few guys to see how their predicted salary would change. If Alex Rodriguez had not been a FA (or eligible for ARB) in 2005, his predicted salary would have been just $1.1 million. The model predicted it would be $12.2 million. So, for him, being a free agent meant an extra $11.1 million. For Brad Ausmus, it meant about $7.6 million (or almost all of his $8.3 million in salary). Since the regression is non-linear, the effect is not the same for everyone.
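The reason the dollar effect differs by player can be sketched in a few lines. In a log-salary model, switching a dummy from 1 to 0 divides the predicted salary by the same factor for everyone, so the dollar gap scales with the predicted salary. The coefficient below is an assumption: it is backed out from the rounded $12.2 million and $1.1 million figures above, not taken from the regression table, and the second predicted salary is hypothetical.

```python
import math

# ASSUMPTION: implied FA coefficient backed out from the rounded figures
# quoted in the post ($12.2M with FA = 1, $1.1M with FA = 0).
# It is not the estimate from the regression table.
implied_b_fa = math.log(12.2 / 1.1)


def fa_premium(predicted_as_fa, b_fa=implied_b_fa):
    """Dollar gap (in $ millions) between the prediction with FA = 1
    and the prediction for the same player with FA switched to 0."""
    predicted_not_fa = predicted_as_fa * math.exp(-b_fa)
    return predicted_as_fa - predicted_not_fa


# Same multiplicative factor, different dollar amounts.
premium_high = fa_premium(12.2)   # player predicted at $12.2M
premium_mid = fa_premium(8.35)    # hypothetical player predicted at $8.35M
```

With this implied coefficient, a player predicted at $12.2 million gets the $11.1 million gap quoted above, while a lower predicted salary gives a smaller dollar gap even though the proportional effect is identical.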

Now what about the 2nd basemen? The coefficient for them is negative in all 5 years in the non-linear regression and negative in 4 of the 5 years in the linear regression. If being a 2nd baseman truly has no effect on salary, so that a negative or positive sign is equally likely in any given year, the chance of getting a negative sign in all 5 years is 1 in 32. The only year it was positive in the linear model was 2000, and it only added about $60,000 in salary. It was only significant in one case, in the non-linear model in 2005. The coefficient there is -.334, so it is hard to see the dollar value of the loss from being a 2nd baseman. The linear model shows it to be $996,000 (although it was not significant, with the p-value being .22). In the non-linear model, I switched the 1 to a 0 for all the 2nd basemen for 2005. For 11 out of 20 of them, being a 2nd baseman meant a drop of more than $1 million in predicted salary (again, like the FA case mentioned above for A-Rod and Ausmus, the effect is not the same for each player in the non-linear model). That is, if those guys had been in LF, RF or at 1B, they would be making about $1 million more. For 6 others, the drop was six figures, but those were guys who had salaries under $1 million anyway, so it was a big portion of their salary.
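Two bits of arithmetic behind this paragraph, as a small sketch. The first assumes the sign of the coefficient is a fair coin flip in each year and that the years are independent, which is a simplification. The second uses the standard reading of a dummy coefficient in a log model: a coefficient b changes the predicted salary by a factor of exp(b). The function name is made up for illustration.

```python
import math

# Chance of 5 negative signs in 5 years if each sign is a 50/50 coin flip
# and the years are independent (both are assumptions).
p_all_negative = 0.5 ** 5            # 1/32

# 2005 coefficient on the 2B dummy in the log-salary model, from the post.
b_2b = -0.334

# Proportional effect on predicted salary of being a 2nd baseman.
pct_effect = math.exp(b_2b) - 1      # roughly -0.28, i.e. about 28% lower


def dollar_loss(predicted_as_2b, b=b_2b):
    """Predicted salary with the 2B dummy set to 0 minus the prediction
    with it set to 1, in the same units as the input."""
    return predicted_as_2b * (math.exp(-b) - 1)


# Predicted salary (as a 2B) above which the loss exceeds $1 million.
threshold_millions = 1 / (math.exp(-b_2b) - 1)
```

Under this reading, the loss tops $1 million only for 2nd basemen whose predicted salary is roughly $2.5 million or more, which is why the dollar drop is six figures for the lower-paid players.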

The story might not be that much different for some of the other skilled positions. The coefficient for CF was negative in all 5 years of the non-linear model. The same was true for 3B, and it was significant in the year 2000. The results for catchers (C) and SS, however, are mixed: sometimes negative, sometimes positive. But it looks like there has been a general tendency to underpay players at the skilled positions. In fact, in the non-linear regression, 20 of the 25 coefficients (5 positions per year for 5 years) are negative.

Other Work
Here are the papers by Hakes and Sauer:

An Economic Evaluation of the Moneyball Hypothesis

The Moneyball Anomaly and Payroll Efficiency: A Further Investigation


Will Dwinnell said...

Do you have other performance metrics, such as mean absolute error, or mean absolute percent error?

Cyril Morong said...


Thanks for dropping by and reading my blog.

When you say mean absolute error, or mean absolute percent error, do you mean if I plug each guy's data into the regression equation, get a fitted or predicted value and then see how much it differs from his actual salary? Then get the average of that for everyone?

Will Dwinnell said...

Yes, although it would be even more interesting to see that sort of performance measure on a holdout data set.

Cyril Morong said...

It might be a good idea to do it on another data set, but it took me a lot of time to get just the 5 years set up. Here are the average absolute prediction errors for 2000 and 2005. First the linear regression and then the log regression.

2000: $2.1 million (linear), $1.2 million (log)
2005: $2.3 million (linear), $2 million (log)
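For reference, a minimal sketch of the two error measures discussed in this thread. The function names and the salary numbers are made up for illustration. For the log regression, the fitted values would need to be converted back to dollars (with exp) before comparing them to actual salaries.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all players."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percent_error(actual, predicted):
    """Average of |actual - predicted| / actual, as a percentage."""
    ratios = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)


# Hypothetical salaries in $ millions (not from the data set).
actual = [1.0, 2.0, 4.0]
predicted = [1.5, 1.0, 5.0]

mae = mean_absolute_error(actual, predicted)
mape = mean_absolute_percent_error(actual, predicted)
```

The percent version weights a $500,000 miss on a $1 million player more heavily than the same miss on a $10 million player, which matters when salaries are as spread out as they are here.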

Will Dwinnell said...

You could test with the data you already have, by holding out, for instance, year 2005, or instead a random sample. Train on one portion of the data, and test on the other.

By testing on the data used for model development, your results are optimistically biased.
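A small sketch of the holdout idea, holding out 2005 and training on the other four years. All of it is assumed for illustration: the rows are synthetic, the model has a single predictor (hits) rather than the full set of variables, and the helper names are made up.

```python
import random


def fit_line(xs, ys):
    """Simple least-squares line: returns (intercept, slope)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def mean_abs_error(rows, intercept, slope):
    """Average absolute gap between actual and fitted log salary."""
    return sum(abs(r["logsal"] - (intercept + slope * r["hits"]))
               for r in rows) / len(rows)


# Synthetic player-seasons (made up), 50 per year.
rng = random.Random(1)
rows = []
for year in (1985, 1990, 1995, 2000, 2005):
    for _ in range(50):
        hits = rng.randint(90, 200)
        logsal = 12.0 + 0.01 * hits + rng.gauss(0, 0.5)
        rows.append({"year": year, "hits": hits, "logsal": logsal})

# Train on every year except 2005, then test on 2005 only.
train = [r for r in rows if r["year"] != 2005]
test = [r for r in rows if r["year"] == 2005]

intercept, slope = fit_line([r["hits"] for r in train],
                            [r["logsal"] for r in train])

in_sample_mae = mean_abs_error(train, intercept, slope)
holdout_mae = mean_abs_error(test, intercept, slope)
```

The holdout error is the one to report, since the 2005 rows played no part in choosing the coefficients. A random sample could be held out instead by shuffling the rows and splitting them.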