Cybermetrics: June 2008

Saturday, June 21, 2008

Who Are The Good Leadoff Men?

It seems obvious: hitters who are fast and get on base a lot. You also probably don't want someone who hits a lot of HRs, since you want those guys to bat with runners on. So I tried to devise a stat that would capture this. Here it is:

(2B + 1.25*3B - HR + SB)/outs

In other words, how many times a player gets into scoring position per out. Since triples are worth about 25% more than 2B's according to run expectancy tables, I multiply them by 1.25. By dividing by outs, the ability to get on base is taken into account since if you make an out you don't reach base. Also, outs include caught stealing. By subtracting HRs I am saying that guys who hit a lot of HRs, even though they may have other good leadoff traits, are "penalized" here, since they might be better suited to batting lower in the order. But I also ran the numbers without subtracting HRs (the correlation between the two different formulas for all the players in the study was about .86). The table below shows the top 15 from 2007 among players with 400+ PAs using both methods.
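As a rough sketch, the stat could be computed like this. The post only says that outs include caught stealing, so the `estimate_outs` definition (at-bats minus hits, plus caught stealing) is an assumption, and the stat line in the usage example is made up purely for illustration.

```python
def leadoff_score(doubles, triples, hr, sb, outs, subtract_hr=True):
    """Times reaching scoring position per out: (2B + 1.25*3B - HR + SB) / outs."""
    if outs <= 0:
        raise ValueError("outs must be positive")
    numerator = doubles + 1.25 * triples + sb
    if subtract_hr:
        numerator -= hr  # penalty for hitters better suited to batting lower
    return numerator / outs


def estimate_outs(ab, hits, cs):
    """Assumed definition of outs: AB - H + CS (not confirmed by the post)."""
    return ab - hits + cs


# Hypothetical stat line, for illustration only (not a real player).
outs = estimate_outs(ab=600, hits=180, cs=10)
with_penalty = leadoff_score(30, 8, 10, 40, outs)
without_penalty = leadoff_score(30, 8, 10, 40, outs, subtract_hr=False)
print(round(with_penalty, 3), round(without_penalty, 3))
```

Keeping the HR penalty as a flag makes it easy to produce both versions of the ranking and compare them.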

The players in the top 15 are probably not big surprises. But are they really that great at being leadoff men? Do they increase team runs by batting leadoff compared to anyone else? To try to answer these questions, I turned to some analysis I did on lineups two years ago. You can read those articles here and here. In that research, I studied the impact on team scoring of what each slot in the lineup did. In the latter of those two articles, team runs per game was the dependent variable in a linear regression while walk%, hit%, extra-base%, SB per game and CS per game were the independent variables. The regression found a run value for each event and for each lineup slot.

I plugged in the values for those events for Jose Reyes for the number one slot to see what impact he would have on team runs per game. But I also did the same for Adam Dunn, a player who you probably would not think of as making a good leadoff man. In my rankings above, he is 208th out of 216 players. In fact, I tried both Reyes and Dunn in the leadoff slot and both in the clean up slot. The table below shows their relevant stats and the run values for each lineup slot.

If Reyes bats first, his numbers combine to make 1.326 while if Dunn bats 4th we get 1.453 (the regression had an intercept or constant equal to about -5, so to get a number for team runs per game I would have to plug in numbers for all slots, multiply things out, then subtract 5; the numbers here are just individual contributions). So those two add up to 2.779. But what if Dunn batted first and Reyes batted 4th? Dunn gets 1.521 and Reyes gets 1.306 for a total of 2.827. That is actually better than having Reyes bat first and Dunn 4th. Your team would score .0485 more runs per game or about 7.86 more per season. The reason it happens this way is that Dunn walks more (101 vs. 30) and if you went to one of my links above, you can see that the run value for walks is highest for the leadoff slot.
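The arithmetic behind that comparison can be checked in a few lines. This sketch uses only the rounded three-decimal contributions quoted above and assumes a 162-game season. The post's .0485 and 7.86 presumably come from unrounded values, so the results here differ slightly.

```python
GAMES_PER_SEASON = 162

# Individual contributions quoted in the post (rounded to three decimals).
reyes_first, dunn_fourth = 1.326, 1.453
dunn_first, reyes_fourth = 1.521, 1.306

conventional = reyes_first + dunn_fourth   # Reyes 1st, Dunn 4th
swapped = dunn_first + reyes_fourth        # Dunn 1st, Reyes 4th

gain_per_game = swapped - conventional
gain_per_season = gain_per_game * GAMES_PER_SEASON

print(f"conventional: {conventional:.3f}")
print(f"swapped:      {swapped:.3f}")
print(f"gain/game:    {gain_per_game:.3f}")
print(f"gain/season:  {gain_per_season:.2f}")
```

With the rounded inputs the gain is about .048 runs per game, or a little under 8 runs per season.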

Now if a team really tried this, Dunn might not get walked so much since he won't be as big a threat batting with the bases empty. But if the guys right behind him don't have much power and since he is not fast, they might walk him more. Reyes might not get as many extra base hits since some of his triples and doubles are a result of speed and with runners on base he might have someone clogging the bases. I looked at his career stats on that and the results are mixed. It is also possible that Dunn would not score on hits that would have scored Reyes and since some of Reyes' doubles are a result of speed more than hitting distance, his doubles might drive in fewer runs than Dunn's doubles. Would it make a 7.86 run difference over the course of a season? Maybe, but even if it did, it is still interesting that batting Dunn first and Reyes fourth, instead of vice-versa, does not seem to hurt scoring that much, even though Reyes is rated far better as a leadoff man by my measure (which seems to make some sense).

I also tried using David Pinto's optimal lineup finder, based on my lineup research. I set a lineup with Reyes batting first and Dunn 4th. Then I used Retrosheet data to fill in the rest of the lineup. I used the OBP & SLG of each lineup slot for the NL in 2007. This tool has two methods, each based on my two separate lineup studies that used different years. Having Reyes batting first and Dunn 4th with everyone else being league average for their slot generated 4.93 to 4.94 runs per game. But the tool in each case did find that Reyes should bat leadoff. In one case it had Dunn batting 4th which generated 5.04 runs per game. In the other case it had Dunn 2nd for 4.99 runs per game. In the two cases where I had Dunn first and Reyes 4th, the runs per game were 4.90 and 4.93 (as stated above, the reverse yielded 4.93 and 4.94).

Now that model did not include stealing. But again, even though Reyes batting first and Dunn 4th does better than vice-versa, it is not by much. If stealing were included in Pinto's tool, it would be a bigger difference. But recall that in the model with things broken down by hits, walks and extra bases, Dunn batting first did better.

Then I ran a simulation using the Star Simulator. I plugged in all the numbers for each lineup slot again using Retrosheet data (2007 NL). The simulation had the average team scoring about 754 runs per season, about 2% less than in real life. But it also had about 2% fewer ABs (maybe because it only does offense and does not have extra inning games). Then I put Reyes first and Dunn 4th. The team scored 793.8 runs per season. If it were reversed, it was 791.5. So the difference, although in favor of Reyes batting first, is only 1.8 runs over a season. Having Reyes bat first with an average cleanup hitter, it was 773.36. With Dunn batting leadoff it was 788.27! So having Adam Dunn instead of Jose Reyes as your leadoff hitter would mean about 15 more runs per season.

If you go back to my earlier analysis, from the second table, if we just multiply out the impact of Dunn batting first and Reyes batting first, we get 1.52 for Dunn and 1.33 for Reyes. Over 162 games that difference of about .195 runs per game is about 31.6 runs!

This all seems to be about tradeoffs. Getting on base versus speed and having a high OBP guy bat lead off versus losing his power if he batted in the middle of the lineup. I am looking for a way to incorporate all those factors in to find the optimal leadoff man. So I tried one more thing. I calculated each guy's impact in batting leadoff (like the way I did using the second table). So each player has a leadoff impact. But even if someone gets a high score there, it might not be a good idea to bat them first since you might lose an even better score or impact from another slot they might bat in. So I found each guy's impact in all nine slots. Then that got subtracted from their leadoff impact.

Barry Bonds, for example, had a leadoff impact of 1.77. His impact in the number 2 slot was 1.72. So he is .05 better batting leadoff than 2nd. He was .23 better at number 1 than number 3. I did that all the way down to the number 9 slot. Here are all of Bonds' differences:

0.05
0.23
0.13
0.20
0.32
0.47
0.40
0.59

That adds up to about 2.4. Then I added up all of those differences for each player and ranked them from highest to lowest. Remember that I am taking into account not just how good they would be leading off, but how much better (or worse) they would be than batting elsewhere. Below are the top 15 leadoff men from last year, even taking into account what you would lose by not having them bat elsewhere (based on walks, hits, extra bases, SB and CS):

Barry Bonds
Todd Helton
David Ortiz
Jorge Posada
Jack Cust
Pat Burrell
Magglio Ordonez
Jim Thome
Chipper Jones
Carlos Pena
Albert Pujols
Travis Hafner
Scott Hatteberg
Kevin Youkilis
David Wright
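The score behind this ranking is a player's leadoff impact minus his impact in each of the other eight slots, all added up. Here is a minimal sketch of that sum using Bonds' numbers from above. Only his 1.77 leadoff and 1.72 number 2 impacts are quoted directly, so the other slot impacts in `bonds_impacts` are rebuilt from the listed differences, and the function name is just illustrative.

```python
def leadoff_advantage(impacts_by_slot):
    """Sum of (leadoff impact - impact in slot k) for slots 2 through 9.

    impacts_by_slot: nine values, index 0 being the leadoff slot.
    """
    if len(impacts_by_slot) != 9:
        raise ValueError("expected impacts for all nine lineup slots")
    leadoff = impacts_by_slot[0]
    return sum(leadoff - other for other in impacts_by_slot[1:])


# Bonds' eight differences as listed in the post (slots 2 through 9).
bonds_differences = [0.05, 0.23, 0.13, 0.20, 0.32, 0.47, 0.40, 0.59]
print(round(sum(bonds_differences), 2))

# Slot impacts implied by the 1.77 leadoff value and those differences.
bonds_impacts = [1.77] + [1.77 - d for d in bonds_differences]
print(round(leadoff_advantage(bonds_impacts), 2))
```

Both lines give the same total, the "about 2.4" mentioned earlier. Running the function over every player and sorting from highest to lowest would give a list like the one above.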

Just to be complete, here are the top 15 (based on walks, hits, extra bases, SB and CS) while not adjusting for how well they would hit elsewhere. It is some of the same players as above, but not identical:

Barry Bonds
David Ortiz
Alex Rodriguez
Magglio Ordonez
Chipper Jones
Carlos Pena
Albert Pujols
Matt Holliday
Jorge Posada
David Wright
Prince Fielder
Chase Utley
Jim Thome
Todd Helton
Mark Teixeira

Saturday, June 14, 2008

Chipper Jones And Batting .400

Both Tangotiger and Phil Birnbaum blogged about this after there was an article at Baseball Prospectus. One issue is what is Jones' "true" average. Here is my comment at Phil's site:

In the last 4 months last year, Jones batted .354. Just doing a cursory look at his Retrosheet stats, that comes pretty close to what his best 4 month period (within one season) might have been anytime in his career. But it was over only 353 ABs.

Anyway, combining the last 4 months from last year with this year, he has batted .378 over his last 580 ABs. And his yearly AVGs have been going up lately. Starting with 2004, here are his averages with his age

.248 (32)
.296 (33)
.324 (34)
.337 (35)

This does not seem like a normal aging/performance pattern. How unusual, I don't know. But I just wonder if something is going on or changing with this guy that makes it really hard to know his true ability.

So I decided to look at what .400 hitters were like in the past. There are two tables below, one for the guys from the 1800s and one for the guys since 1900. The tables show their age, the league average, their career average, their career average before the year they hit .400 and their average in the year before they hit .400.

One thing you might notice is that the average age is about 27 for both groups. Jones is 36. Cobb batted .400 when he was 35 but the league average was .285 that year, much higher than the NL average this year of .259. Cobb also had batted .400 twice before, had a much higher career average than Jones and batted a lot higher the year before.

Barnes batted .400 in 1876 but I think that was the year you could bunt a ball that went foul before it got to third or first base and it was still a hit. Dunlap did it in the only year of the Union Association. Notice that Jones' career average and average last year are far below what was normal in the past for .400 hitters. Same for the league average. Those are all just simple averages and in the case of Joe Jackson, he had only 75 ABs in his previous year and 115 ABs previously in his career. Taking him out would not change things much, with the last two columns being .344 and .377.

Besides Cobb, the only other guy to bat .400 since 1900 while being at least 30 years old was Bill Terry. But he did it when the league average was .303. So when it comes to age, league average and previous performance, Jones is not even close to what .400 hitters were in the past. If he does it, it will be amazing. I will wonder how he did it.

Tuesday, June 10, 2008

Have Second Basemen Been Underpaid?

It seems like 2nd basemen get paid less than they should. In trying to explain player salaries when taking their hitting performance, position and free agent/arbitration status into account, I found that being a 2nd baseman had a negative impact. I was doing a study involving salaries looking for something else and I thought it would be a good idea to have dummy variables for the "skilled" positions. What I found is discussed below. If you are interested in this issue, you might want to read a couple of papers by Jahn K. Hakes and Raymond D. Sauer (one was published in the Journal of Economic Perspectives). References to those papers are at the end.

I looked at 5 years, 1985, 1990, 1995, 2000 and 2005. I used regression analysis to predict each player's salary. The data set included all the players with 400+ plate appearances in that year. Here is the basic model or equation

SAL = Constant + b1*FA + b2*ARB + b3*2B + b4*SS + b5*3B + b6*CF + b7*C + b8*HITS + b9*XB + b10*BB

FA means the player had played long enough to be a free agent or had been granted free agency (in some cases I found that out from Retrosheet). The salary data came from the SABR "Business of Baseball" site. Players with 3 years service (and some with 2) can be eligible for arbitration. So ARB is for those guys. Both FA and ARB are dummy variables, 1 or 0. The same is true for the "skill" positions. 2B is for second basemen, SS for shortstops and so on.

I broke down hitting performance into three variables: hits, extra bases (XB) and walks (BB). This measures 3 different types of abilities (as in the work of Hakes and Sauer). XB means 1 for a double, 2 for a triple and 3 for a HR, or all bases over 1 on a hit. So there are three abilities: to get a hit, to hit for power and to get walks.
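To make the XB definition concrete, here is a small sketch. The function names and the season line are hypothetical. The second function uses the equivalent "total bases minus hits" form, which follows from counting every base beyond the first on each hit.

```python
def extra_bases(doubles, triples, home_runs):
    """Bases beyond first on hits: 1 per double, 2 per triple, 3 per HR."""
    return doubles + 2 * triples + 3 * home_runs


def extra_bases_from_totals(total_bases, hits):
    """Equivalent form: total bases minus one base for every hit."""
    return total_bases - hits


# Hypothetical season: 150 hits = 100 singles, 30 doubles, 5 triples, 15 HR
xb = extra_bases(30, 5, 15)
total_bases = 100 + 2 * 30 + 3 * 5 + 4 * 15
assert xb == extra_bases_from_totals(total_bases, 150)
print(xb)
```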

I actually ran two versions of the model. One was a linear regression and the other was non-linear, where I took the natural log of salary (called LOGSAL). The results are summarized in the tables below. The first table shows the linear results and the second one shows the non-linear results (LOGSAL). You can click on the tables to get a bigger version. The values for each variable are the coefficient estimates. * means it was significant at the 10% level, ** the 5% level and *** the 1% level. It is probably not a big surprise that FA, ARB, HITS, XB and BB are all very significant in both the linear and non-linear models. With higher r-squared values and F values, the non-linear model looks like a better fit.

Being a FA in 2005 meant about an extra $5.7 million in salary, everything else being equal, in the linear model. It is hard to see an exact value in the non-linear model for being a FA. But I simply changed the 1 to 0 for a few guys to see how their predicted salary would change. If Alex Rodriguez was not a FA (or eligible for ARB) in 2005, his predicted salary would have been just $1.1 million. The model predicted it would be $12.2 million. So, for him, being a free agent meant an extra $11.1 million. For Brad Ausmus, it meant about $7.6 million (or almost all of his $8.3 million in salary). Since the regression is non-linear, the effect is not the same for everyone.

Now what about the 2nd basemen? The coefficient for them is negative in all years in the non-linear regression and negative in 4 of the 5 years for the linear regression. If being a 2nd baseman truly has no effect on salary, the chance of getting a negative sign in all 5 years is just 1 in 32. The only year it was positive in the linear model was 2000, and it only added about $60,000 in salary. It was only significant in one case, in the non-linear model in 2005. The coefficient is -.334, so it is hard to see the dollar value of the loss to being a 2nd baseman. The linear model shows it to be $996,000 (although it was not significant, with the p-value being .22). In the non-linear model, I switched the 1 to a 0 for all the 2nd basemen for 2005. For 11 out of 20 of them, it meant a drop of more than $1 million (again, like the FA case mentioned above for AROD and Ausmus, the effect is not the same for each player in the non-linear model). That is, if those guys had been in LF, RF or at 1B, they would be making about $1 million more. For 6 others, the drop was six figures but those were guys who had salaries under $1 million anyway, so it was a big portion of their salary.
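Two of the numbers in that paragraph can be worked out directly. The sketch below assumes each year's sign is an independent coin flip when there is no true effect (which may not hold exactly, since some players appear in more than one sample year). It also uses the standard reading of a dummy coefficient in a log-salary regression, where the coefficient acts as a multiplier of exp(coefficient) on predicted salary. The function name is just illustrative.

```python
import math

# Chance that a coefficient with no true effect comes out negative in all
# five years, assuming each sign is an independent 50/50 outcome.
p_all_negative = 0.5 ** 5
print(p_all_negative)  # 0.03125, i.e. 1 in 32


def pct_effect_of_dummy(coef):
    """Proportional salary change implied by a dummy in a log-salary model."""
    return math.exp(coef) - 1.0


# 2005 second-base coefficient from the non-linear (LOGSAL) model.
effect_2b = pct_effect_of_dummy(-0.334)
print(f"{effect_2b:.1%}")
```

Read this way, the 2005 coefficient implies a predicted salary roughly 28% lower for a second baseman, everything else equal. That is why the dollar loss differs from player to player: it scales with what the player would otherwise be predicted to earn.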

The story might not be that much different for some of the other skilled positions. The coefficient for CF was negative in all 5 years of the non-linear model. Same for 3B and it was significant in the year 2000. The results, however, for catchers (C) and SS are mixed. Sometimes negative, sometimes positive. But it looks like there has been a general tendency to underpay players at the skilled positions. In fact, in the non-linear regression, 20 of the 25 coefficients (5 per year for 5 years) are negative.

Other Work
Here are the papers by Hakes and Sauer

An Economic Evaluation of the Moneyball Hypothesis

The Moneyball Anomaly and Payroll Efficiency: A Further Investigation

Saturday, June 7, 2008

Should Sox Manager Guillen Have Been Upset?

After losing a third straight game to the Rays last Sunday, Guillen had a tirade, complaining about how players were not hitting and coming through with runners on base. Especially frustrating was scoring just 4 runs in the 3 losses and leaving 10 runners on base in a 4-3, 10 inning loss. He swore and wanted Sox GM Ken Williams to get some better players. But were the Sox underperforming?

Let's first look at what the Sox were expected to do and what they have been doing. The table below shows the OBP and SLG for the 9 Sox regulars and what they were projected to do in the Bill James Handbook.

Alexei Ramirez was not included because he did not have a projection in the book. A weighted average of the projected OBPs and SLGs is .343 and .460 (guys like Quentin and Swisher were projected for parks other than U.S. Cellular Field, so that is a problem, but I hope not too big). This year, the league OBP and SLG are lower than last year. For OBP, it has fallen from .338 to .330 and for SLG it has fallen from .423 to .402. So if we give the Sox the same declines for their projection, they end up with .335 and .439 for their projections. How many runs per game should that bring them?

Based on regression analysis of all teams from 2001-03,

R/G = 17.11*OBP + 11.13*SLG - 5.66

That predicts that the Sox should score 4.95 runs per game. Right now they are at 4.51. So that is almost half a run less per game than expected, a big disappointment. Their actual OBP and SLG right now are .330 and .416, also below projections. At those numbers, they should be scoring 4.62 runs per game but they actually are scoring 4.51. So that is a little lower than it should be, but nothing major.
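Here is a short sketch of those calculations, using the regression equation and the league figures quoted above. The function names are my own, and the league adjustment is assumed to be a simple additive shift, which matches the ".343 to .335" and ".460 to .439" numbers.

```python
def runs_per_game(obp, slg):
    """Team runs per game from the 2001-03 regression quoted in the post."""
    return 17.11 * obp + 11.13 * slg - 5.66


def adjust_for_league(projected, league_last_year, league_this_year):
    """Shift a projection by the league-wide change (additive adjustment)."""
    return projected + (league_this_year - league_last_year)


proj_obp = adjust_for_league(0.343, 0.338, 0.330)   # about .335
proj_slg = adjust_for_league(0.460, 0.423, 0.402)   # about .439

expected = runs_per_game(proj_obp, proj_slg)
from_actual_stats = runs_per_game(0.330, 0.416)

print(f"from adjusted projections: {expected:.2f}")
print(f"from actual OBP/SLG:       {from_actual_stats:.2f}")
```

With these three-decimal inputs the projection-based figure comes out at about 4.96, a hair above the 4.95 quoted, presumably because of rounding in the inputs. The figure from the actual OBP and SLG rounds to 4.62.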

Before the three game losing streak started in Tampa Bay about a week and a half ago, their OBP and SLG were about 0.324 and 0.416. So they should have been scoring 4.51 R/G while they were actually at 4.44. So the team was scoring about what they should have. So Guillen should not have been upset about not scoring enough runs or leaving runners on base. He could have been upset about players not hitting up to expectations as shown above, but that is it. Also, the game in which they got shut out in Tampa was started by Scott Kazmir, who went 7 IP and is one of the best pitchers in baseball. And Tampa has good pitching in general. Sometimes they will give up just 4 runs in three games. You can't go crazy when that happens.

On the flip side, the Sox have pitched better than expectations. Below are the ERAs for the Sox pitchers this year and their projections. Nick Masset is left out because he did not have projections in the book.

The weighted average of the projected ERAs for these pitchers is 4.45. But the league ERA is .37 lower this year than last year, so we can lower the prediction to 4.08. The Sox actually have a league leading 3.33 ERA, far lower than expected. So they are allowing .75 fewer earned runs than expected (and they were scoring .44 less than expected). So on balance, they are .31 ahead of expectations.

Are they winning the number of games they should this year based on their runs and runs allowed this year? They have scored 275 runs and allowed 226 runs. Using the Bill James "pythagorean projection," that works out to a .597 pct and 35.8 wins. They have won 34. So a bit of a disappointment, but not huge. Last week I showed that the Sox had won about 1 game less than expected based on their OPS differential.
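The Pythagorean calculation is easy to reproduce. This sketch uses the classic exponent of 2. The 60 games played is an assumption, inferred from 35.8 expected wins at a .597 percentage rather than stated in the post.

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent=2.0):
    """Bill James' Pythagorean winning percentage."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


games_played = 60  # assumption: implied by 35.8 expected wins at a .597 pct
pct = pythagorean_pct(275, 226)
expected_wins = pct * games_played

print(f"{pct:.3f}", round(expected_wins, 1))
```

Leaving the exponent as a parameter makes it simple to try one of the later variants of the formula that use a value a little below 2.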

As I stated above, the Sox were expected to be scoring 4.95 runs a game this year. Their expected ERA was 4.08. Add about .4 to that for unearned runs, and you get 4.48 runs allowed per game. Scoring that many runs and allowing that many runs would give them a .549 pct and about 33 wins, 1 less than they actually have. So overall, the Sox are doing about as well as expected when the season started and they are scoring about as many runs as they should based on their OBP and SLG. They are also winning pretty close to the number of games they should based on both OPS differential and runs and runs allowed. So what is there to complain about?
