Solved – Measuring Regression to the Mean in Hitting Home Runs

Anyone who follows baseball has likely heard about the out-of-nowhere, MVP-type performance of Toronto's Jose Bautista. In the previous four seasons he hit roughly 15 home runs per season. Last year he hit 54, a number surpassed by only 12 players in baseball history.

In 2010 he was paid $2.4 million, and he's asking the team for $10.5 million for 2011. They're offering $7.6 million. If he can repeat that performance in 2011, he'll easily be worth either amount. But what are the odds of him repeating? How hard can we expect him to regress to the mean? How much of his performance can we attribute to chance? What would his 2010 totals be after adjusting for regression to the mean? How do I work it out?

I've been playing around with the Lahman Baseball Database and squeezed out a query that returns home run totals over the previous five seasons for all players who had at least 50 at-bats in each of those seasons.

The table looks like this (notice Jose Bautista in row 10):

     first     last hr_2006 hr_2007 hr_2008 hr_2009 hr_2010
1    Bobby    Abreu      15      16      20      15      20
2   Garret Anderson      17      16      15      13       2
3  Bronson   Arroyo       2       1       1       0       1
4  Garrett   Atkins      29      25      21       9       1
5     Brad   Ausmus       2       3       3       1       0
6     Jeff    Baker       5       4      12       4       4
7      Rod  Barajas      11       4      11      19      17
8     Josh     Bard       9       5       1       6       3
9    Jason Bartlett       2       5       1      14       4
10    Jose Bautista      16      15      15      13      54

and the full result (232 rows) is available here.

I really don't know where to start. Can anyone point me in the right direction? Some relevant theory and R commands would be especially helpful.

Thanks kindly

Tommy

Note: The example is a little contrived. Home runs definitely aren't the best indicator of a player's worth, and home run totals don't account for the varying number of chances a batter gets to hit home runs each season (plate appearances). Nor do they reflect that some players play in more favourable stadiums, or that the league-average home run total changes from year to year. And so on. If I can grasp the theory behind accounting for regression to the mean, I can apply it to more suitable measures than HRs.

Best Answer

I think there's definitely a Bayesian shrinkage or prior correction that could help with prediction, but you might also want to consider another tack...
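As a rough illustration of what shrinkage toward the mean looks like, here is a minimal Python sketch of the classical regression-to-the-mean estimate (not a full Bayesian model). It assumes that the correlation between two adjacent earlier seasons (2008 and 2009) is a fair stand-in for how much of one season's total carries over to the next, and it uses only the ten rows shown in the question, so the numbers are illustrative only. The function names are made up for this example.

```python
from math import sqrt

# Home run totals copied from the ten rows shown in the question.
# In practice the full 232-row result would be used instead.
hr_2008 = [20, 15, 1, 21, 3, 12, 11, 1, 1, 15]
hr_2009 = [15, 13, 0, 9, 1, 4, 19, 6, 14, 13]
hr_2010 = [20, 2, 1, 1, 0, 4, 17, 3, 4, 54]


def mean(xs):
    return sum(xs) / len(xs)


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def regress_to_mean(observed, group_mean, r):
    """Shrink an observed value toward the group mean by the factor r.

    Assumption: the spread of totals is about the same in both seasons.
    """
    return group_mean + r * (observed - group_mean)


# Year-to-year correlation, estimated from two earlier seasons.
r = pearson_r(hr_2008, hr_2009)

# Shrink Bautista's 54 toward the 2010 mean of this (tiny) sample.
mean_2010 = mean(hr_2010)
bautista_2011 = regress_to_mean(54, mean_2010, r)

print(f"year-to-year correlation r = {r:.2f}")
print(f"sample mean for 2010       = {mean_2010:.1f}")
print(f"shrunken estimate for 2011 = {bautista_2011:.1f}")
```

The closer r is to 1, the less the estimate moves toward the mean; with r near 0 it collapses to the group average. A Bayesian treatment would replace the single factor r with a prior on each player's underlying home run rate.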

Look up players throughout history, not just the last few years, who've had breakout seasons after a couple of years in the majors (dramatic increases, perhaps 2x) and see how they did in the following year. The rate at which those players maintained their performance may be the right predictor.
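The breakout-season lookup described above can be sketched in Python as follows. This is only an illustration on the ten rows shown in the question; a real analysis would run over the whole historical database. The 2x threshold comes from the suggestion above, while the minimum-baseline parameter and the function name are assumptions added for this example.

```python
from statistics import median

# Home runs for 2006..2010, copied from the ten rows shown in the question.
players = {
    "Bobby Abreu": [15, 16, 20, 15, 20],
    "Garret Anderson": [17, 16, 15, 13, 2],
    "Bronson Arroyo": [2, 1, 1, 0, 1],
    "Garrett Atkins": [29, 25, 21, 9, 1],
    "Brad Ausmus": [2, 3, 3, 1, 0],
    "Jeff Baker": [5, 4, 12, 4, 4],
    "Rod Barajas": [11, 4, 11, 19, 17],
    "Josh Bard": [9, 5, 1, 6, 3],
    "Jason Bartlett": [2, 5, 1, 14, 4],
    "Jose Bautista": [16, 15, 15, 13, 54],
}


def breakout_followups(seasons, factor=2.0, min_prev=1):
    """Return (breakout_total, next_total) pairs for one player.

    A breakout is a season with at least `factor` times the previous
    season's total. `min_prev` (an assumption) skips baselines of zero.
    Only breakouts that have a following season in the data are returned.
    """
    pairs = []
    for prev, peak, nxt in zip(seasons, seasons[1:], seasons[2:]):
        if prev >= min_prev and peak >= factor * prev:
            pairs.append((peak, nxt))
    return pairs


# Collect every breakout and the share of it kept the following year.
retention = []
for name, seasons in players.items():
    for peak, nxt in breakout_followups(seasons):
        retention.append(nxt / peak)
        print(f"{name}: breakout {peak} HR, next season {nxt} HR")

print(f"breakouts found: {len(retention)}")
print(f"median share of breakout total kept: {median(retention):.2f}")
```

Bautista's own 2010 season does not appear in the output because there is no following season in the data yet; that is the value the historical comparison is meant to predict. With low baselines like these, a 2x jump can be a matter of a few home runs, so a higher `min_prev` is probably sensible on real data.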

There are a variety of ways to look at this problem but, as mpiktas said, you're going to need more data. If you just want to deal with recent data, then you're going to have to look at overall league stats, the pitchers he's up against, and so on. It's a complex problem.

And then there's Bautista's own data to consider. Yes, 2010 was his best year, but it was also the first time since 2007 that he had over 350 ABs (569). You might want to convert his totals to a per-at-bat rate before measuring the percentage increase in performance.
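To see why the number of at-bats matters, here is a small Python sketch comparing the raw jump in home runs with the jump in home runs per at-bat. The 2009 at-bat count is not given here, so the code uses 350 only as an upper bound implied by the statement above; it is not his actual total, and the result is therefore a ceiling on the rate increase rather than an estimate.

```python
# Figures taken from the question and the paragraph above.
hr_2009, hr_2010 = 13, 54
ab_2010 = 569

# Assumption: an upper bound only ("first time since 2007 over 350 ABs").
# Replace with the real 2009 at-bat count from the database.
ab_2009_max = 350

# Raw jump in home run totals.
raw_ratio = hr_2010 / hr_2009

# Home runs per at-bat. With at most 350 ABs in 2009, the 2009 rate is
# at least 13/350, so the rate increase is at most the ratio below.
rate_2010 = hr_2010 / ab_2010
rate_2009_min = hr_2009 / ab_2009_max
rate_ratio_max = rate_2010 / rate_2009_min

print(f"raw HR increase:             {raw_ratio:.2f}x")
print(f"HR per AB in 2010:           {rate_2010:.3f}")
print(f"HR per AB increase, at most: {rate_ratio_max:.2f}x")
```

On these figures the per-at-bat increase is noticeably smaller than the raw increase in totals, which is the point of converting to a rate before asking how much regression to expect.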
