Solved – Forecasting optimization techniques in fantasy baseball

Tags: hypothesis testing, machine learning, optimization, time series

I am trying to build a better forecasting model for my fantasy baseball roster. Currently I use commonly accepted projected season statistics (ZiPS from FanGraphs) to estimate the average fantasy points a player can be expected to contribute per game. This is problematic, however, because it does not account for variance in player performance (among other things).

Since baseball involves both luck and skill, I don't think it is useful to try to predict any particular statistic of a game (e.g., how many hits Prince Fielder will have). Instead, I would like to project average point contributions while taking variance into account.

The first thing that comes to mind is the effect of the opposing pitcher. My hypothesis is that the quality of the pitcher affects opposing players' performance. Given two players who are roughly equal in projected fantasy point averages, how can I quantify the effect of the opposing pitcher, and how can I test this hypothesis?

Also, how can I account for variance in a reasonable way even though it is uncertain? And how would I actually know whether my projections are underperforming or overperforming? (This seems similar to a financial portfolio optimization problem.)

Best Answer

Accounting for variance

There's a lot to think about in optimising a lineup for fantasy sports. You're right that expectation and variance are both huge parts of it. Naively it would seem that expected points earned is all that matters. However, certain contests reward only the first-place entry out of thousands -- meaning you want a very high-variance team so that your upper tail is fat enough to make a win possible*. Other contests reward everyone in the top half equally, in which case you care more about expectation and less about having a fat tail -- you just want to get as much of your expected distribution as possible above the halfway cutoff.

For some intuition on how your lineup ties to variance, there are a couple of mechanisms. Certain players have higher variance than others: players on teams that rarely lose, for instance, tend to have fairly low variance. But the main effect comes from picking players on the same team or picking players on opposing teams. Players on the same team have high positive covariance -- they tend to win or lose points together. Players facing each other have negative covariance -- one does well or the other does, but rarely both or neither. (This is magnified for certain positions, like a striker vs the opposing goalie in hockey or, I imagine, pitchers vs opposing batters in baseball.) Lineups featuring players with high covariance to one another are lineups with high variance.

You're right that calculating the variance of a team is analogous to portfolio optimisation. The structure is exactly the same: assets become players, expected returns become expected points, and variance is still variance.

Calculating the variance of a lineup given a set of players with known covariances is straightforward. If we are considering $N$ players, define $p$ as a binary vector of length $N$ whose entries are 1 if a player is in the lineup and 0 otherwise, and let $\Sigma$ be the $N \times N$ covariance matrix of all players. The variance of the lineup, $\sigma^2$, is then:

$ \sigma^2 = p^T \Sigma p $

where $^T$ denotes the transpose.
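As a minimal sketch of this formula (all numbers below are made up for illustration), the lineup variance is a single quadratic form in NumPy:

```python
import numpy as np

# Hypothetical 4-player pool with an invented covariance matrix of
# per-game fantasy points (symmetric, diagonally dominant, hence PSD).
Sigma = np.array([
    [4.0,  1.5, 0.2, -0.8],   # players 0 and 1 are teammates: positive cov
    [1.5,  3.0, 0.1, -0.5],
    [0.2,  0.1, 2.5,  0.3],
    [-0.8, -0.5, 0.3, 5.0],   # player 3 opposes players 0 and 1: negative cov
])

# Binary selection vector: pick players 0, 1 and 3.
p = np.array([1, 1, 0, 1])

# Variance of the lineup: p^T Sigma p
lineup_variance = p @ Sigma @ p
print(lineup_variance)  # 12.4
```

Note how the negative covariances with player 3 pull the lineup variance down; swapping player 3 for another teammate of 0 and 1 would raise it.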

Covariance should not be calculated at the individual-player level, since there will never be enough data: Pitcher A's covariance with Batter B across an entire season is not useful, because they won't face each other in every game. Instead, calculate it at the position level: the covariance of home-team pitchers versus visiting shortstops, home-team shortstops versus visiting pitchers, home-team shortstops versus home-team pitchers, and so on. (Disclaimer: opinion -- you could also calculate it purely as home vs away, or as home vs away crossed with pitchers vs non-pitchers.)
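A sketch of estimating position-level covariances from game logs, with entirely hypothetical data and column names:

```python
import pandas as pd

# Invented per-game fantasy point totals, aggregated to the
# position level rather than to individual players.
games = pd.DataFrame({
    "home_pitcher_pts":   [12.0,  5.0, 18.0, 9.0, 14.0],
    "away_shortstop_pts": [ 3.0, 11.0,  1.0, 8.0,  2.0],
    "home_shortstop_pts": [ 9.0,  4.0, 13.0, 6.0, 10.0],
})

# Sample covariance matrix across games; the off-diagonal entries are
# the position-level covariances used in place of per-player estimates.
cov = games.cov()
print(cov.loc["home_pitcher_pts", "away_shortstop_pts"])  # -20.25 (negative: opposing positions)
```

With enough seasons of game logs, each cell of this matrix is estimated from hundreds of games rather than the handful of head-to-head matchups any two individual players share.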

Effects of opposing players

To the expectation-predicting point of your post, you should absolutely use information from the pitcher to predict the performance of a batter. Fantasy aficionados call this "defense against position" or "fantasy points against". Essentially the idea is to use the historic point earnings of batters who have faced this pitcher to predict future performance for all batters who face them. It's a hugely powerful predictor, and you can validate as you would any other model change.
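A minimal sketch of such a "fantasy points against" feature, again with invented numbers: average the historic points batters earned against each pitcher, then attach that average as a predictor for any batter facing that pitcher.

```python
import pandas as pd

# Hypothetical game log: points each batter earned against each pitcher.
log = pd.DataFrame({
    "pitcher":    ["A", "A", "A", "B", "B", "B"],
    "batter_pts": [2.0, 4.0, 3.0, 8.0, 10.0, 9.0],
})

# "Fantasy points against": mean points conceded by each pitcher.
# This becomes a feature in the batter's projection model.
fpa = log.groupby("pitcher")["batter_pts"].mean()
print(fpa["A"], fpa["B"])  # 3.0 9.0 -- pitcher A suppresses batters more than B
```

In a real model you would join this feature onto each upcoming matchup and validate it against a holdout season, just as with any other model change.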

* There's a really cool paper by a group from MIT on optimising entries for very top-heavy fantasy sports contests, where they plan to make multiple entries and want to maximize the expectation of the entire multi-entry strategy. They care about having high variance within each lineup they enter, but also low correlation between their entries. They treat it as a max-coverage problem and solve it with integer programming. It's a good read.
