Solved – Forecasting optimization techniques in fantasy baseball

Tags: hypothesis testing, machine learning, optimization, time series

I am trying to build a better forecasting model for my fantasy baseball roster. Currently I use commonly accepted projected season statistics (ZiPS from FanGraphs) to estimate the average fantasy points a player can be expected to contribute per game. This is problematic, however, because it does not account for variance in player performance (among other things).

Since baseball involves both luck and skill, I don't think it is useful to try to predict any particular statistic of a game (e.g., how many hits Prince Fielder will have). Instead, I would like to project average point contributions while taking variance into account.

The first thing that comes to mind is the effect of the opposing pitcher. My hypothesis is that the quality of the pitcher affects opposing players' performance. Given two players who are roughly equal in projected fantasy point averages, how can I quantify the effect of the opposing pitcher, and how can I test this hypothesis?

Also, how can I account for variance in a reasonable way even though it is uncertain? And how would I actually know whether my projections are underperforming or overperforming? (This seems similar to a financial portfolio optimization problem.)

Best Answer

Accounting for variance

There's a lot to think about in optimising a lineup for fantasy sports. You're right that expectation and variance are both huge parts of it. Naively it would seem that expected points earned is all that matters. However, certain contests reward only the first-place entry out of thousands -- meaning you want a very high-variance team so that your upper tail is fat enough to make a win possible*. Other contests reward everyone in the top half equally, in which case you care more about expectation and less about having a fat tail -- you just want to get as much of your expected distribution as possible above the halfway cutoff.

For some intuition on how your lineup ties to variance, there are a couple of mechanisms. Certain players have higher variance than others: players on teams that rarely lose, for instance, tend to have fairly low variance. But the main effect comes from picking players on the same team or picking players on opposing teams. Players on the same team have high positive covariance -- they tend to win or lose points together. Players facing each other have negative covariance -- one does well or the other does, but rarely both or neither. (This is magnified for certain positions, like a striker vs the opposing goalie in hockey or, I imagine, pitchers vs opposing batters in baseball.) Lineups featuring players with high covariance to one another are lineups with high variance.

You're right that calculating the variance of a team is analogous to portfolio optimisation. The structure is exactly the same: assets become players, expected returns become expected points, and variance is still variance.

Calculating the variance of a lineup given a set of players with known covariances is straightforward. If we are considering $N$ players, define $p$ as a binary vector of length $N$ whose entries are 1 if a player is in the lineup and 0 otherwise, and let $\Sigma$ be the $N \times N$ covariance matrix of all players. The variance of the lineup, $\sigma^2$, is then:

$ \sigma^2 = p^T \Sigma p $

where $^T$ denotes the transpose.
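As a minimal sketch of this formula (all numbers below are made up for illustration), the lineup variance is a single quadratic form in NumPy:

```python
import numpy as np

# Hypothetical 4-player pool with an invented covariance matrix of
# per-game fantasy points (symmetric, diagonally dominant, hence PSD).
Sigma = np.array([
    [4.0,  1.5, 0.2, -0.8],   # players 0 and 1 are teammates: positive cov
    [1.5,  3.0, 0.1, -0.5],
    [0.2,  0.1, 2.5,  0.3],
    [-0.8, -0.5, 0.3, 5.0],   # player 3 opposes players 0 and 1: negative cov
])

# Binary selection vector: pick players 0, 1 and 3.
p = np.array([1, 1, 0, 1])

# Variance of the lineup: p^T Sigma p
lineup_variance = p @ Sigma @ p
print(lineup_variance)  # 12.4
```

Note how the negative covariances with player 3 pull the lineup variance down; swapping player 3 for another teammate of 0 and 1 would raise it.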

Covariance should not be calculated at the individual-player level, since there will never be enough data: Pitcher A's covariance with Batter B across an entire season is not useful, because they won't face each other in every game. Instead, calculate it at the position level: the covariance of home-team pitchers versus visiting shortstops, home-team shortstops versus visiting pitchers, home-team shortstops versus home-team pitchers, and so on. (Disclaimer: opinion -- you could also calculate it purely as home vs away, or as home vs away crossed with pitchers vs non-pitchers.)
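A sketch of estimating position-level covariances from game logs, with entirely hypothetical data and column names:

```python
import pandas as pd

# Invented per-game fantasy point totals, aggregated to the
# position level rather than to individual players.
games = pd.DataFrame({
    "home_pitcher_pts":   [12.0,  5.0, 18.0, 9.0, 14.0],
    "away_shortstop_pts": [ 3.0, 11.0,  1.0, 8.0,  2.0],
    "home_shortstop_pts": [ 9.0,  4.0, 13.0, 6.0, 10.0],
})

# Sample covariance matrix across games; the off-diagonal entries are
# the position-level covariances used in place of per-player estimates.
cov = games.cov()
print(cov.loc["home_pitcher_pts", "away_shortstop_pts"])  # -20.25 (negative: opposing positions)
```

With enough seasons of game logs, each cell of this matrix is estimated from hundreds of games rather than the handful of head-to-head matchups any two individual players share.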

Effects of opposing players

To the expectation-predicting point of your post, you should absolutely use information from the pitcher to predict the performance of a batter. Fantasy aficionados call this "defense against position" or "fantasy points against". Essentially the idea is to use the historic point earnings of batters who have faced this pitcher to predict future performance for all batters who face them. It's a hugely powerful predictor, and you can validate as you would any other model change.
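A minimal sketch of such a "fantasy points against" feature, again with invented numbers: average the historic points batters earned against each pitcher, then attach that average as a predictor for any batter facing that pitcher.

```python
import pandas as pd

# Hypothetical game log: points each batter earned against each pitcher.
log = pd.DataFrame({
    "pitcher":    ["A", "A", "A", "B", "B", "B"],
    "batter_pts": [2.0, 4.0, 3.0, 8.0, 10.0, 9.0],
})

# "Fantasy points against": mean points conceded by each pitcher.
# This becomes a feature in the batter's projection model.
fpa = log.groupby("pitcher")["batter_pts"].mean()
print(fpa["A"], fpa["B"])  # 3.0 9.0 -- pitcher A suppresses batters more than B
```

In a real model you would join this feature onto each upcoming matchup and validate it against a holdout season, just as with any other model change.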

* There's a really cool paper by a group from MIT on optimising entries for very top-heavy fantasy sports contests, where they plan to make multiple entries and want to maximize the expectation of the entire multi-entry strategy. They care about having high variance within each lineup they enter, but also low correlation between their entries. They treat it as a max-coverage problem and solve it with integer programming. It's a good read.
