Solved – Calculating the mean squared predicted error for OLS in R

least squares, linear model, multiple regression, r, regression

n <- 1000
betas <- c(1.2, 3, -2)                  # true coefficients
X <- cbind(var1 = rnorm(n, 0, 1),       # fixed design matrix
           var2 = rnorm(n, 0, 2),
           var3 = rnorm(n, 0, 1.5))
set.seed(1)
epsilon <- rnorm(n, 0, 1)               # noise, sigma^2 = 1
y <- X %*% betas + epsilon
data <- data.frame(cbind(y, X))
colnames(data) <- c("y", "var1", "var2", "var3")
> head(data)
          y        var1       var2       var3
1  9.983021  0.07730312  1.7000869 -2.7082253
2 -3.690355 -0.29686864 -1.8506260 -1.0170610
3  4.526042 -1.18324224  1.7871624 -0.7100371
4 -7.119478  0.01129269 -1.8820195  1.5411256
5  6.545304  0.99160104  1.0779042 -0.8960814
6 -3.479102  1.59396745 -0.3639488  1.7397741

I generate data from $Y = X\beta + \epsilon$. Assume that $X$ is fixed, and the only thing that is random is $\epsilon$. I divide the data set in half: half for training and half for testing.

# Split the data set into training and test set
train <- data[1:(n/2), ]
test <- data[(n/2 + 1): n, ]
test_X <- test[, names(test) != "y", drop = FALSE]

I fit an OLS model and calculate the empirical prediction error and get a value of 1.120582.

# Fit OLS model and calculate empirical prediction error
model <- lm(y ~ var1 + var2 + var3 - 1, data = train)
> mean((test$y - predict(model, newdata = test_X))^2)
[1] 1.120582

I then calculate the MSPE according to its formula:

\begin{align*}
E\|Y_{test} - X_{test} \hat{\beta}_{OLS}\|_2^2 &= Var(X_{test}\hat{\beta}_{OLS}) + N_{test} \cdot \sigma^2_{\epsilon}\\
&= tr\left(X_{test} Cov(\hat{\beta}_{OLS}) X_{test}^T\right) + N_{test} \cdot \sigma^2_{\epsilon}
\end{align*}

In this case, $N = 1000,$ so $N_{test} = N/2 = 500$, and $\sigma^2_{\epsilon} = 1$. Using the above formula, I get an MSPE of 503.0804, which is very different from the 1.120582 I found above. I understand that MSPE is an expected value, i.e., ideally I would generate many data sets and average across their prediction error. However, the difference between 503.0804 and 1.120582 is so big that it leads me to think that I might have written down the formula or coded it wrong. Any thoughts?

# Calculate the MSPE
sigma2 <- 1  # true noise variance
> sum(diag(as.matrix(test_X) %*% vcov(model) %*% t(as.matrix(test_X)))) + nrow(test_X) * sigma2
[1] 503.0804

Best Answer

Good question. Let $(y,x)$ be your test data with $y = x^T \beta^* + \varepsilon$ and assume that $x$ is random as well (independent of $\varepsilon$). To simplify, let $$\delta = \beta^* - \widehat \beta.$$ Then, your mean-squared prediction error is \begin{align*} \mathbb E [(y - x^T \widehat \beta)^2 \mid \mathcal T] &= \mathbb E [(x^T \beta^* - x^T \widehat \beta + \varepsilon)^2 \mid \mathcal T] \\ &= \mathbb E[(x^T\delta)^2 + \varepsilon^2 \mid \mathcal T] \\ &= \delta^T \Sigma_x \delta + \sigma^2, \end{align*} where the cross term vanishes because $\varepsilon$ is independent of $x$ and has mean zero. Here $\mathcal T$ is the training data and $\Sigma_x$ is the covariance of $x$ assuming it is zero mean, i.e., $\Sigma_x = \mathbb E[xx^T]$.

In your example, $\sigma^2 = 1$ and $$ \Sigma_x = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2^2 & 0 \\ 0 & 0 & 1.5^2 \end{pmatrix}. $$

Note that the covariance of $\widehat\beta$ does not come into play.
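
To make this concrete, here is a minimal R sketch (mine, not from the original post) that plugs the fitted coefficients into $\delta^T \Sigma_x \delta + \sigma^2$, reusing betas and model from the question:

# Population-level MSPE: delta' Sigma_x delta + sigma^2
delta   <- betas - coef(model)            # delta = beta* - beta_hat
Sigma_x <- diag(c(1, 2, 1.5)^2)           # population covariance of (var1, var2, var3)
drop(t(delta) %*% Sigma_x %*% delta) + 1  # sigma^2 = 1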


What do you actually compute in practice via the training and test splits? Suppose you have $y_i = x_i^T \beta^* + \varepsilon_i$ and let $y = (y_1,\dots,y_n)$ and $X$ have rows $x_i^T$. I am assuming that these are all test data, i.e., $(y,X)$ is the test data.

What does the following line of code compute?

mean((test$y - predict(model, newdata = test_X))^2)

This code only averages over the test data. It computes \begin{align*} \frac1n \| y - X \widehat \beta\|^2 &= \frac1n \| X \beta^* - X \widehat \beta + \varepsilon\|^2 \\ &= \frac1n \| X \delta + \varepsilon\|^2 \\ &= \delta^T \frac{X^T X}{n} \delta + 2 \delta^T \Big(\frac{1}{n} X^T \varepsilon \Big) + \frac{\|\varepsilon\|^2}{n}. \end{align*} Let $$\widehat \Sigma_x = \frac{1}{n} X^T X, \quad \text{ and } \quad \widehat \Sigma_{x \varepsilon} = \frac{1}{n} X^T \varepsilon.$$ Then, \begin{align*} \frac1n \| y - X \widehat \beta\|^2 &= \delta^T \widehat \Sigma_x \delta + 2 \delta^T \widehat \Sigma_{x\varepsilon} + \frac{\|\varepsilon\|^2}{n}. \end{align*} If you would like to assume the $x_i$ are nonrandom, you can, for example, define your prediction error as $$ \frac1n \mathbb E \| y - X \widehat \beta\|^2 = \delta^T \widehat \Sigma_x \delta + \sigma^2, $$ since $\mathbb E [X^T \varepsilon] = X^T \mathbb E [\varepsilon] = 0$ and $\mathbb E \|\varepsilon\|^2 / n = \sigma^2$. In the above argument the expectation is w.r.t. the test data (effectively, conditional on the training data). The only difference here is that you are using the sample covariance matrix of the test data, $\widehat \Sigma_x$, instead of the population-level covariance $\Sigma_x$.
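
As a sanity check, this identity can be verified numerically on the simulated data, since the simulation keeps the test-half noise in epsilon (a sketch under that assumption; variable names are mine):

# Verify: (1/n)||y - X beta_hat||^2 = delta' Sig_x_hat delta + 2 delta' Sig_xe_hat + ||eps||^2/n
X_te   <- as.matrix(test_X)
eps_te <- epsilon[(n/2 + 1):n]                      # test-half noise, known by simulation
delta  <- betas - coef(model)
Sig_x_hat  <- crossprod(X_te) / nrow(X_te)          # (1/n) X'X
Sig_xe_hat <- crossprod(X_te, eps_te) / nrow(X_te)  # (1/n) X'eps
drop(t(delta) %*% Sig_x_hat %*% delta + 2 * t(delta) %*% Sig_xe_hat) + sum(eps_te^2) / nrow(X_te)
# matches mean((test$y - predict(model, newdata = test_X))^2) up to floating point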

Depending on what you are willing to assume fixed versus random, you get different approximations: from the exact value (i.e., both $X$ and $y$ of the test set are considered fixed), to the more accurate one where only $X$ is fixed, to the less accurate one where you take the expectation over both $x$ and $y$.


But what about slide 4 linked in the comments? The covariance of $\widehat \beta$ does not come into play unless you want to repeat the training many times and take an average over those runs as well (but this is not what your code is doing ... you are only training once). If you repeat the experiment, averaging over many training/test splits, then you can evaluate $$ \mathbb E (\delta^T \widehat \Sigma_x \delta ) = \mathbb E (\text{tr}( \widehat \Sigma_x \delta \delta^T)) = \text{tr}( \widehat \Sigma_x \, \mathbb E[\delta \delta^T]) = \text{tr}( \widehat \Sigma_x \, \text{cov}(\widehat \beta)), $$ where $\mathbb E$ is now over the training data. (Incidentally, this averaged quantity is essentially your 503.0804 divided by $N_{test} = 500$, about 1.006: your formula computes an expected sum of squared errors over the test set, while your code computes a mean.)
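
To see this numerically, here is a minimal Monte Carlo sketch (the replication count and variable names are my own choices): hold $X$ fixed, redraw $\varepsilon$, refit on the training half each time, and compare the average test MSE with $\text{tr}(\widehat\Sigma_x \, \text{cov}(\widehat\beta)) + \sigma^2$:

# Average the empirical test error over many re-trainings (X fixed, only epsilon redrawn)
set.seed(2)
X_tr <- X[1:(n/2), ]
X_te <- X[(n/2 + 1):n, ]
mse <- replicate(2000, {
  eps      <- rnorm(n)                                   # fresh noise, sigma^2 = 1
  y_new    <- X %*% betas + eps
  beta_hat <- lm.fit(X_tr, y_new[1:(n/2)])$coefficients  # OLS on the training half
  mean((y_new[(n/2 + 1):n] - X_te %*% beta_hat)^2)
})
mean(mse)                                                # Monte Carlo average of the test MSE
Sig_x_hat <- crossprod(X_te) / nrow(X_te)
cov_bhat  <- solve(crossprod(X_tr))                      # sigma^2 * (X_tr' X_tr)^{-1}, sigma^2 = 1
sum(diag(Sig_x_hat %*% cov_bhat)) + 1                    # tr(Sig_x_hat Cov(beta_hat)) + sigma^2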
