Expected test error in regression

Tags: linear-regression, mean-square-error, regression

I am unsure about the definition of the expected test error here. As far as I understand it, the definition is the following.

In a linear model the relationship between the random response variable $Y_i$ and the predictor vector $x_{i}$ is assumed to be of the following form

$$ Y_i = x^T_{i}\beta + \epsilon_i $$

where $\epsilon_i$ has expected value zero and variance $\sigma^2$.
Let $\hat{\beta}$ be the least squares estimator fitted on a training data set $(x_1,y_1),\ldots,(x_n,y_n)$. Now we obtain a new instance $(x,y)$ from the same source as the instances in the training data set. Of course, the observation $y$ is again an observation of a random variable $Y$.
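(Concretely, if $X$ denotes the $n \times p$ design matrix with rows $x_i^T$ and $y = (y_1,\ldots,y_n)^T$, the least squares estimator is $\hat{\beta} = (X^TX)^{-1}X^Ty$, assuming $X^TX$ is invertible.)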

According to the above source, the expected test error is:

$$ \mathbb{E}[(y-x^T\hat{\beta})^2] $$

The above source now claims that

$$ \mathbb{E}[(y-x^T\hat{\beta})^2] = \mathbb{E}[(y-x^T\beta)^2] + \mathbb{E}[(x^T\beta-x^T\hat{\beta})^2] $$
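As far as I can tell, this decomposition follows by writing $y = x^T\beta + \epsilon$ and expanding the square; assuming the noise term $\epsilon$ of the new instance is independent of the training data used to fit $\hat{\beta}$, the cross term

$$ 2\,\mathbb{E}\big[\epsilon\,(x^T\beta - x^T\hat{\beta})\big] = 2\,\mathbb{E}[\epsilon]\,\mathbb{E}\big[x^T\beta - x^T\hat{\beta}\big] = 0 $$

vanishes because $\mathbb{E}[\epsilon] = 0$.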

The source further claims that

$$ \mathbb{E}[(y-x^T\beta)^2] = \sigma^2 $$

Now the last claim is not clear to me. What would have been clear to me is that

$$ \mathbb{E}[(Y-x^T\beta)^2] = \sigma^2 $$

where $Y$ is the random variable rather than the observation $y$ of that random variable. Hence the question arises whether the expected test error is

$$ \mathbb{E}[(Y-x^T\hat{\beta})^2] \quad \text{ rather than } \quad \mathbb{E}[(y-x^T\hat{\beta})^2]. $$

In other words, the question is whether to use the random variable $Y$ in the expected error or the observation $y$ of this random variable.

Best Answer

For the expected error you should use the random variable $Y$; otherwise $(y_i - x_i^T\beta)^2$ is a constant for every $i$. However, in reality you cannot calculate the expected value of a random variable, so you estimate $\sigma^2$ using $\frac{1}{n} \sum_{i=1}^n (y_i - x_i^T\hat{\beta})^2$.
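A minimal simulation sketch of both points (the particular $\beta$, $\sigma$, and sample sizes below are illustrative assumptions, not from the question): the mean squared residual approximates $\sigma^2$, while the expected test error $\mathbb{E}[(Y - x^T\hat{\beta})^2]$ is approximated by averaging over many fresh draws of the random variable $Y$, not by plugging in a single observation $y$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumed for this sketch): true beta and noise level.
n, p = 500, 3
beta = np.array([1.0, -2.0, 0.5])
sigma = 1.5

# Training data: Y_i = x_i^T beta + eps_i, with E[eps_i] = 0, Var(eps_i) = sigma^2.
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(scale=sigma, size=n)

# Least squares estimator beta_hat fitted on the training data.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The answer's estimator of sigma^2: the mean squared residual.
sigma2_hat = np.mean((y - X @ beta_hat) ** 2)

# Expected test error E[(Y - x^T beta_hat)^2] at a fixed new x, approximated
# by Monte Carlo over fresh draws of the random variable Y (not a single y).
x_new = rng.normal(size=p)
Y_new = x_new @ beta + rng.normal(scale=sigma, size=100_000)
test_error = np.mean((Y_new - x_new @ beta_hat) ** 2)

print(f"true sigma^2:          {sigma**2:.3f}")
print(f"mean squared residual: {sigma2_hat:.3f}")
print(f"expected test error:   {test_error:.3f}")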
