Solved – Why is regression about variance?

Tags: interpretation, regression, variance

I am reading this note.

On page 2, it states:

"How much of the variance in the data is explained by a given regression model?"

"Regression interpretation is about the mean of the coefficients; inference is about their variance."

I have read such statements numerous times. Why would we care about "how much of the variance in the data is explained by the given regression model"? More specifically, why "variance"?

Best Answer

why would we care about "how much of the variance in the data is explained by the given regression model?"

To answer this it is useful to think about exactly what it means for a certain percentage of the variance to be explained by the regression model.

Let $Y_{1}, ..., Y_{n}$ be the observed values of the outcome variable. The usual sample variance of the dependent variable in a regression model is $$ \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \overline{Y})^2 $$ Now let $\widehat{Y}_i \equiv \widehat{f}({\boldsymbol X}_i)$ be the prediction of $Y_i$ from a least squares linear regression model with predictor values ${\boldsymbol X}_i$. As proven here, the variance above can be partitioned as:
$$ \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \overline{Y})^2 = \underbrace{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2}_{{\rm residual \ variance}} + \underbrace{\frac{1}{n-1} \sum_{i=1}^{n} (\widehat{Y}_i - \overline{Y})^2}_{{\rm explained \ variance}} $$
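Why does the decomposition hold without any cross term? A short sketch (assuming the model includes an intercept, so the normal equations apply): expanding the square gives
$$ \sum_{i=1}^{n} (Y_i - \overline{Y})^2 = \sum_{i=1}^{n} \left[ (Y_i - \widehat{Y}_i) + (\widehat{Y}_i - \overline{Y}) \right]^2 = \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2 + \sum_{i=1}^{n} (\widehat{Y}_i - \overline{Y})^2 + 2 \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)(\widehat{Y}_i - \overline{Y}), $$
and the normal equations of least squares imply $\sum_i (Y_i - \widehat{Y}_i) = 0$ and $\sum_i (Y_i - \widehat{Y}_i)\,\widehat{Y}_i = 0$, so the cross term vanishes; dividing through by $n-1$ gives the decomposition above.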

In least squares regression (with an intercept included), the average of the predicted values is $\overline{Y}$, so the total variance equals the average squared difference between the observed and predicted values (residual variance) plus the sample variance of the predictions themselves (explained variance), which are only a function of the ${\boldsymbol X}$s. Therefore the "explained" variance may be thought of as the variance in $Y_i$ that is attributable to variation in ${\boldsymbol X}_i$. The proportion of the variance in $Y_i$ that is "explained" (i.e. the proportion of variation in $Y_i$ that is attributable to variation in ${\boldsymbol X}_i$) is referred to as $R^2$.
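As a concrete check, here is a minimal numerical sketch in Python (simulated data and numpy only, nothing taken from the note itself) that fits an ordinary least squares line and verifies that the total variance splits into residual plus explained variance, with the explained share giving $R^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: a linear signal plus noise (purely illustrative)
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)

# Ordinary least squares fit with an intercept
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# Variance decomposition (all terms use the same 1/(n-1) factor)
total_var     = np.sum((y - y.mean())**2)     / (n - 1)
residual_var  = np.sum((y - y_hat)**2)        / (n - 1)
explained_var = np.sum((y_hat - y.mean())**2) / (n - 1)

print(np.isclose(total_var, residual_var + explained_var))  # True
print(explained_var / total_var)                            # this ratio is R^2
```

Because the fit includes an intercept, the mean of `y_hat` equals $\overline{Y}$, which is exactly what makes the split exact.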

Now we use two extreme examples to make it clear why this variance decomposition is important (a numerical sketch of both follows the list):

  • (1) The predictors have nothing to do with the responses. In that case, the best unbiased predictor (in the least squares sense) for $Y_i$ is $\widehat{Y}_i = \overline{Y}$. Therefore the total variance in $Y_i$ is just equal to the residual variance and is unrelated to the variance in the predictors ${\boldsymbol X}_i$.

  • (2) The predictors are perfectly linearly related to the responses. In that case, the predictions are exactly correct and $\widehat{Y}_i = Y_i$. Therefore there is no residual variance, and all of the variance in the outcome is the variance in the predictions themselves, which are only a function of the predictors. Hence all of the variance in the outcome is simply due to variance in the predictors ${\boldsymbol X}_i$.
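To see both extremes numerically, here is a small illustrative sketch (again simulated data with numpy; the helper `r_squared` is defined just for this example): in case (1) the predictor is independent of the response and $R^2$ is near zero, while in case (2) the response is an exact linear function of the predictor and $R^2$ equals 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)

def r_squared(x, y):
    """Fit OLS with an intercept and return the proportion of explained variance."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    return np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)

# (1) Predictor unrelated to the response: the best prediction is essentially
#     Y-bar, almost all variance is residual, and R^2 is close to 0.
y_noise = rng.normal(size=n)
print(round(r_squared(x, y_noise), 3))   # ~0

# (2) Response is an exact linear function of the predictor:
#     no residual variance, and R^2 is 1.
y_exact = 3.0 - 2.0 * x
print(round(r_squared(x, y_exact), 3))   # 1.0
```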

Situations with real data will often lie between the two extremes, as will the proportion of variance that can be attributed to these two sources. The more "explained variance" there is - i.e. the more of the variation in $Y_i$ that is due to variation in ${\boldsymbol X}_i$ - the better the predictions $\widehat{Y}_{i}$ are performing (i.e. the smaller the "residual variance" is), which is another way of saying that the least squares model fits well.