Solved – Why is regression about variance?

Tags: interpretation, regression, variance

I am reading this note.

On page 2, it states:

"How much of the variance in the data is explained by a given regression model?"

"Regression interpretation is about the mean of the coefficients; inference is about their variance."

I have read such statements numerous times. Why would we care about "how much of the variance in the data is explained by the given regression model"? More specifically, why "variance"?

Best Answer

why would we care about "how much of the variance in the data is explained by the given regression model?"

To answer this it is useful to think about exactly what it means for a certain percentage of the variance to be explained by the regression model.

Let $Y_{1}, ..., Y_{n}$ be the observed values of the outcome variable. The usual sample variance of the dependent variable in a regression model is $$ \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \overline{Y})^2 $$ Now let $\widehat{Y}_i \equiv \widehat{f}({\boldsymbol X}_i)$ be the prediction of $Y_i$ from a least squares linear regression model with predictor values ${\boldsymbol X}_i$. As proven here, the variance above can be partitioned as:
$$ \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \overline{Y})^2 = \underbrace{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2}_{{\rm residual \ variance}} + \underbrace{\frac{1}{n-1} \sum_{i=1}^{n} (\widehat{Y}_i - \overline{Y})^2}_{{\rm explained \ variance}} $$
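Why does the decomposition hold without any cross term? A short sketch (assuming the model includes an intercept, so the normal equations apply): expanding the square gives
$$ \sum_{i=1}^{n} (Y_i - \overline{Y})^2 = \sum_{i=1}^{n} \left[ (Y_i - \widehat{Y}_i) + (\widehat{Y}_i - \overline{Y}) \right]^2 = \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2 + \sum_{i=1}^{n} (\widehat{Y}_i - \overline{Y})^2 + 2 \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)(\widehat{Y}_i - \overline{Y}), $$
and the normal equations of least squares imply $\sum_i (Y_i - \widehat{Y}_i) = 0$ and $\sum_i (Y_i - \widehat{Y}_i)\,\widehat{Y}_i = 0$, so the cross term vanishes; dividing through by $n-1$ gives the decomposition above.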

In least squares regression (with an intercept included), the average of the predicted values is $\overline{Y}$, so the total variance equals the average squared difference between the observed and predicted values (residual variance) plus the sample variance of the predictions themselves (explained variance), which are only a function of the ${\boldsymbol X}$s. Therefore the "explained" variance may be thought of as the variance in $Y_i$ that is attributable to variation in ${\boldsymbol X}_i$. The proportion of the variance in $Y_i$ that is "explained" (i.e. the proportion of variation in $Y_i$ that is attributable to variation in ${\boldsymbol X}_i$) is referred to as $R^2$.
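As a concrete check, here is a minimal numerical sketch in Python (simulated data and numpy only, nothing taken from the note itself) that fits an ordinary least squares line and verifies that the total variance splits into residual plus explained variance, with the explained share giving $R^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: a linear signal plus noise (purely illustrative)
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)

# Ordinary least squares fit with an intercept
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# Variance decomposition (all terms use the same 1/(n-1) factor)
total_var     = np.sum((y - y.mean())**2)     / (n - 1)
residual_var  = np.sum((y - y_hat)**2)        / (n - 1)
explained_var = np.sum((y_hat - y.mean())**2) / (n - 1)

print(np.isclose(total_var, residual_var + explained_var))  # True
print(explained_var / total_var)                            # this ratio is R^2
```

Because the fit includes an intercept, the mean of `y_hat` equals $\overline{Y}$, which is exactly what makes the split exact.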

Now we use two extreme examples to make it clear why this variance decomposition is important (a numerical sketch of both follows the list):

  • (1) The predictors have nothing to do with the responses. In that case, the best unbiased predictor (in the least squares sense) for $Y_i$ is $\widehat{Y}_i = \overline{Y}$. Therefore the total variance in $Y_i$ is just equal to the residual variance and is unrelated to the variance in the predictors ${\boldsymbol X}_i$.

  • (2) The predictors are perfectly linearly related to the responses. In that case, the predictions are exactly correct and $\widehat{Y}_i = Y_i$. Therefore there is no residual variance, and all of the variance in the outcome is the variance in the predictions themselves, which are only a function of the predictors. Hence all of the variance in the outcome is simply due to variance in the predictors ${\boldsymbol X}_i$.
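To see both extremes numerically, here is a small illustrative sketch (again simulated data with numpy; the helper `r_squared` is defined just for this example): in case (1) the predictor is independent of the response and $R^2$ is near zero, while in case (2) the response is an exact linear function of the predictor and $R^2$ equals 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)

def r_squared(x, y):
    """Fit OLS with an intercept and return the proportion of explained variance."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    return np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)

# (1) Predictor unrelated to the response: the best prediction is essentially
#     Y-bar, almost all variance is residual, and R^2 is close to 0.
y_noise = rng.normal(size=n)
print(round(r_squared(x, y_noise), 3))   # ~0

# (2) Response is an exact linear function of the predictor:
#     no residual variance, and R^2 is 1.
y_exact = 3.0 - 2.0 * x
print(round(r_squared(x, y_exact), 3))   # 1.0
```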

Situations with real data will often lie between the two extremes, as will the proportion of variance that can be attributed to these two sources. The more "explained variance" there is - i.e. the more of the variation in $Y_i$ that is due to variation in ${\boldsymbol X}_i$ - the better the predictions $\widehat{Y}_{i}$ are performing (i.e. the smaller the "residual variance" is), which is another way of saying that the least squares model fits well.