Solved – What does the variance of an estimator for a regression parameter mean

estimators, regression, variance

I may be asking a dumb or nonsensical question, but what does the variance of an estimator for a regression parameter (e.g. $\beta_{0}, \beta_{1}$) mean? How does it even have a variance? Isn't it a constant estimate of a presumed true but unknown constant value?

I have seen a good mathematical derivation of it here, from which I can see, for example, that
$$Var(\beta_{1}) = \frac{\sigma^{2}}{\sum{(x_{i} - \overline{x})^{2}}} = \frac{\sum{(y_{i} - \overline{y})^{2}}}{(n-1)\sum{(x_{i} - \overline{x})^{2}}} $$
but it is the practical understanding of it that is eluding me.

We don't know from our sample whether we are right or wrong; for all we know (even though there is variance in the Y's), we may have come up with the exact true parameter values. Yet we calculate a variance for the estimators based on the variation of the sampled X's and Y's from their means. How can that be?

Best Answer

Estimators are functions of the data, treated as random variables

In classical statistics, the regression parameters $\beta_0$ and $\beta_1$ are considered to be constants, and they do not have any variance. However, you estimate these parameters using estimators that are functions of the data in the regression model. In the case of a simple linear regression, this data consists of an explanatory vector $\mathbf{x} = (x_1,...,x_n)$ and a corresponding response vector $\mathbf{y} = (y_1,...,y_n)$. Under standard OLS estimation, the estimators can be written as linear functions of the response values:

$$\hat{\beta}_0 (\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n \Bigg[ \frac{\sum_j x_j (x_j - x_i)}{n \sum_j (x_j -\bar{x})^2} \Bigg] \cdot y_i \quad \quad \quad \quad \quad \hat{\beta}_1 (\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n \frac{(x_i -\bar{x})}{\sum_j (x_j -\bar{x})^2} \cdot y_i.$$
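As a quick sanity check on these linear forms, here is a short sketch in Python with NumPy (the data, sample size, and true parameter values are all made up for illustration) confirming that the weighted sums above reproduce the usual least-squares estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.uniform(0, 10, n)                  # hypothetical explanatory values
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)    # hypothetical responses

sxx = np.sum((x - x.mean()) ** 2)

# Linear weights on the y_i, taken from the formulas above
w0 = np.array([np.sum(x * (x - xi)) for xi in x]) / (n * sxx)
w1 = (x - x.mean()) / sxx

b0_hat = np.sum(w0 * y)
b1_hat = np.sum(w1 * y)

# Compare with the standard least-squares solution
b1_ols = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0_ols = y.mean() - b1_ols * x.mean()

print(b0_hat, b0_ols)   # should agree
print(b1_hat, b1_ols)   # should agree
```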

Since these parameter estimators are functions of the data, when the data are considered as random, the estimators are themselves random variables. Thus, they have a distribution and moments, including a mean and a variance. Below I will derive the variance of each of these estimators, but the fact that they have a (non-zero) variance is a consequence of the fact that, as estimators, they are functions of the data, viewed in their random-variable form.
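To make this concrete, the following simulation sketch (same hypothetical setup as above) holds the explanatory values fixed and redraws the responses many times; each redraw gives different estimates, and it is exactly this sample-to-sample scatter that the variance of the estimator measures:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 20, 1.0
beta0, beta1 = 2.0, 0.5          # assumed true (in practice unknown) constants
x = rng.uniform(0, 10, n)        # explanatory values, held fixed across replications
sxx = np.sum((x - x.mean()) ** 2)

estimates = []
for _ in range(10_000):          # each replication is a fresh sample of responses
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    estimates.append((b0, b1))

estimates = np.array(estimates)
print(estimates.mean(axis=0))    # close to (2.0, 0.5): the estimators are unbiased
print(estimates.std(axis=0))     # nonzero: the estimates vary from sample to sample
```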


Variance of the parameter estimators: In the context of regression analysis it is usual to proceed conditionally on the explanatory variables, so these are treated as fixed constants. However, even under this convention, the response variable is still random (since it is affected by the error term in the regression model), and so the estimators are still random variables. We therefore look at their distribution conditional on the explanatory values, and at the moments of this distribution. The variances of the estimators are given respectively by:

$$\begin{equation} \begin{aligned} \mathbb{V}(\hat{\beta}_0 | \mathbf{x}) = \mathbb{V}(\hat{\beta}_0 (\mathbf{x}, \mathbf{Y}) | \mathbf{x}) &= \mathbb{V} \Bigg( \sum_{i=1}^n \Bigg[ \frac{\sum_j x_j (x_j - x_i)}{n \sum_j (x_j -\bar{x})^2} \Bigg] \cdot Y_i \Bigg| \mathbf{x} \Bigg) \\[6pt] &= \sum_{i=1}^n \Bigg[ \frac{\sum_j x_j (x_j - x_i)}{n \sum_j (x_j -\bar{x})^2} \Bigg]^2 \cdot \mathbb{V} ( Y_i | \mathbf{x} ) \\[6pt] &= \sum_{i=1}^n \Bigg[ \frac{\sum_j x_j (x_j - x_i)}{n \sum_j (x_j -\bar{x})^2} \Bigg]^2 \cdot \sigma^2 \\[6pt] &= \sigma^2 \cdot \Bigg[ \frac{\sum_i (\sum_j x_j (x_j - x_i))^2}{n^2 (\sum_j (x_j -\bar{x})^2)^2} \Bigg] \\[6pt] &= \sigma^2 \cdot \Bigg[ \frac{\sum_i \sum_j \sum_k x_j (x_j - x_i) x_k (x_k - x_i)}{n^2 (\sum_j (x_j -\bar{x})^2)^2} \Bigg] \\[6pt] &= \sigma^2 \cdot \Bigg[ \frac{\sum_i \sum_j \sum_k (x_j^2 x_k^2 - x_i x_j x_k^2)}{n^2 (\sum_j (x_j -\bar{x})^2)^2} \Bigg] \\[6pt] &= \sigma^2 \cdot \Bigg[ \frac{(\sum_k x_k^2) \sum_i \sum_j x_j (x_j - x_i)}{n^2 (\sum_j (x_j -\bar{x})^2)^2} \Bigg] \\[6pt] &= \sigma^2 \cdot \Bigg[ \frac{(\sum_k x_k^2) \cdot n \sum_j (x_j - \bar{x})^2}{n^2 (\sum_j (x_j -\bar{x})^2)^2} \Bigg] \\[6pt] &= \sigma^2 \cdot \Bigg[ \frac{\sum_k x_k^2}{n \sum_j (x_j -\bar{x})^2} \Bigg] \\[6pt] &= \frac{\sigma^2 \sum_i x_i^2}{n \sum_i (x_i -\bar{x})^2}. \end{aligned} \end{equation}$$

(In the sixth line, expanding the product gives $x_j^2 x_k^2 - x_i x_j^2 x_k - x_i x_j x_k^2 + x_i^2 x_j x_k$; each of the three cross terms has the same value $(\sum_i x_i)^2 (\sum_j x_j^2)$ up to sign under the full triple sum, so together they collapse to the single term $-x_i x_j x_k^2$.) The slope estimator is simpler:

$$\begin{equation} \begin{aligned} \mathbb{V}(\hat{\beta}_1 | \mathbf{x}) = \mathbb{V}(\hat{\beta}_1 (\mathbf{x}, \mathbf{Y}) | \mathbf{x}) &= \mathbb{V} \Bigg( \sum_{i=1}^n \frac{(x_i -\bar{x})}{\sum_j (x_j -\bar{x})^2} \cdot Y_i \Bigg| \mathbf{x} \Bigg) \\[6pt] &= \sum_{i=1}^n \Bigg[ \frac{(x_i -\bar{x})}{\sum_j (x_j -\bar{x})^2} \Bigg]^2 \cdot \mathbb{V} ( Y_i | \mathbf{x} ) \\[6pt] &= \sum_{i=1}^n \Bigg[ \frac{(x_i -\bar{x})}{\sum_j (x_j -\bar{x})^2} \Bigg]^2 \cdot \sigma^2 \\[6pt] &= \sigma^2 \cdot \frac{\sum_i (x_i -\bar{x})^2}{(\sum_i (x_i -\bar{x})^2)^2} \\[6pt] &= \frac{\sigma^2}{\sum_i (x_i -\bar{x})^2} . \end{aligned} \end{equation}$$

Note that in both cases, the variability in the estimator (conditional on the explanatory values) comes from the variability in the response variables, which are random variables within the regression model. (It is also worth noting that the second expression for the variance in your question is wrong: it plugs in observed values of the response variable, rather than treating the responses as the random variables whose variability drives the variance of the estimator.)
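As a final numerical check on the derived formulas, a Monte Carlo sketch along the same lines as before (again with hypothetical values for $n$, $\sigma$, and the true parameters) should give empirical variances close to $\sigma^2 \sum_i x_i^2 / (n \sum_i (x_i - \bar{x})^2)$ and $\sigma^2 / \sum_i (x_i - \bar{x})^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 20, 1.0
x = rng.uniform(0, 10, n)
sxx = np.sum((x - x.mean()) ** 2)

# Closed-form variances from the derivation above
var_b0_theory = sigma**2 * np.sum(x**2) / (n * sxx)
var_b1_theory = sigma**2 / sxx

b0s, b1s = [], []
for _ in range(100_000):
    y = 2.0 + 0.5 * x + rng.normal(0, sigma, n)   # same assumed true model as before
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0s.append(y.mean() - b1 * x.mean())
    b1s.append(b1)

print(np.var(b0s), var_b0_theory)   # empirical vs. theoretical, should be close
print(np.var(b1s), var_b1_theory)
```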
