Solved – Correct standard errors for weighted linear regression

r, regression

What is the correct way to calculate the standard errors of the coefficients in a weighted linear regression?

The regression equation I am using is $y_i = a + bx_i$, and I have weights $w_i = 1/\sigma_i^2$. The Numerical Recipes formula for a straight-line fit, the formula in "An Introduction to Error Analysis" by J. R. Taylor, and Wikipedia all state that the standard error of the $b$ coefficient is $$\sigma_b = \sqrt{\frac{\sum w_i}{\sum w_i\sum w_i x_i^2-(\sum w_i x_i)^2}}$$ (or, in matrix form, the coefficient covariance matrix is $(X'WX)^{-1}$, with the standard errors being the square roots of its diagonal). This formula can be derived by propagation of errors.
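For concreteness, here is a minimal numerical sketch of the two equivalent forms (the explicit sums and the matrix expression); the data and per-point uncertainties are made up for illustration:

```python
# Minimal sketch: propagation-of-errors standard error of the slope,
# assuming inverse-variance weights w_i = 1/sigma_i^2 and made-up data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
sigma = np.array([0.1, 0.2, 0.1, 0.3, 0.2])   # per-point uncertainties
w = 1.0 / sigma**2

# Closed-form sum version (Numerical Recipes / Taylor / Wikipedia)
delta = w.sum() * (w * x**2).sum() - (w * x).sum()**2
sigma_b = np.sqrt(w.sum() / delta)

# Matrix version: square root of the slope entry of (X'WX)^{-1}
X = np.column_stack([np.ones_like(x), x])
cov_unscaled = np.linalg.inv(X.T @ (w[:, None] * X))
assert np.isclose(sigma_b, np.sqrt(cov_unscaled[1, 1]))
```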

Using R's lm() function (and Python's statsmodels), I get a standard error for the $b$ coefficient which appears* to be calculated as $$\sigma_b = \sigma_e\sqrt{\frac{\sum w_i}{\sum w_i\sum w_i x_i^2-(\sum w_i x_i)^2}}$$
where $\sigma_e^2 = \sum w_i(y_i - a - bx_i)^2/(N-2)$ (or, in matrix form, the covariance matrix is $\sigma_e^2(X'WX)^{-1}$). So the two are the same except for the $\sigma_e$ multiplier in R and statsmodels.
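Here is a sketch of the comparison I mean, continuing the snippet above; I am assuming statsmodels' WLS reproduces what lm(..., weights = w) does:

```python
# Compare the sigma_e-scaled standard error with what statsmodels reports.
import statsmodels.api as sm

fit = sm.WLS(y, X, weights=w).fit()
resid = y - X @ fit.params
sigma_e2 = (w * resid**2).sum() / (len(y) - 2)   # weighted residual variance

se_unscaled = np.sqrt(cov_unscaled[1, 1])        # propagation-of-errors form
se_scaled = np.sqrt(sigma_e2 * cov_unscaled[1, 1])

print(se_unscaled, se_scaled, fit.bse[1])        # se_scaled should match fit.bse[1]
```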

Is it possible that these are actually different measures that are just being called the same thing? Is one preferred over the other as an estimate of the standard error?

*I say "appears" because I couldn't find the actual formula anywhere.

Edited because I had omitted the weight terms in the denominators.

Best Answer

As you note, these two expressions disagree in whether the residuals (the differences between $Y$ and the fitted values) enter the calculation of the standard errors. They are indeed different estimators, but they converge to the same thing in the long run. They can also be combined to form a "sandwich" estimator.

To revisit some basic modeling assumptions: the weighted linear regression model is estimated from a weighted estimating equation of the form:

$$U(\beta) = \mathbf{X}^T \mathbf{W}\left( Y - \mathbf{X}\beta\right)$$

where $\mathbf{W}$ is just the diagonal matrix of weights. Setting $U(\beta) = 0$ gives the weighted normal equations, and $U$ is the score of the (Gaussian) log likelihood, so its root is also the MLE. Then, the expected information is:

$$\mathbf{A}= -\frac{\partial U(\beta)}{\partial \beta} = \mathbf{X}^T\mathbf{W} \mathbf{X}$$

Then $\mathbf{A}^{-1}$ is a consistent estimator of the covariance matrix of $\hat\beta$ when (1) the mean model is correctly specified and (2) the weights are the inverse variances of the residuals. You have already written down $\mathbf{A}$: it is the matrix in your first display. Contrast this with the variance of the score:

$$\mathbf{B} = E[U(\beta)U(\beta)^T] = \mathbf{X}^T \mathbf{W}\,E\!\left[(Y-\mathbf{X}\beta)(Y-\mathbf{X}\beta)^T\right] \mathbf{W}\mathbf{X} $$

When the weights are correctly specified, one of the weight matrices multiplies the error covariance in the middle and the product reduces to a constant times the identity, so the constant factors out of the expression; that constant is exactly what $\sigma_e^2= \sum_{i=1}^n w_i (y_i - a - bx_i)^2/(n-2)$ estimates. $\mathbf{B}$ is also a consistent estimator of the information matrix, but it will disagree with $\mathbf{A}$ in finite samples.
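To make that factoring step explicit (this assumes the weights are proportional to the inverse residual variances, i.e. $\operatorname{Var}(Y) = \sigma^2\mathbf{W}^{-1}$):

$$\mathbf{W}\,E\!\left[(Y-\mathbf{X}\beta)(Y-\mathbf{X}\beta)^T\right] = \sigma^2\mathbf{I} \quad\Longrightarrow\quad \mathbf{B} = \sigma^2\,\mathbf{X}^T\mathbf{W}\mathbf{X} = \sigma^2\mathbf{A},$$

with $\sigma^2$ estimated by the weighted residual mean square $\sigma_e^2$. In particular, if the weights are exactly the inverse variances then $\sigma^2 = 1$ and the two standard-error formulas agree in expectation; R and statsmodels simply estimate this constant from the residuals rather than assuming it equals 1.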

As for which one to use, why not use both? The sandwich estimator $\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}$ combines them, and its validity depends neither on the mean model being correct nor on the weights being properly specified.
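As a rough numerical sketch (continuing the variables from the snippets in the question, and assuming I am reading statsmodels' robust-covariance option for WLS correctly), the empirical sandwich can be compared with the packaged version:

```python
# Empirical sandwich A^{-1} B A^{-1}: bread from X'WX, meat estimated from
# the squared residuals (no assumption that the weights are the true
# inverse variances). Uses X, y, w, resid, np, sm from the earlier snippets.
A = X.T @ (w[:, None] * X)                         # bread: X'WX
B = X.T @ np.diag(w**2 * resid**2) @ X             # meat: X'W diag(e^2) WX
sandwich = np.linalg.inv(A) @ B @ np.linalg.inv(A)

robust = sm.WLS(y, X, weights=w).fit(cov_type="HC0")
print(np.sqrt(np.diag(sandwich)))                  # sandwich standard errors
print(robust.bse)                                  # should agree with the above
```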