You can simulate parameter estimates using the mvtnorm package, with mean vector coef(an1) and covariance matrix vcov(an1), and then summarise the draws. Or you could bootstrap.
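For instance, a minimal sketch of the simulation approach (the simulated data and the model `an1` below are stand-ins for your own):

```r
library(mvtnorm)

## Stand-in data and model; replace with your own an1
set.seed(1)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
an1 <- lm(y ~ x1 + x2, data = dat)

## Draw parameter vectors from the asymptotic normal distribution
sims <- rmvnorm(10000, mean = coef(an1), sigma = vcov(an1))
colnames(sims) <- names(coef(an1))

## Summarise the simulated contrast x1 - x2
d <- sims[, "x1"] - sims[, "x2"]
mean(d)                        # point estimate of the contrast
quantile(d, c(0.025, 0.975))   # simulation-based 95% interval
```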
However, it's probably easier just to use the multcomp package to examine the contrast. In your model that would be:
library(multcomp)
cont <- glht(an1, linfct="x1 - x2 = 0")
summary(cont) ## estimate, standard error, z-statistic and p-value
confint(cont) ## confidence interval
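The bootstrap alternative mentioned above could look like this (case resampling; the data frame and formula are stand-ins for your own):

```r
## Stand-in data; replace with your own
set.seed(1)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

## Refit the model on resampled rows and record the contrast each time
boot_diff <- replicate(2000, {
  i <- sample(nrow(dat), replace = TRUE)
  b <- coef(lm(y ~ x1 + x2, data = dat[i, ]))
  unname(b["x1"] - b["x2"])
})

quantile(boot_diff, c(0.025, 0.975))  # percentile CI for x1 - x2
```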
These two expressions disagree, as you note, in how the residuals (the differences between $Y$ and the fitted values) enter the calculation of the standard errors. They are indeed different estimators, but they converge to the same thing in the long run, and they can be combined to create a "sandwich" estimator.
To revisit some basic modeling assumptions: the weighted linear regression model is estimated from a weighted estimating equation of the form:
$$U(\beta) = \mathbf{X}^T \mathbf{W}\left( Y - \mathbf{X}\beta\right)$$
where $\mathbf{W}$ is the diagonal matrix of weights. This estimating equation is also the score equation (the gradient of the log-likelihood) for the MLE. Then the expected information is:
$$\mathbf{A}= -\frac{\partial U(\beta)}{\partial \beta} = \mathbf{X}^T\mathbf{W} \mathbf{X}$$
Then $\mathbf{A}^{-1}$ is a consistent estimator of the covariance matrix for $\hat\beta$ when 1. the mean model is correctly specified and 2. the weights are the inverse variances of the residuals. You have already stated the $\mathbf{A}$ matrix in your first display. Contrast this with the observed information:
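A quick numerical check of this form against R's own `vcov()` for a weighted least squares fit (simulated data as a stand-in):

```r
## Simulate heteroskedastic data with known inverse-variance weights
set.seed(2)
n <- 50
x <- rnorm(n)
w <- runif(n, 0.5, 2)
y <- 1 + 2 * x + rnorm(n, sd = 1 / sqrt(w))
fit <- lm(y ~ x, weights = w)

## A = X' W X, scaled by the weighted residual variance
X  <- model.matrix(fit)
A  <- crossprod(X, w * X)
s2 <- sum(w * residuals(fit)^2) / (n - 2)

all.equal(unname(s2 * solve(A)), unname(vcov(fit)))  # TRUE
```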
$$\mathbf{B} = E[U(\beta)U(\beta)^T] = \mathbf{X}^T \mathbf{W}\,E\left[(Y-\mathbf{X}\beta)(Y-\mathbf{X}\beta)^T\right] \mathbf{W}\mathbf{X} $$
When the weights are correctly specified, the expectation of the squared errors is a constant times the inverse of $\mathbf{W}$, so one of the weight matrices cancels and the constant factors out of the expression; you'll note that constant is estimated by $\hat\sigma_e^2= \sum_{i=1}^n w_i (y_i - a - bx_i)^2/(n-2)$. $\mathbf{B}$ is also a consistent estimator of the information matrix, but it will disagree with $\mathbf{A}$ in finite samples.
As for which one to use, why not use both? The sandwich estimator $\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}$ remains consistent even when the weights are misspecified; it only requires that the mean model be correct.
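A hand-rolled sandwich $\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}$ for a weighted fit might look like this (a sketch; the simulated data stand in for your own):

```r
## Stand-in weighted regression
set.seed(4)
n <- 200
x <- rnorm(n)
w <- runif(n, 0.5, 2)
y <- 1 + 2 * x + rnorm(n, sd = 1 / sqrt(w))
fit <- lm(y ~ x, weights = w)

X <- model.matrix(fit)
e <- residuals(fit)

A <- crossprod(X, w * X)           # X' W X
B <- crossprod(X, (w * e)^2 * X)   # X' W diag(e^2) W X
sand <- solve(A) %*% B %*% solve(A)
sand  # robust covariance estimate for (intercept, slope)
```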
Best Answer
Multiply your unscaled covariance matrix by the dispersion parameter, as is done in summary.glm. The relevant line of summary.glm is

chol2inv(Qr$qr[p1, p1, drop = FALSE])

which computes $(R^\top R)^{-1}=(X^\top WX)^{-1}$, the quantity you comment on. Here, $R$ is the upper triangular matrix from the QR decomposition $QR=\sqrt{W}X$. atiretoo's answer only holds when the dispersion parameter is one, as with the Poisson and binomial distributions.
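You can verify the scaling on a family where the dispersion is not one, e.g. a Gamma GLM (simulated data as a stand-in):

```r
## Stand-in Gamma GLM with log link
set.seed(5)
d <- data.frame(x = rnorm(200))
mu <- exp(1 + 0.5 * d$x)
d$y <- rgamma(200, shape = 2, rate = 2 / mu)
fit <- glm(y ~ x, family = Gamma(link = "log"), data = d)

s <- summary(fit)

## Scaled covariance = unscaled covariance times the dispersion
all.equal(s$cov.scaled, s$cov.unscaled * s$dispersion)  # TRUE

## The unscaled matrix is (X' W X)^{-1}, recovered from the QR pieces
p1 <- 1:fit$rank
all.equal(chol2inv(fit$qr$qr[p1, p1, drop = FALSE]),
          unname(s$cov.unscaled))  # TRUE
```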