These two expressions disagree, as you note, in how they use the residuals: the differences between $Y$ and the predicted values are included in one calculation of the standard errors and omitted from the other. They are indeed different estimators, but they converge to the same thing in the long run. They can also be combined to form a "sandwich" estimator.
To revisit some basic modeling assumptions: the weighted linear regression model is estimated from a weighted estimating equation of the form:
$$U(\beta) = \mathbf{X}^T \mathbf{W}\left( Y - \mathbf{X}\beta\right)$$
where $\mathbf{W}$ is just the diagonal matrix of weights. This estimating equation is also the normal equations (the score of the weighted log likelihood) for the MLE; setting $U(\beta) = 0$ yields the familiar weighted least squares solution $\hat\beta = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}Y$. Then the expected information is:
$$\mathbf{A}= -\frac{\partial U(\beta)}{\partial \beta} = \mathbf{X}^T\mathbf{W} \mathbf{X}$$
Then $\mathbf{A}^{-1}$ is a consistent estimator of the covariance matrix for $\hat\beta$ when 1. the mean model is correctly specified and 2. the weights are the inverse variances of the residuals. You have already stated the $\mathbf{A}$ matrix in your first display.
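As a concrete check, here is a minimal R sketch (made-up data and variable names are mine) that builds $\mathbf{A}$ and verifies that `lm()`'s model-based covariance is $\hat\sigma^2\mathbf{A}^{-1}$:

```r
set.seed(1)
n <- 100
x <- rnorm(n)
w <- runif(n, 0.5, 2)                        # weights = inverse error variances
y <- 1 + 2 * x + rnorm(n, sd = 1 / sqrt(w))  # heteroskedastic errors

X <- cbind(1, x)
W <- diag(w)
A <- t(X) %*% W %*% X                        # expected information, X^T W X

fit <- lm(y ~ x, weights = w)
sigma2 <- summary(fit)$sigma^2               # weighted residual variance
all.equal(unname(vcov(fit)), unname(sigma2 * solve(A)))  # TRUE
```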
Contrast this with the observed information:
$$\mathbf{B} = E\left[U(\beta)U(\beta)^T\right] = \mathbf{X}^T \mathbf{W}\,E\left[(Y-\mathbf{X}\beta)(Y-\mathbf{X}\beta)^T\right] \mathbf{W}\mathbf{X}$$
One of the weight matrices can multiply with the squared errors and factor out of the expression as a constant, because it is orthogonal to $\mathbf{X}$; you'll note that this is the expression for $\sigma_e^2= \sum_{i=1}^n w_i (y_i - a - bx_i)^2/(n-2)$. $\mathbf{B}$ is also a consistent estimator of the information matrix, but it will disagree with $\mathbf{A}$ in finite samples.
As for which one to use, why not use both? A sandwich estimator is obtained as $\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}$, and its validity depends neither on the mean model being correct nor on the weights being properly specified.
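Continuing the sketch above, an empirical sandwich plugs the observed squared residuals into $\mathbf{B}$ (this mirrors the HC0-style robust covariance; the `sandwich` package line is an optional cross-check):

```r
# empirical "meat": observed squared residuals stand in for the expectation in B
B_hat <- t(X) %*% W %*% diag(resid(fit)^2) %*% W %*% X
bread <- solve(t(X) %*% W %*% X)           # A^{-1}
vcov_sandwich <- bread %*% B_hat %*% bread

# optional cross-check against the sandwich package (HC0 flavor):
# library(sandwich); vcovHC(fit, type = "HC0")
```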
Note $Var(\hat{\beta}_0) = Var(\bar{y} - \hat{\beta}_1\bar{x}) = Var(\bar{y}) + \bar{x}^2Var(\hat{\beta}_1) - 2\bar{x}\,Cov(\bar{y},\hat{\beta}_1)$. Try to show that the covariance term is 0.
The fact that $Var(\hat{\mu}) = \dfrac{\sigma^2}{n}$ (although I'm not a fan of the notation used there) is what gives the $Var(\bar{y}) = \dfrac{\sigma^2}{n}$ term in the calculation.
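As a sanity check before attempting the proof, a quick simulation (a sketch with an arbitrary true model, not a proof) suggests the covariance really is zero:

```r
set.seed(42)
x <- runif(30)                      # fixed design across replications
sims <- replicate(5000, {
  y <- 4 + 5 * x + rnorm(30)
  c(mean(y), coef(lm(y ~ x))[[2]])  # (ybar, beta1_hat)
})
cov(sims[1, ], sims[2, ])           # approximately 0
```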
Best Answer
Sure, a worse fit gives larger residuals, which gives a larger standard error estimate. But this is not why we compute the standard error estimate. If all you care about is fit, you don't need to bother with the standard error of $\hat \beta$ at all; just use $\hat\sigma^2$ directly, which is the MSE of your model. We use $\text{s.e.}(\hat \beta)$ for inference: to say something about how well we have estimated $\beta$. In short, to construct confidence intervals around our point estimate.
If we want to trust these intervals and their relation to the population of $\beta$s, $\text{s.e.}(\hat\beta)$ should be correct. It should be the standard deviation of whatever distribution $\hat \beta$ has. If the model is specified correctly, we can use the standard formulas to estimate $\text{s.e.}(\hat\beta)$ directly, and we know exactly which properties $\hat\beta$ has. These properties and formulas — and hence our inferences about $\hat\beta$ — are directly derived from the model assumptions. If the model isn't correctly specified, all our beautiful and clean theory goes out the window.
Elsewhere in the ISLR book you can read that an approximate 95% confidence interval for $\beta$ is
$$[\hat\beta - 2\cdot\text{s.e.}(\hat\beta), \hat\beta + 2\cdot\text{s.e.}(\hat\beta)].$$
This is true if you have a good standard error estimate, but if the estimate is too small/large, the confidence interval is too tight/wide. You can no longer have 95% confidence in your 95% confidence interval.
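For concreteness, here is how one might compute that interval from `lm()` output in R (simulated data of my own choosing; `confint()` gives the exact t-based interval for comparison):

```r
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
c(lower = est - 2 * se, upper = est + 2 * se)

confint(fit, "x", level = 0.95)  # exact t-based interval, for comparison
```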
A simulation study
Below is an illustration in code and figures. The examples in ISLR use some data that look quadratic. I will simulate some data so that we know what the truth is. My true model is $y = 4 + 5x - 3x^2 + \epsilon$, where $\epsilon \sim N(0,1)$. This is classic linear regression; the big assumption is that, conditional on $x$, the errors are iid normal with mean zero. A quadratic model is the perfect fit. A linear model is a poor fit.
The figure below shows some simulated data in grey, the true model in black, and a fitted linear regression in red. Relative to the linear fit, the errors are not mean-zero normal conditional on $x$: they are consistently below zero to the right and to the left, and consistently above zero in the middle.
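Here is a sketch that would reproduce a figure like the one described; the x-range and plotting details are my guesses, not the author's exact code:

```r
set.seed(2019)
n <- 100
x <- runif(n, -2, 4)                    # my guess at the x-range
y <- 4 + 5 * x - 3 * x^2 + rnorm(n)     # true model with N(0,1) errors

plot(x, y, col = "grey")
curve(4 + 5 * x - 3 * x^2, add = TRUE)  # truth in black
abline(lm(y ~ x), col = "red")          # misspecified linear fit
```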
To explore how a bootstrap estimate of the s.e. behaves compared to the standard parametric estimate, I ran a small simulation. First I generate a data set similar to the one above; I then estimate the standard error for the coefficient of $x$ both with the parametric assumptions in `lm()` and with the bootstrap. This gives us i) a distribution of how the standard estimate behaves here and ii) a distribution of how the bootstrap estimate behaves. I also calculate the "true" s.e. of $\hat \beta_x$ by repeatedly generating data and calculating $\hat\beta_x$; the standard deviation of all these estimates serves as the truth.
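A condensed R sketch of that procedure (the sample size, replication counts, and plotting choices are mine; the original reprex may differ):

```r
set.seed(2019)
n <- 100; n_sim <- 200; n_boot <- 200  # my choices, not the author's

gen_data <- function() {
  x <- runif(n, -2, 4)
  data.frame(x = x, y = 4 + 5 * x - 3 * x^2 + rnorm(n))
}

one_run <- function() {
  d <- gen_data()
  fit <- lm(y ~ x, data = d)
  se_lm <- coef(summary(fit))["x", "Std. Error"]  # parametric s.e.
  boot_b <- replicate(n_boot, {
    db <- d[sample(nrow(d), replace = TRUE), ]    # resample rows
    coef(lm(y ~ x, data = db))[["x"]]
  })
  c(se_lm = se_lm, se_boot = sd(boot_b))
}

res <- replicate(n_sim, one_run())

# "true" s.e.: sd of beta_x over fresh datasets from the true model
true_se <- sd(replicate(n_sim, coef(lm(y ~ x, data = gen_data()))[["x"]]))

plot(density(res["se_lm", ]), main = "s.e. estimates for the coefficient of x")
lines(density(res["se_boot", ]), lty = 2)  # bootstrap: dashed
abline(v = true_se, col = "red")           # truth: red
```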
The standard estimate is the solid line, the bootstrap is the dashed line, and the truth is in red. The standard estimate greatly underestimates the true s.e., and the bootstrap somewhat underestimates it. The bootstrap gets much closer on average, and at least sometimes it is in the right area.