It appears you know the whole technique, so I will only deal with the specific question - and the answer is:
You should use the sample sizes that relate to the "explanatory" variables - which I take to be what you mean by "independent" variables, i.e. the regressors.
This comes out of the following:
In a regression setting, we make assumptions about the error term conditional on the regressors: namely in a (matrix notation) model
$$ \mathbf y = \mathbf X\beta + \mathbf u$$
we specify $E(\mathbf u \mid \mathbf X) = 0,\; E(\mathbf u \mathbf u'\mid \mathbf X) = \sigma^2\mathbf I$.
In a heteroskedastic setting, we essentially think of conditional heteroskedasticity, namely, we assume (or suspect) that $$E(\mathbf u \mathbf u'\mid \mathbf X) = \sigma^2\mathbf \Omega $$
Since the expected value is conditional on $\mathbf X$, it will be a function of $\mathbf X$ (and not of $\mathbf y$, which itself contains the error term), i.e. $\mathbf \Omega = g(\mathbf X)$.
So for logical consistency, you must use characteristics related to the regressors in order to theoretically model heteroskedasticity, and then apply weighted least-squares.
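For concreteness, here is a minimal sketch of one common way to implement this in R: feasible weighted least squares, where the variance function $g(\mathbf X)$ is estimated from the squared OLS residuals and then turned into weights. The exponential variance model and the object names below are my own illustrative assumptions, not something taken from the question.

set.seed(1)
n  <- 1000
x0 <- rnorm(n)
y0 <- x0 + rnorm(n, sd = exp(0.5 * x0))   # error variance depends on the regressor

ols <- lm(y0 ~ x0)

# Step 1: estimate the variance function g(X) from the squared OLS residuals
aux <- lm(log(resid(ols)^2) ~ x0)

# Step 2: weights are the inverse of the estimated conditional variance
w0 <- 1 / exp(fitted(aux))

# Step 3: weighted least squares using those weights
fgls <- lm(y0 ~ x0, weights = w0)
summary(fgls)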
You can do this with lm and associated functions, but you need to be a little careful about how you construct your weights.
Here's an example / walkthrough. Note that the weights are normalized so that the average weight = 1. I'll follow with what happens if they aren't normalized. I've deleted a lot of the less relevant printout associated with various functions.
x <- rnorm(1000)
y <- x + rnorm(1000)
wts <- rev(0.998^(0:999)) # Weights go from 0.135 to 1
wts <- wts / mean(wts) # Now we normalize to mean 1
> summary(unwtd_lm <- lm(y~x))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.04238 0.031 ---
x 1.03071 0.03268 31.539 <2e-16 ***
Residual standard error: 1.01 on 998 degrees of freedom
> summary(wtd_lm <- lm(y~x, weights=wts))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03436 0.03227 1.065 0.287
x 1.03869 0.03295 31.524 <2e-16 ***
Residual standard error: 1.02 on 998 degrees of freedom
You can see that with this much data we don't have much difference between the two estimates, but there is some.
Now for your question. It's not clear whether you want the distance measured in standard errors of the fitted value or in standard errors of prediction, so I'll show both. Let us say we are doing this for the value $x = 1$ and the target value (green dot) $y = 1.1$:
> y_eval <- 1.10
> wtd_pred <- predict(wtd_lm, newdata=data.frame(x=1), se.fit=TRUE)
> # Distance relative to predictive std. error
> (y_eval-wtd_pred$fit[1]) / sqrt(wtd_pred$se.fit^2 + wtd_pred$residual.scale^2)
[1] 0.02639818
>
> # Distance relative to fitted std. error
> (y_eval-wtd_pred$fit[1]) / wtd_pred$se.fit
[1] 0.5945089
where I've deleted the warning message associated with predictive confidence intervals and weighted model fits.
Now I'll show you how to do the residual variance calculation. First, if your weights aren't normalized, you will have problems:
> wts <- rev(0.998^(0:999))
> summary(wtd_lm <- lm(y~x, weights=wts))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03436 0.03227 1.065 0.287
x 1.03869 0.03295 31.524 <2e-16 ***
Residual standard error: 0.6707 on 998 degrees of freedom
> predict(wtd_lm, newdata=data.frame(x=1), interval="prediction")
fit lwr upr
1 1.073049 -0.2461643 2.392262
Note how that residual standard error has gone way down and the prediction confidence interval has really changed, but the coefficient estimates themselves have not. This is because the calculation for the residual s.e. divides by the residual degrees of freedom (998 in this case) without regard for the scale of the weights. Here's the calculation, mostly lifted from the interior of summary.lm:
w <- wtd_lm$weights              # the weights passed to lm()
r <- wtd_lm$residuals            # response-scale residuals
rss <- sum(w * r^2)              # weighted residual sum of squares
sqrt(rss / wtd_lm$df.residual)   # divided by the residual d.f. (998), ignoring the weight scale
[1] 0.6707338
which you can see matches the residual s.e. in the previous printout.
Here's how you ought to do this calculation if you find yourself in a position where you need to do it by hand, so to speak:
> rss_w <- sum(w * r^2) / mean(w)    # rescale the weighted RSS by the mean weight
> sqrt(rss_w / wtd_lm$df.residual)
[1] 1.019937
However, normalizing the weights up front takes care of the need to divide by mean(w), and the various lm-related calculations come out correctly without any further manual intervention.
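As a quick check (my own sketch; the object names are illustrative), you can refit with the weights rescaled to mean 1 and confirm that lm's reported residual standard error now agrees with the hand-adjusted value:

wts_n    <- wts / mean(wts)              # rescale the unnormalized weights to mean 1
wtd_lm_n <- lm(y ~ x, weights = wts_n)
summary(wtd_lm_n)$sigma                  # matches the 1.019937 computed by hand above

The coefficient estimates are unchanged by rescaling the weights; only the reported residual scale (and hence the prediction intervals) is affected.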
Best Answer
Those weights ($w_i$ based on the predicted values $\hat{y}_i$) relate to quasi-likelihood estimation for generalized linear models (GLMs). In the quasi-likelihood approach you take the freedom (at the cost of an exact likelihood computation) to specify only the relation between the mean and the variance, rather than fully specifying the error distribution.
E.g. for Poisson regression the mean and variance are implicitly equal, $V = \mu$ (and for binomial regression the variance is likewise a fixed function of the mean), which is too strong a restriction when the errors are not exactly Poisson or binomial (for instance, they may instead follow some over-dispersed version of those distributions). With quasi models you do not 'care' about the exact distribution; you just define explicitly $V = c\,\mu$ (that multiplicative factor makes it less restrictive) and proceed as if you were solving a real likelihood for an exactly known distribution.
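As a small, self-contained illustration (my own sketch, not from the original answer): with family = quasipoisson, glm() only assumes $V = \phi\,\mu$ and estimates the dispersion $\phi$ from the data, instead of fixing $\phi = 1$ as the poisson family does.

set.seed(2)
x2     <- rnorm(200)
mu2    <- exp(1 + 0.5 * x2)
counts <- rnbinom(200, mu = mu2, size = 2)          # overdispersed counts
summary(glm(counts ~ x2, family = quasipoisson))    # dispersion estimate should come out well above 1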
By adjusting the weights according to some function of the (predicted) outcome, $w_i = 1/f(\hat{y}_i)$, you are correcting heteroscedasticity as if using a relation between the mean and variance, $V \propto f(\mu)$, but without knowing the exact distribution.
The second case, $w_i = 1/\hat{y}_i$, you might use if you expect/assume (overdispersed) Poisson, binomial, chi-squared (and possibly other) error distributions, which have a linear relation $V = c\,\mu$ between mean and variance.
The third case, $w_i = (1/\hat{y}_i)^2$, you might use for (overdispersed) exponentially distributed errors, which have a quadratic relation $V = c\,\mu^2$ between mean and variance.
The first case, which seems to use some linear function of the absolute residuals, allows a much more flexible approximation and is (I guess) used on an ad hoc basis for 'stranger' distributions.
You could see it as approximating the error distribution by a Gaussian distribution with $\epsilon_i \sim N(0,\, f(\hat{y}_i))$.
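And here is a rough sketch of the second case in R (the data-generating process and object names are mine, purely for illustration): a single re-weighting step with $w_i = 1/\hat{y}_i$, mimicking the assumption that the variance is proportional to the mean.

set.seed(3)
x3   <- runif(300, 1, 5)
y3   <- rpois(300, lambda = 2 * x3)      # variance equal to the mean
ols3 <- lm(y3 ~ x3)                      # first pass, unweighted
w3   <- 1 / pmax(fitted(ols3), 1e-6)     # w_i = 1 / y_hat_i (guarding against non-positive fits)
wls3 <- lm(y3 ~ x3, weights = w3)
summary(wls3)                            # compare the standard errors with summary(ols3)

In practice you would iterate (refit, recompute the weights from the new fitted values) until the estimates stabilize, which is essentially what the IRLS algorithm inside glm() does for you.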