Solved – How to determine weights for WLS regression in R

Tags: multiple-regression, r, weighted-regression

I am trying to predict age as a function of a set of DNA methylation markers. The predictors are continuous values between 0 and 100. When I fit an OLS regression, I can see that the residual variance increases with age.

Thus, I decided to fit a weighted regression model. However, I am having trouble deciding how to define the weights. I have used the feasible GLS (fGLS) method, like so:

OLS <- lm(Y ~ X)                     # Initial OLS fit
OLSres <- resid(OLS)                 # Extract OLS residuals
OLSressq <- OLSres^2                 # Square the residuals
lnOLSressq <- log(OLSressq)          # Take natural log of squared residuals
aux <- lm(lnOLSressq ~ X)            # Run auxiliary regression on the predictors
ghat <- fitted(aux)                  # Fitted values g^ from the auxiliary model
hhat <- exp(ghat)                    # h^ = exp(g^), the estimated variance function
fGLS <- lm(Y ~ X, weights = 1/hhat)  # Weight is 1/h^

And these were my results:

Call:
lm(formula = Y ~ X, weights = 1/hhat)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-4.9288 -1.2491 -0.1325  1.2626  5.1452 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 23.1009494  5.2299867   4.417 1.64e-05 ***
XASPA       -0.1441404  0.0474738  -3.036  0.00271 ** 
XPDE4C       0.6421385  0.0812891   7.899 1.83e-13 ***
XELOVL2     -0.2040382  0.0866564  -2.355  0.01951 *  
XELOVL2sq    0.0088532  0.0009381   9.438  < 2e-16 ***
XEDARADD    -0.1965472  0.0348989  -5.632 5.98e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.762 on 200 degrees of freedom
Multiple R-squared:  0.9687,    Adjusted R-squared:  0.9679 
F-statistic:  1239 on 5 and 200 DF,  p-value: < 2.2e-16

However, before figuring out how to perform the fGLS method, I was playing around with different weights just to see what would happen. I used 1/(squared residuals of OLS model) as weights and ended up with this:

Call:
lm(formula = Y ~ X, weights = 1/OLSressq)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-1.0893 -0.9916 -0.7855  0.9998  2.0238 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.8756737  1.1355861   27.19   <2e-16 ***
XASPA       -0.1956188  0.0116329  -16.82   <2e-16 ***
XPDE4C       0.6168490  0.0102149   60.39   <2e-16 ***
XELOVL2     -0.1596969  0.0116723  -13.68   <2e-16 ***
XELOVL2sq    0.0078459  0.0001593   49.26   <2e-16 ***
XEDARADD    -0.2492048  0.0068751  -36.25   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1 on 200 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.133e+06 on 5 and 200 DF,  p-value: < 2.2e-16

Since the residual standard error is smaller, R² equals 1 (is that even possible?), and the F statistic is much higher, I am tempted to conclude that this model is better than the one I obtained through the fGLS method. However, it seems to me that weights picked more or less at random through trial and error should yield worse results than weights that were actually estimated mathematically.

Can someone give me some advice on which weights to use for my model?
I have also read here and there that you cannot interpret R² the same way as in OLS regression. But then how should it be interpreted, and can I still use it to compare my WLS model to my OLS model?

Best Answer

There are two issues here:

  1. You would, ideally, use weights inversely proportional to the variance of the individual $Y_i$. So says the Gauss–Markov theorem.

  2. You don't know the variance of the individual $Y_i$.

If you have deterministic weights $w_i$, you are in exactly the situation that WLS/GLS are designed for. One traditional example is when each observation is an average of multiple measurements and $w_i$ is the number of measurements that went into observation $i$.
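For illustration, here is a minimal sketch of that classic setting, with made-up variable names: each $y_i$ is the mean of $n_i$ raw measurements, so $\mathrm{Var}(y_i)=\sigma^2/n_i$ and the optimal weights are proportional to the known counts $n_i$.

set.seed(1)
n_i <- sample(2:10, 50, replace = TRUE)   # measurements behind each observation
x <- runif(50)
y <- sapply(seq_along(x), function(i)     # each y_i is a mean of n_i raw draws
  mean(rnorm(n_i[i], mean = 2 + 3 * x[i], sd = 2)))
fit <- lm(y ~ x, weights = n_i)           # deterministic weights, known in advance
summary(fit)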

If you have weights that depend on the data through a small number of parameters, you can treat them as fixed and use them in WLS/GLS even though they aren't fixed. For example, you could estimate $\sigma^2(\mu)$ as a function of the fitted $\mu$ and use $w_i=1/\sigma^2(\mu_i)$ -- this seems to be what you are doing in the first example. This is also what happens in linear mixed models, where the weights for the fixed-effects part of the model depend on the variance components, which are estimated from the data.
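A minimal sketch of that variance-function idea, assuming Y and X are as in your question; it differs from your fGLS code only in that the log squared residuals are regressed on the fitted mean $\hat\mu$ rather than on $X$:

ols <- lm(Y ~ X)                      # mean model
mu <- fitted(ols)                     # fitted means mu^
vfit <- lm(log(resid(ols)^2) ~ mu)    # variance model with only 2 parameters
w <- 1 / exp(fitted(vfit))            # w_i = 1 / sigma^2(mu_i)
wls <- lm(Y ~ X, weights = w)         # weights treated as if fixed
summary(wls)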

In this scenario it is possible to prove that although there is some randomness in the weights, it does not affect the large-sample distribution of the resulting $\hat\beta$. It's ok to treat the $w_i$ as if they were known in advance.

If you have weights that are not nearly deterministic, the whole thing breaks down and the randomness in the weights becomes important for both bias and variance. That's what happens in your second example, when you use $w_i=1/r_i^2$. It's an obvious thing to think of, but it doesn't work. The estimating equations (normal equations, score equations) for $\hat\beta$ are
$$\sum_i x_iw_i(y_i-x_i\beta)=0.$$
With that choice of weights, you get
$$\sum_i x_i\frac{(y_i-x_i\beta)}{(y_i-x_i\hat\beta^*)^2}=0,$$
where $\hat\beta^*$ is the unweighted estimate. If the new estimate is close to the old one (which should be true for large data sets, because both are consistent), you'd end up with equations like
$$\sum_i x_i\frac{1}{(y_i-x_i\beta)}=0,$$
which divide by a variable with mean zero, a bad sign.
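You can see the breakdown in a small simulation: with $w_i=1/r_i^2$, observations whose OLS residuals happen to be tiny get enormous weight, so the weighted $R^2$ is pushed towards 1 no matter how noisy the data are. A sketch with made-up data, not your model:

set.seed(42)
x <- runif(200, 0, 10)
y <- 1 + 2 * x + rnorm(200, sd = x)    # variance grows with x
r <- resid(lm(y ~ x))                  # OLS residuals
summary(lm(y ~ x, weights = 1 / r^2))  # weighted R^2 near 1 by construction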

So:

It's ok to estimate the weights if you have a good mean model (so that the squared residuals are approximately unbiased for the variance) and as long as you don't overfit them. If you do overfit them, you will get a bad estimate of $\beta$ and inaccurate standard errors.
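One practical way to keep the variance model parsimonious is to estimate a low-dimensional variance function jointly with the mean. A sketch using the nlme package (one option among several, again assuming Y and X as in the question):

library(nlme)
# varPower adds a single variance parameter delta, with
# Var(Y_i) proportional to |fitted value|^(2*delta),
# which guards against overfitting the weights
fit <- gls(Y ~ X, weights = varPower(form = ~ fitted(.)))
summary(fit)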