Solved – How to determine weights for WLS regression in R

Tags: multiple-regression, r, weighted-regression

I am trying to predict age as a function of a set of DNA methylation markers. The predictors are continuous values between 0 and 100. When I fit an OLS regression, I can see that the residual variance increases with age.

Thus, I decided to fit a weighted regression model. However, I am having trouble deciding how to define the weights. I have used the feasible GLS (fGLS) method, like so:

OLS <- lm(Y ~ X)                     # Initial OLS fit
OLSres <- resid(OLS)                 # Extract OLS residuals
OLSressq <- OLSres^2                 # Square the residuals
lnOLSressq <- log(OLSressq)          # Take natural log of squared residuals
aux <- lm(lnOLSressq ~ X)            # Run auxiliary regression on the predictors
ghat <- fitted(aux)                  # Fitted values g^ from the auxiliary model
hhat <- exp(ghat)                    # h^ = exp(g^), the estimated variance function
fGLS <- lm(Y ~ X, weights = 1/hhat)  # Weight is 1/h^

And these were my results:

Call:
lm(formula = Y ~ X, weights = 1/hhat)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-4.9288 -1.2491 -0.1325  1.2626  5.1452 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 23.1009494  5.2299867   4.417 1.64e-05 ***
XASPA       -0.1441404  0.0474738  -3.036  0.00271 ** 
XPDE4C       0.6421385  0.0812891   7.899 1.83e-13 ***
XELOVL2     -0.2040382  0.0866564  -2.355  0.01951 *  
XELOVL2sq    0.0088532  0.0009381   9.438  < 2e-16 ***
XEDARADD    -0.1965472  0.0348989  -5.632 5.98e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.762 on 200 degrees of freedom
Multiple R-squared:  0.9687,    Adjusted R-squared:  0.9679 
F-statistic:  1239 on 5 and 200 DF,  p-value: < 2.2e-16

However, before figuring out how to perform the fGLS method, I was playing around with different weights just to see what would happen. I used 1/(squared residuals of OLS model) as weights and ended up with this:

Call:
lm(formula = Y ~ X, weights = 1/OLSressq)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-1.0893 -0.9916 -0.7855  0.9998  2.0238 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.8756737  1.1355861   27.19   <2e-16 ***
XASPA       -0.1956188  0.0116329  -16.82   <2e-16 ***
XPDE4C       0.6168490  0.0102149   60.39   <2e-16 ***
XELOVL2     -0.1596969  0.0116723  -13.68   <2e-16 ***
XELOVL2sq    0.0078459  0.0001593   49.26   <2e-16 ***
XEDARADD    -0.2492048  0.0068751  -36.25   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1 on 200 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 1.133e+06 on 5 and 200 DF,  p-value: < 2.2e-16

Since the residual standard error is smaller, R² equals 1 (is that even possible?), and the F statistic is much higher, I am tempted to conclude that this model is better than the one I obtained through the fGLS method. However, it seems to me that weights picked more or less at random through trial and error should yield worse results than weights that were actually estimated mathematically.

Can someone give me some advice on which weights to use for my model?
I have also read here and there that you cannot interpret R² the same way as in OLS regression. But then how should it be interpreted, and can I still use it to compare my WLS model to my OLS model?

Best Answer

There are two issues here:

  1. You would, ideally, use weights inversely proportional to the variance of the individual $Y_i$. So says the Gauss–Markov theorem.

  2. You don't know the variance of the individual $Y_i$.

If you have deterministic weights $w_i$, you are in exactly the situation that WLS/GLS are designed for. One traditional example is when each observation is an average of multiple measurements and $w_i$ is the number of measurements that went into observation $i$.
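For illustration, here is a minimal sketch of that classic setting, with made-up variable names: each $y_i$ is the mean of $n_i$ raw measurements, so $\mathrm{Var}(y_i)=\sigma^2/n_i$ and the optimal weights are proportional to the known counts $n_i$.

set.seed(1)
n_i <- sample(2:10, 50, replace = TRUE)   # measurements behind each observation
x <- runif(50)
y <- sapply(seq_along(x), function(i)     # each y_i is a mean of n_i raw draws
  mean(rnorm(n_i[i], mean = 2 + 3 * x[i], sd = 2)))
fit <- lm(y ~ x, weights = n_i)           # deterministic weights, known in advance
summary(fit)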

If you have weights that depend on the data through a small number of parameters, you can treat them as fixed and use them in WLS/GLS even though they aren't fixed. For example, you could estimate $\sigma^2(\mu)$ as a function of the fitted $\mu$ and use $w_i=1/\sigma^2(\mu_i)$ -- this seems to be what you are doing in the first example. This is also what happens in linear mixed models, where the weights for the fixed-effects part of the model depend on the variance components, which are estimated from the data.
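A minimal sketch of that variance-function idea, assuming Y and X are as in your question; it differs from your fGLS code only in that the log squared residuals are regressed on the fitted mean $\hat\mu$ rather than on $X$:

ols <- lm(Y ~ X)                      # mean model
mu <- fitted(ols)                     # fitted means mu^
vfit <- lm(log(resid(ols)^2) ~ mu)    # variance model with only 2 parameters
w <- 1 / exp(fitted(vfit))            # w_i = 1 / sigma^2(mu_i)
wls <- lm(Y ~ X, weights = w)         # weights treated as if fixed
summary(wls)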

In this scenario it is possible to prove that although there is some randomness in the weights, it does not affect the large-sample distribution of the resulting $\hat\beta$. It's ok to treat the $w_i$ as if they were known in advance.

If you have weights that are not nearly deterministic, the whole thing breaks down and the randomness in the weights becomes important for both bias and variance. That's what happens in your second example, when you use $w_i=1/r_i^2$. It's an obvious thing to think of, but it doesn't work. The estimating equations (normal equations, score equations) for $\hat\beta$ are
$$\sum_i x_iw_i(y_i-x_i\beta)=0.$$
With that choice of weights, you get
$$\sum_i x_i\frac{(y_i-x_i\beta)}{(y_i-x_i\hat\beta^*)^2}=0,$$
where $\hat\beta^*$ is the unweighted estimate. If the new estimate is close to the old one (which should be true for large data sets, because both are consistent), you'd end up with equations like
$$\sum_i x_i\frac{1}{(y_i-x_i\beta)}=0,$$
which divide by a variable with mean zero, a bad sign.
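You can see the breakdown in a small simulation: with $w_i=1/r_i^2$, observations whose OLS residuals happen to be tiny get enormous weight, so the weighted $R^2$ is pushed towards 1 no matter how noisy the data are. A sketch with made-up data, not your model:

set.seed(42)
x <- runif(200, 0, 10)
y <- 1 + 2 * x + rnorm(200, sd = x)    # variance grows with x
r <- resid(lm(y ~ x))                  # OLS residuals
summary(lm(y ~ x, weights = 1 / r^2))  # weighted R^2 near 1 by construction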

So:

It's ok to estimate the weights if you have a good mean model (so that the squared residuals are approximately unbiased for the variance) and as long as you don't overfit them. If you do overfit them, you will get a bad estimate of $\beta$ and inaccurate standard errors.
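One practical way to keep the variance model parsimonious is to estimate a low-dimensional variance function jointly with the mean. A sketch using the nlme package (one option among several, again assuming Y and X as in the question):

library(nlme)
# varPower adds a single variance parameter delta, with
# Var(Y_i) proportional to |fitted value|^(2*delta),
# which guards against overfitting the weights
fit <- gls(Y ~ X, weights = varPower(form = ~ fitted(.)))
summary(fit)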