Weighted Regression – How to Find Weights for Weighted Least Squares Regression?

heteroscedasticity, regression, weighted-regression

I am a bit lost in the process of WLS regression. I have been given a dataset, and my task is to test whether there is heteroscedasticity; if so, I should run a WLS regression.

I have carried out the test and found evidence of heteroscedasticity, so I need to run the WLS. I have been told that WLS is basically OLS regression on a transformed model, but I am a bit confused about how to find the transformation function. I have read some articles which suggested that the transformation can be a function of the squared residuals from an OLS regression, but I would appreciate it if someone could help me get on the right track.

Best Answer

Weighted least squares (WLS) regression is not a transformed model. Instead, you are simply treating each observation as more or less informative about the underlying relationship between $X$ and $Y$. Those points that are more informative are given more 'weight', and those that are less informative are given less weight. You are right that WLS regression is technically only valid if the weights are known a priori.
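To make the idea of 'weight' concrete, here is the usual way the WLS criterion is written. The symbols $w_i$, $\sigma_i^2$, $x_i$, and $\beta$ are my notation for illustration, not something taken from the question:

$$\hat{\beta}_{WLS} = \arg\min_{\beta} \sum_{i=1}^{n} w_i \left(y_i - x_i^{\top}\beta\right)^2, \qquad w_i \propto \frac{1}{\sigma_i^2}$$

Here $\sigma_i^2$ is the error variance of observation $i$. OLS is the special case in which every $w_i$ is the same, and only the relative sizes of the weights matter, since multiplying all of them by a constant leaves the minimizer unchanged.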

However, OLS regression is fairly robust against heteroscedasticity, and thus so is WLS if your estimated weights are in the ballpark. A rule of thumb for OLS regression is that it isn't too impacted by heteroscedasticity as long as the maximum variance is not greater than 4 times the minimum variance. For example, if the variance of the residuals / errors increases with $X$, then you would be OK if the variance of the residuals at the high end were less than four times the variance of the residuals at the low end. The implication is that if your weights get you within that range, you are reasonably safe. It's kind of a horseshoes and hand grenades situation: close is good enough. As a result, you can try to estimate the function relating the variance of the residuals to the levels of your predictor variables.
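As a rough illustration of that rule of thumb, the sketch below compares the residual variance at the low end of $X$ with the variance at the high end. Everything in it is an assumption made for the example: the simulated data, the helper names `ols_residuals` and `end_variance_ratio`, and the choice to define "low end" and "high end" as the bottom and top third of the sorted $X$ values.

```python
import random
import statistics


def ols_residuals(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares and return the residuals."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def end_variance_ratio(x, resid, frac=1 / 3):
    """Larger-over-smaller residual variance between the low-x and high-x ends."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    k = max(2, int(len(x) * frac))  # number of points in each end (assumed: a third)
    v_low = statistics.variance([resid[i] for i in order[:k]])
    v_high = statistics.variance([resid[i] for i in order[-k:]])
    return max(v_low, v_high) / min(v_low, v_high)


# Hypothetical data: in y_hetero the error SD grows in proportion to x,
# in y_homo it is constant.
random.seed(1)
n = 300
x = [1 + 9 * i / (n - 1) for i in range(n)]
y_hetero = [2 + 3 * xi + random.gauss(0, 0.5 * xi) for xi in x]
y_homo = [2 + 3 * xi + random.gauss(0, 2.0) for xi in x]

ratio_hetero = end_variance_ratio(x, ols_residuals(x, y_hetero))
ratio_homo = end_variance_ratio(x, ols_residuals(x, y_homo))
print(f"variance ratio, growing spread:  {ratio_hetero:.2f}")
print(f"variance ratio, constant spread: {ratio_homo:.2f}")
```

A ratio well above 4 suggests the heteroscedasticity is worth addressing; a ratio below 4 suggests plain OLS is probably adequate by this rule of thumb.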

There are several issues pertaining to how such estimation should be done:

  1. Remember that the weights should be the reciprocal of the variance (or of whatever proxy for the variance you end up estimating).

  2. If your data occur only at discrete levels of $X$, like in an experiment or an ANOVA, then you can estimate the variance directly at each level of $X$ and use that. If the estimates are discrete levels of a continuous variable (e.g., 0 mg., 10 mg., 20 mg., etc.), you may want to smooth those, but it probably won't make much difference.

  3. Estimates of variances, due to the squaring, are very susceptible to outliers and/or high leverage points, though. If your data are not evenly distributed across $X$, or you have relatively few data, estimating the variance directly is not recommended. It is better to estimate something that is expected to correlate with variance, but which is more robust. A common choice would be to use the square root of the absolute values of the deviations from the conditional mean. (For example, in R, plot(model, which=3) will display a scatterplot of the square root of the absolute standardized residuals against the fitted values, which R labels "Scale-Location" and which is also called a "spread level plot", to help you diagnose potential heteroscedasticity; see my answer here.) Even more robust might be to use the conditional interquartile range, or the conditional median absolute deviation from the median.

  4. If $X$ is a continuous variable, the typical strategy is to use a simple OLS regression to get the residuals, and then regress one of the functions in [3] (most likely the root absolute deviation) onto $X$. The predicted value of this function at each point is then used to form the weight for that point (see [1]).

  5. Getting your weights from the residuals of an OLS regression is reasonable because OLS is unbiased, even in the presence of heteroscedasticity. Nonetheless, those weights are contingent on the original model, and may change the fit of the subsequent WLS model. Thus, you should check your results by comparing the estimated betas from the two regressions. If they are very similar, you are OK. If the WLS coefficients diverge from the OLS ones, you should use the WLS estimates to compute residuals manually (the reported residuals from the WLS fit will take the weights into account). Having calculated a new set of residuals, determine the weights again and use the new weights in a second WLS regression. This process should be repeated until two sets of estimated betas are sufficiently similar (even doing this once is uncommon, though).
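The steps in the list above can be sketched roughly as follows, for a single continuous predictor and using only the Python standard library. This is an illustration under stated assumptions, not a definitive implementation: the data are simulated, the function names (`weighted_fit`, `estimate_weights`, `iterated_wls`) are hypothetical, and the conversion from the fitted root absolute residual to a weight (raising it to the fourth power before taking the reciprocal) is my reading of points [1] and [4], on the grounds that the root absolute residual scales like the square root of the standard deviation.

```python
import math
import random


def weighted_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def estimate_weights(x, y, b0, b1):
    """Weights from an OLS regression of sqrt(|residual|) on x."""
    # Raw (unweighted) residuals, computed by hand from the current coefficients.
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    spread = [math.sqrt(abs(e)) for e in resid]
    # Plain OLS of the spread proxy on x (all weights equal to 1).
    a0, a1 = weighted_fit(x, spread, [1.0] * len(x))
    # Arbitrary guard (an assumption): keep fitted spreads away from zero.
    floor = 0.1 * sum(spread) / len(spread)
    fitted = [max(a0 + a1 * xi, floor) for xi in x]
    # Assumption: sqrt(|e|) scales like sigma ** 0.5, so fitted ** 4 is
    # proportional to the variance; the weight is its reciprocal.
    return [1.0 / f ** 4 for f in fitted]


def iterated_wls(x, y, tol=1e-3, max_iter=10):
    """OLS, then re-estimate weights and refit until the betas stop moving."""
    beta = weighted_fit(x, y, [1.0] * len(x))  # OLS is WLS with equal weights
    ols_beta = beta
    for _ in range(max_iter):
        w = estimate_weights(x, y, *beta)
        new_beta = weighted_fit(x, y, w)
        done = all(abs(a - b) < tol for a, b in zip(new_beta, beta))
        beta = new_beta
        if done:
            break
    return ols_beta, beta, w


# Hypothetical data: true line y = 2 + 3x, error SD proportional to x.
random.seed(0)
n = 200
x = [1 + 9 * i / (n - 1) for i in range(n)]
y = [2 + 3 * xi + random.gauss(0, 0.5 * xi) for xi in x]

ols_beta, wls_beta, weights = iterated_wls(x, y)
print("OLS (intercept, slope):", ols_beta)
print("WLS (intercept, slope):", wls_beta)
```

Comparing the two printed coefficient pairs is the check described in [5]: if they are close, the first set of weights was good enough and the iteration ends almost immediately.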

If this process makes you somewhat uncomfortable, because the weights are estimated, and because they are contingent on the earlier, incorrect model, another option is to use the Huber-White 'sandwich' estimator. This is consistent even in the presence of heteroscedasticity no matter how severe, and it isn't contingent on the model. It is also potentially less hassle.
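For a sense of what the sandwich estimator computes, here is a minimal sketch for the slope of a one-predictor regression, again in plain Python. It uses the simplest variant, usually called HC0, which is my choice for illustration; the answer does not say which variant to use, and statistical packages typically offer several small-sample corrections. The data and the function name `ols_with_robust_se` are hypothetical.

```python
import math
import random


def ols_with_robust_se(x, y):
    """OLS slope with its classical and HC0 (sandwich) standard errors."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    dx = [xi - xbar for xi in x]
    sxx = sum(d * d for d in dx)
    b1 = sum(d * (yi - ybar) for d, yi in zip(dx, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

    # Classical SE: assumes one common error variance for every observation.
    s2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(s2 / sxx)

    # HC0 sandwich SE: each squared residual stands in for that
    # observation's own error variance.
    se_hc0 = math.sqrt(sum(d * d * e * e for d, e in zip(dx, resid)) / sxx ** 2)
    return b1, se_classical, se_hc0


# Hypothetical data: true slope 3, error SD proportional to x.
random.seed(2)
n = 200
x = [1 + 9 * i / (n - 1) for i in range(n)]
y = [2 + 3 * xi + random.gauss(0, 0.5 * xi) for xi in x]

slope, se_classical, se_hc0 = ols_with_robust_se(x, y)
print(f"slope = {slope:.3f}")
print(f"classical SE = {se_classical:.4f}, HC0 SE = {se_hc0:.4f}")
```

Note that the coefficient estimates are the ordinary OLS ones; only the standard errors change, which is why no weights need to be estimated.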

I demonstrate a simple version of weighted least squares and the use of the sandwich SEs in my answer here: Alternatives to one-way ANOVA for heteroscedastic data.