Solved – Regression: How to deal with this kind of non-constant variance

regression, residuals, variance

This is a residual vs. predictor plot for my regression problem. None of the other residual plots show clear non-constant variance, but this one definitely stands out, and its variance is not monotonic as $x_{42}$ increases.

I've tried variance-stabilizing transformations (square root and log) on $y$ and, as expected, they don't work. What other things can I try in this case?


Edit 02/15/2017:
The problem at hand is a variable selection and OLS problem for a chemical reactor data set. I have more than 50 predictors that could possibly explain my single response $y$. $x_{42}$ is a predictor that gets picked by best subsets, and it represents the effect of catalyst added to the reactor. Initially (in time) no catalyst was added, then some was added, and eventually a full dose of catalyst was added every day. As a result, the "distribution" of $x_{42}$ takes that shape.

$x_{42}$ is not a random variable that should follow any known or unknown statistical distribution, because it is driven by engineering decisions. My education tells me the "distribution" of a predictor doesn't matter in regression, so I had no doubts about whether OLS applies to a data set with a variable like $x_{42}$.


Edit 02/16/2017:
Let me further clarify my objective here. What I wanted to know is:

  1. Does this residual plot show heteroscedasticity?
    I'm inclined to agree with @mdewey that most of the points sit where $x_{42}=0$ or $x_{42}=750$, so the apparent scatter of the residuals is expected to be larger there, and this doesn't necessarily imply that the variance of the residuals is highly non-constant across the range of $x_{42}$. If there are other reliable tests that can help me better determine whether there's heteroscedasticity, please kindly advise (one candidate I could run is sketched after this list).

  2. If the amount of heteroscedasticity in this residual plot is so large that it could throw off my inference (p-value, CI, etc.), what are the remedies?
    As suggested by @whuber, no monotonic transformation on $y$ would cure it in this case, and I fully agree. What other options do I have? Bootstrap? GLM? I can try all of them, but it would be difficult to gauge which method is better, so if you could shed some light on which option is intrinsically more suitable, that would be highly appreciated.
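
For concreteness, here is a minimal sketch of one formal check I could run, the Breusch–Pagan test, assuming Python with statsmodels; `X` (design matrix) and `y` (response) are placeholder names, not my actual variables.

```python
# Minimal sketch of a Breusch-Pagan heteroscedasticity test.
# Assumes Python + statsmodels; `X` and `y` are placeholder names.
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

X_const = sm.add_constant(X)        # add an intercept column
ols_fit = sm.OLS(y, X_const).fit()  # ordinary least squares fit

# The test regresses the squared residuals on the predictors;
# a small p-value points toward non-constant error variance.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, X_const)
print(lm_pvalue, f_pvalue)
```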

[Plot: residuals vs. $x_{42}$]

Best Answer

When you have multiple predictors in your model (as it sounds like you do), then you need to plot the residuals against the predicted values for Y, not against any given predictor. The assumption about homogeneity/homoscedasticity refers to the distribution of the observed values relative to the predicted values (i.e. the residuals). Here's a visual you might find helpful:

[Illustration of homoscedasticity: a simple regression line with equal-variance error densities along it]

The assumption of homoscedasticity is that the variance of the distribution of the observations relative to their predictions (i.e. the regression line) is equal. In other words, the density plots depicted all have the same variance. In the example depicted there, there is only one predictor (making it easy to show the regression on one plane with just two axes). If there were multiple predictors, the fitted surface would cut through $k$-dimensional space for $k-1$ predictors; for 2 predictors, imagine a 3D cloud of points with the plane of best fit cutting through it. If you look at the residuals relative to any one of those predictors, you're potentially looking at them from a weird angle. This can be especially confusing if one of your predictors is itself oddly distributed, as your $x_{42}$ appears to be.

In order to see whether or not you have an issue with the homoscedasticity of your residuals, you need to plot the residuals on the y-axis and the predicted values on the x-axis. In effect, this zooms in on the fitted regression itself (no matter where it lies in our hypothetical $k$-dimensional space) and shows you the residuals relative to it. I'm not sure what software you're using, but many will easily (or even automatically) produce such a plot for you.
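
If it helps, here is a minimal sketch of such a plot, assuming Python with statsmodels and matplotlib; `X` and `y` stand in for whatever your design matrix and response are called.

```python
# Minimal sketch: residuals vs. fitted values.
# Assumes Python + statsmodels + matplotlib; `X` and `y` are placeholders.
import matplotlib.pyplot as plt
import statsmodels.api as sm

fit = sm.OLS(y, sm.add_constant(X)).fit()

plt.scatter(fit.fittedvalues, fit.resid, alpha=0.5)
plt.axhline(0, color="grey", linewidth=1)   # reference line at zero residual
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```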

If you do that and you still see a problem with the variance of your residuals, then you may want to consider WLS regression instead of OLS regression. It will give observations in lower-variance areas more weight in determining the regression coefficients, allowing for the fact that you apparently have better precision there. It also has the handy side effect of reducing the influence of potential outliers in the higher-variance parts of your data.
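
If you go that route, here is a minimal sketch of one way to do it, again assuming Python with statsmodels and placeholder names `X` and `y`. The weights have to be estimated somehow; regressing the absolute OLS residuals on the fitted values is one common heuristic for the variance function, not the only reasonable choice.

```python
# Minimal sketch of weighted least squares (WLS).
# Assumes Python + statsmodels; `X` and `y` are placeholder names.
import numpy as np
import statsmodels.api as sm

X_const = sm.add_constant(X)
ols_fit = sm.OLS(y, X_const).fit()

# Crude variance-function estimate: |OLS residual| ~ fitted value.
abs_resid = np.abs(ols_fit.resid)
scale_fit = sm.OLS(abs_resid, sm.add_constant(ols_fit.fittedvalues)).fit()
est_sd = np.clip(scale_fit.fittedvalues, 1e-8, None)  # keep the scale positive

# Observations with smaller estimated variance get larger weights.
wls_fit = sm.WLS(y, X_const, weights=1.0 / est_sd**2).fit()
print(wls_fit.summary())
```

If the extra spread really is tied to the catalyst levels, estimating the weights from within-group residual variances at each level of $x_{42}$ would be another reasonable way to build the variance function.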
