Multiple Regression – How to Identify the Type of Residual Plot for a Variable

multiple regression, residuals

I am doing a multiple regression analysis, and my goal is to find the best set of independent variables for prediction. I am still getting to know my dataset and the behavior of each variable. I am running a residual analysis (with a lot of help from R), and my question is about how to read the residual plot for one of these variables.

Studentized residual plot

Can I say that the residuals for this variable form a null plot with some outliers? Moreover, how does that help with my goal of finding good variables?


Best Answer

Maybe -- but it does also have some characteristics of the horn-shaped plot you get when a transformation might help. Are these ordinary residuals, or some kind of standardized ones?

The reason I ask is that it's not unusual to see a downward-sloping edge in a residuals-vs-predicted plot; it happens when there is a frequently-attained lower bound (e.g., zero) on the $y$ values. However, if that is the case, that lower edge should have a slope of $-1$ and the slope in the plot is more like $-0.1$. But if the residuals are standardized, that'd explain it.
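The slope argument can be made explicit. Writing $e_i$ for the ordinary residual, $h_{ii}$ for the leverage, and $s$ for the residual standard error (my notation, not the question's), an observation sitting at a lower bound of zero gives

$$y_i = 0 \;\Longrightarrow\; e_i = y_i - \hat y_i = -\hat y_i,$$

so those points fall on a line of slope $-1$ against $\hat y_i$. For internally standardized residuals,

$$r_i = \frac{e_i}{s\sqrt{1-h_{ii}}} \approx -\frac{\hat y_i}{s} \quad \text{when } h_{ii} \text{ is small},$$

so the edge has slope roughly $-1/s$. Under that reading, a slope near $-0.1$ would correspond to $s \approx 10$ in the units of $y$; this is only a rough consistency check, not something I can confirm from the plot.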

You can use Tukey's nonadditivity test to see if a transformation might help. The technique is as follows:

  1. Obtain the predicted values, $\hat y_i$
  2. Compute the variable $N$ with values $N_i = \hat y_i^2$
  3. Fit the same model with $N$ as an additional predictor
  4. If the $t$ statistic for $N$ is significant (this is the test of Tukey's one d.f. for nonadditivity), it suggests that a transformation of the response might help. As a rough estimate, use $y$ raised to the $1-\hat\beta_N$ power, or $\log y$ if this is nearly zero.

Note: This is only for diagnostic purposes. Don't include $N$ in your final model, or in any steps along the way!

Another note: A similar idea is the Atkinson score test, where you use $N_i = \hat y_i\log\hat y_i$.
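The four steps above can be sketched in a few lines. This is a minimal illustration in Python with NumPy rather than R, run on synthetic data; the helper names `ols_fit` and `tukey_one_df` are hypothetical, and the data are generated so that $\log y$ is linear in $x$ (an assumption made purely for the demo, not something taken from the question).

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares fit; returns the coefficients and their t statistics."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]            # residual degrees of freedom
    sigma2 = resid @ resid / dof             # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)    # covariance of the coefficients
    return beta, beta / np.sqrt(np.diag(cov))

def tukey_one_df(X, y):
    """Tukey's one-d.f. check. X must already contain an intercept column.

    Returns (beta_N, t_N) for the added predictor N = yhat**2.
    Diagnostic only: N is not meant to stay in the model.
    """
    beta, _ = ols_fit(X, y)
    yhat = X @ beta                          # step 1: predicted values
    N = yhat ** 2                            # step 2: N_i = yhat_i^2
    X_aug = np.column_stack([X, N])          # step 3: refit with N added
    beta_aug, t_aug = ols_fit(X_aug, y)
    return beta_aug[-1], t_aug[-1]           # step 4: inspect the t statistic

# Synthetic example (assumed data): log(y) is linear in x, so a straight-line
# fit of y on x should show curvature that N picks up.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.1, size=200))
X = np.column_stack([np.ones_like(x), x])

beta_N, t_N = tukey_one_df(X, y)
print(f"beta_N = {beta_N:.4f}, t_N = {t_N:.2f}")
```

A large $|t|$ for $N$ (well beyond about 2, as a rough threshold) is the signal that a transformation of the response is worth trying. For the Atkinson variant, replacing `yhat ** 2` with `yhat * np.log(yhat)` should work as long as every predicted value is positive.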

An additional suggestion is to plot residuals against everything you can think of (time order, predictors in the model, predictors not in the model) to see if there is any kind of apparent pattern in those.

And one more comment: Sometimes, a bad residual plot is good news! A really poor-fitting model often has a nice residual plot but doesn't predict the response worth a darn. When the residual plot starts looking bad, it can mean that you've explained enough of the variation in the response that you can now see the more minor defects in the model.
