Solved – residual vs. QQ-plot in multiple regression

multiple-regression, predictive-models, qq-plot, residuals

I'm working on a Kaggle multiple regression tutorial competition and inspecting plots of my residuals. Following a suggestion, I log-transformed several independent variables as well as the dependent variable. These are the plots I got after fitting a Ridge regression model (sample size is 1500):

[Image: residuals vs. fitted values plot and histogram of residuals]

[Image: normal QQ-plot of residuals]

I'm trying to determine how to interpret these plots and to better understand which is useful for which purpose. I believe the first two plots show that the regression assumptions of linearity, additivity, and homoscedasticity are not violated, although at lower prices my model tends to underestimate. The pattern in the QQ-plot seems to show a distribution with heavier tails than normal, so the variance is higher than expected for a normal distribution, and the normality-of-errors assumption is violated. Is this a correct assessment? And if my goal is prediction rather than inference, should I be concerned about this QQ-plot? I also did not divide the residuals by their standard deviation, because I could not find enough information about when that step is necessary.
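(For the standardization question: dividing residuals by their standard deviation only rescales the y-axis, so it does not change the shape of a residual plot or QQ-plot. A minimal numpy sketch, using simulated data and a plain least-squares fit as a stand-in for the Ridge model; all names and values here are hypothetical:)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1500
X = rng.normal(size=(n, 3))
beta = np.array([1.0, -2.0, 0.5])
# heavy-tailed errors, to mimic a fat-tailed residual distribution
y = X @ beta + rng.standard_t(df=5, size=n)

# ordinary least squares via lstsq (a stand-in for the Ridge fit)
Xd = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ coef

# Standardize: divide by the residual standard deviation
# (ddof = number of fitted parameters). This rescales the values
# but leaves the *shape* of every diagnostic plot unchanged.
std_resid = resid / resid.std(ddof=Xd.shape[1])
```

So for visual diagnostics the standardization is optional; it matters mainly when you want residuals on a comparable scale across models or want to flag points beyond, say, ±2 or ±3.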

Best Answer

Yes. To me, your top plots look pretty good. Your QQ-plot shows clear non-normality / fat tails. The histogram/density plot looks fairly symmetrical; it's just that you have 'too many' residuals that fall far from the fitted line. That means the kurtosis is too large, not the residual variance. The variance is a parameter of the normal distribution that gets estimated from the data, so it cannot be 'too large' relative to the fitted normal.
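The distinction can be illustrated with a small simulation (hypothetical data, not your residuals): rescale a fat-tailed sample to the same standard deviation as a normal sample, and the excess kurtosis still gives it away.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

def excess_kurtosis(x):
    """Sample excess kurtosis: roughly 0 for normal data."""
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

normal = rng.normal(size=n)
heavy = rng.standard_t(df=5, size=n)  # fat-tailed
heavy = heavy / heavy.std()           # rescale to unit variance

# Both samples now have standard deviation 1, yet only the
# heavy-tailed one shows clearly positive excess kurtosis.
```

Kurtosis is scale-invariant, which is exactly why the QQ-plot flags the tails even though the fitted variance looks unremarkable.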

The effect of non-normality is somewhat complex. If you want to make inferences, it can mean your p-values are wrong, but you appear to have a good amount of data, so the central limit theorem may kick in enough that it doesn't matter. If you only care about predicted means, it shouldn't have much impact. But I suspect you will want to know something about the prediction intervals as well as the means. Standard prediction intervals depend much more heavily on the assumption that the conditional distribution is normal than confidence intervals do: the central limit theorem cannot save your prediction intervals no matter how much data you have. You might see whether a suitable fat-tailed distribution (e.g., a low-df t-distribution) fits your residuals well enough to use for forming prediction intervals.
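That last suggestion can be sketched with scipy; the residuals, `y_hat`, and all numbers below are simulated/hypothetical, not from the question's model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# hypothetical regression residuals, heavier-tailed than normal
resid = 0.2 * rng.standard_t(df=4, size=1500)

# fit a t-distribution by maximum likelihood: returns (df, loc, scale)
df_hat, loc_hat, scale_hat = stats.t.fit(resid)

# 95% prediction interval around a hypothetical predicted mean
y_hat = 12.0  # e.g. a predicted log-price
lo, hi = stats.t.ppf([0.025, 0.975], df_hat,
                     loc=y_hat + loc_hat, scale=scale_hat)

# normal-based interval for comparison, built from the residual std
lo_n, hi_n = y_hat - 1.96 * resid.std(), y_hat + 1.96 * resid.std()
```

The fitted `df_hat` itself is informative: a small value (say, under 10) is another sign that the tails are genuinely heavier than normal, while the t-based quantiles give intervals whose tail coverage matches the residuals better than the normal ones.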