Solved – Does it make sense to use residuals as an independent variable

heteroscedasticitymodelingrregressionresiduals

I recently ran a regression of the following form:

mod <- lm(log(y) ~ log(x))

To examine how y scales as a function of x. I then examined the top 20 (super-linear scaling with respect to predicted trendline) and bottom 20 residuals (sub-linear scaling with respect to predicted trendline), and noticed that all these observations shared startling regularity in a third variable, z. (all the positive residuals have very large values of z and all the negative residuals have very small values of z.)

I want to be able to demonstrate that this is a meaningful, statistically significant pattern. My intuition was to run

lm(z ~ resid(mod))

but this strikes me as wrong. Is there a way to capture this pattern using residuals, or is this the wrong way of thinking about it altogether?

Re: Heteroskedastic Standard Errors

Here is a QQ plot:

enter image description here

Residuals vs. Fitted Values

enter image description here

The second graph in particular is somewhat alarming. Should I be worried about heteroskedastic standard errors with a huge dataset (300,000 obs), if I'm using robust SEs

Best Answer

Well, you want to understand how the residuals could depend on z. So the model you should look at would be

lm( resid(mod) ~ z )

but maybe first (show us) the corresponding plot. But, maybe what you see is that the spread of the residuals depend on z not its mean (heteroskedasticity). Then try a model like

lm( Id(resid(mod)^2) ~ z )

(or replace the square with absolute value.) If this turns out in the confirmative, maybe try a more complex model, a gamlss which permits simultaneous estimation of mean and variance functions. You could look through for some examples.

Related Question