I recently ran a regression of the following form:
mod <- lm(log(y) ~ log(x))
To examine how y scales as a function of x. I then examined the top 20 (super-linear scaling with respect to predicted trendline) and bottom 20 residuals (sub-linear scaling with respect to predicted trendline), and noticed that all these observations shared startling regularity in a third variable, z
. (all the positive residuals have very large values of z
and all the negative residuals have very small values of z.
)
I want to be able to demonstrate that this is a meaningful, statistically significant pattern. My intuition was to run
lm(z ~ resid(mod))
but this strikes me as wrong. Is there a way to capture this pattern using residuals, or is this the wrong way of thinking about it altogether?
Re: Heteroskedastic Standard Errors
Here is a QQ plot:
Residuals vs. Fitted Values
The second graph in particular is somewhat alarming. Should I be worried about heteroskedastic standard errors with a huge dataset (300,000 obs), if I'm using robust SEs
Best Answer
Well, you want to understand how the residuals could depend on
z
. So the model you should look at would bebut maybe first (show us) the corresponding plot. But, maybe what you see is that the spread of the residuals depend on
z
not its mean (heteroskedasticity). Then try a model like(or replace the square with absolute value.) If this turns out in the confirmative, maybe try a more complex model, a
gamlss
which permits simultaneous estimation of mean and variance functions. You could look through gamlss for some examples.