I am trying to simulate a dataset that matches empirical data that I have, but am unsure how to estimate the errors in the original data. The empirical data includes heteroscedasticity, but I am not interested in transforming it away, but rather using a linear model with an error term to reproduce simulations of the empirical data.
For example, let's say I have some an empirical dataset and a model:
n=rep(1:100,2)
a=0
b = 1
sigma2 = n^1.3
eps = rnorm(n,mean=0,sd=sqrt(sigma2))
y=a+b*n + eps
mod <- lm(y ~ n)
using plot(n,y)
we get the following.
However, if I try to simulate the data, simulate(mod)
, the heteroscedasticity is removed and not captured by the model.
I can use a generalized least squares model
VMat <- varFixed(~n)
mod2 = gls(y ~ n, weights = VMat)
that provides a better model fit based on AIC, but I don't know how to simulate data using the output.
My question is, how do I create a model that will allow me to simulate data to match the original, empirical data (n and y above). Specifically, I need a way to estimate sigma2, the error, using either using a model?
Best Answer
To simulate data with a varying error variance, you need to specify the data generating process for the error variance. As has been pointed out in the comments, you did that when you generated your original data. If you have real data and want to try this, you just need to identify the function that specifies how the residual variance depends on your covariates. The standard way to do that is to fit your model, check that it is reasonable (other than the heteroscedasticity), and save the residuals. Those residuals become the Y variable of a new model. Below I have done that for your data generating process. (I don't see where you set the random seed, so these won't literally be the same data, but should be similar, and you can reproduce mine exactly by using my seed.)
Note that
R
's ?plot.lm will give you a plot (cf., here) of the square root of the absolute values of the residuals, helpfully overlaid with a lowess fit, which is just what you need. (If you have multiple covariates, you might want to assess this against each covariate separately.) There is the slightest hint of a curve, but this looks like a straight line does a good job of fitting the data. So let's explicitly fit that model:We needn't be concerned that the residual variance seems to be increasing in the scale-location plot for this model as well—that essentially has to happen. There is again the slightest hint of a curve, so we can try to fit a squared term and see if that helps (but it doesn't):
If we're satisfied with this, we can now use this process as an add-on to simulate data.
Note that this process is no more guaranteed to find the true data generating process than any other statistical method. You used a non-linear function to generate the error SDs, and we approximated it with a linear function. If you actually know the true data generating process a-priori (as in this case, because you simulated the original data), you might as well use it. You can decide if the approximation here is good enough for your purposes. We typically don't know the true data generating process, however, and based on Occam's razor, go with the simplest function that adequately fits the data we have given the amount of information available. You can also try splines or fancier approaches if you prefer. The bivariate distributions look reasonably similar to me, but we can see that while the estimated function largely parallels the true function, they do not overlap: