I want to test the performance of a variable selection method in linear regression with normal errors using simulated data:
$${\bf y}= {\bf X}{\bf \beta} + \epsilon,$$
where, as usual, ${\bf y}$ is $n\times 1$, ${\bf\beta}$ is $p\times 1$, $n>p$, $\epsilon_j\stackrel{ind.}{\sim}N(0,\sigma^2)$, $j=1,\dots,n$.
How can I simulate additional superfluous variables? Is there a benchmark method for adding variables, or is it an irrelevant part of the simulation? I was thinking of adding simulated columns, drawn from some arbitrarily chosen distribution, to the design matrix and then checking whether or not the variable selection method detects the artificial additional variables.
Best Answer
One way would be to simulate all $x_1, x_2, \dots, x_p$ together, assign each explanatory variable a coefficient, then simulate the error term $\epsilon$; your dependent variable is then just the sum of $X\beta$ and $\epsilon$. Many statistical packages have functions that let you specify the correlation between the $x$ variables, too. In Stata, for instance, that could be achieved with the `corr2data` command. Perhaps you are not using Stata, but as long as you know the simulation commands in other languages the steps should be the same.
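For readers outside Stata, the same steps can be sketched in Python with NumPy. This is only an illustration under assumed values: the sample size, the correlation matrix `C`, the coefficients `beta`, and `sigma` are all made up for the example. Note also that this draws a random sample, so the sample correlations will only approximate `C`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

n, p = 500, 3

# Assumed correlation matrix for x1, x2, x3 (illustrative values only).
C = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.0],
              [0.2, 0.0, 1.0]])

# Step 1: simulate all explanatory variables together.
X = rng.multivariate_normal(mean=np.zeros(p), cov=C, size=n)

# Step 2: assign each explanatory variable a coefficient (assumed values).
beta = np.array([1.0, -2.0, 0.5])

# Step 3: simulate independent normal errors with standard deviation sigma.
sigma = 1.0
eps = rng.normal(loc=0.0, scale=sigma, size=n)

# Step 4: the dependent variable is the linear predictor plus the error.
y = X @ beta + eps

# Sanity check: ordinary least squares should roughly recover beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

Raising the off-diagonal entries of `C` is one way to see how a method behaves under stronger collinearity, as long as `C` stays positive definite.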
The `corr2data` command gives you many more options for specifying correlations between the variables. You can see what happens to your model under high collinearity between x1 and x2, and you can simulate measurement error, correlation with the error term, and so on. It can also be used to generate a heteroscedastic relationship between one or more of the explanatory variables and the error.
Given the edit of the original question, here is also how to add superfluous variables. For this you would need to specify a correlation matrix before generating the data, for example:
Which can be achieved via
where you then make x3 "superfluous" by simply not including it in the construction of y. It's not completely useless because it is correlated with x1, so through the correlation matrix C you can decide how superfluous x3 actually is. Then generating y in the same way as before
gives the result you wanted.
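As a rough Python/NumPy counterpart to the steps above (not the original Stata code), the sketch below makes x3 superfluous by giving it a zero coefficient while keeping it correlated with x1. The matrix `C`, the coefficients, and the sample size are assumptions chosen for illustration; the correlation is imposed through a Cholesky factor of `C`.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the run is reproducible

n = 1000

# Assumed correlation matrix: x3 is fairly strongly correlated with x1 (0.6).
C = np.array([[1.0, 0.3, 0.6],
              [0.3, 1.0, 0.1],
              [0.6, 0.1, 1.0]])

# Turn independent standard normals into correlated columns: if C = L L^T,
# then the rows of Z @ L.T have covariance (here, correlation) matrix C.
L = np.linalg.cholesky(C)
Z = rng.standard_normal((n, 3))
X = Z @ L.T

# x3 is "superfluous": its true coefficient is zero, so it does not enter y.
beta = np.array([1.0, 2.0, 0.0])
y = X @ beta + rng.normal(0.0, 1.0, size=n)

# How superfluous x3 looks depends on its correlation with x1.
r13 = np.corrcoef(X[:, 0], X[:, 2])[0, 1]

# Full least-squares fit including x3; its estimate should be near zero.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(r13, beta_hat)
```

A variable selection method can then be run on `X` and `y` to see whether it drops the third column; increasing `C[0, 2]` makes that harder, since x3 then acts more like a proxy for x1.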
This is the generic set-up for generating such data, and it should work the same way in any other statistical package. Other packages presumably have additional or different functions you can use, but the steps shown here are a basic way to achieve this.