Solved – Simulation from linear model with additional variables

linear model, regression, simulation

I want to test the performance of a variable selection method in linear regression with normal errors using simulated data:

$${\bf y}= {\bf X}\boldsymbol{\beta} + \boldsymbol{\epsilon},$$

where, as usual, ${\bf y}$ is $n\times 1$, $\boldsymbol{\beta}$ is $p\times 1$, $n>p$, and $\epsilon_j\stackrel{ind.}{\sim}N(0,\sigma^2)$, $j=1,\dots,n$.

How can I simulate additional superfluous variables? Is there a benchmark method for adding such variables, or is this an irrelevant part of the simulation? I was thinking of adding simulated columns, drawn from some arbitrarily chosen distribution, to the design matrix and checking whether the variable selection method detects the artificial variables.

Best Answer

One way would be to simulate all $x_1, x_2, \dots, x_p$ together, assign each explanatory variable a coefficient, then simulate the error term $\epsilon$; your dependent variable is then simply ${\bf X}\boldsymbol{\beta} + \epsilon$. Many statistical packages also have functions that let you specify the correlation between the $x$ variables. In Stata, for instance, this can be achieved with the corr2data command.

Perhaps you are not using Stata, but as long as you know the simulation commands in another language, the steps are the same.

// clear any data in memory and set the number of observations
clear
set obs 1000

// generate the explanatory variables (here we simulate 2 variables from a normal distribution)
gen x1 = rnormal(5,3)
gen x2 = rnormal(9,1)

// generate the error term (here the simple case where the error is N(0,1); for other distributions use the corresponding random-number function)
gen e = rnormal(0,1)

// generate the dependent variable and assign coefficients to the explanatory variables (0.5 for x1 and 0.9 for x2, for instance)
gen y = 0.5*x1 + 0.9*x2 + e

// run the linear regression of y on x1 and x2
reg y x1 x2

The corr2data command gives you many more options for specifying correlations between the variables. You can then see what happens to your model when there is high collinearity between x1 and x2, simulate measurement error, correlations with the error term, and so on. It can also be used to generate a heteroscedastic relationship between one or more of the explanatory variables and the error.
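For instance, here is a minimal sketch of inducing near-collinearity; the 0.95 correlation and the particular means and standard deviations are illustrative choices, not part of the original set-up:

// correlation matrix with a 0.95 correlation between x1 and x2
mat Ccol = (1, 0.95, 0 \ 0.95, 1, 0 \ 0, 0, 1)
corr2data x1 x2 e, clear n(1000) means(5 9 0) sds(3 1 1) corr(Ccol)
gen y = 0.5*x1 + 0.9*x2 + e
// with near-collinear regressors the coefficient standard errors inflate
reg y x1 x2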

Given the edit of the original question, here is also how to add superfluous variables. For this you would need to specify a correlation matrix before generating the data, for example:

    |       x1       x2       x3        e
----+------------------------------------
 x1 |   1.0000
 x2 |   0.3000   1.0000
 x3 |   0.0100  -0.0000   1.0000
  e |   0.0000  -0.0000  -0.0000   1.0000

This can be achieved via

// define the correlation matrix (rows and columns ordered x1, x2, x3, e)
mat C = (1, 0.3, 0.01, 0 \ 0.3, 1, 0, 0 \ 0.01, 0, 1, 0 \ 0, 0, 0, 1)
// generate the data (clear replaces the data from the earlier example)
corr2data x1 x2 x3 e, clear n(1000) means(5 7 13 0) sds(3 1 2 1) corr(C)
// verify the realized correlations
corr

You then make x3 "superfluous" simply by not including it in the construction of y. It is not completely useless, though, because it is correlated with x1; through the correlation matrix C you can decide how superfluous x3 actually is. Generating y in the same way as before

// generate the dependent variable
gen y = 0.5*x1 + 0.9*x2 + e

// run the regression with the useless variable
reg y x1 x2 x3

gives the result you wanted.
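To tie this back to the original goal of testing a variable selection method: as one hedged illustration, Stata's built-in stepwise prefix can stand in for such a method (substitute whatever selection procedure you are actually evaluating; the pr(0.1) removal threshold is an arbitrary choice):

// check whether backward selection drops the superfluous x3
stepwise, pr(0.1): regress y x1 x2 x3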

This is the generic set-up for generating such data, and the same steps should work in any other statistical package. Other packages presumably offer additional or different functions, but the steps shown here are a basic way to achieve this.
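If you want to estimate how often a selection method wrongly retains x3, one option is to wrap the data generation and selection in a program and repeat it with simulate. Below is a minimal sketch, assuming the matrix C from above is still in memory; the program name selsim, the stepwise stand-in, and the reps(500) setting are all illustrative choices. Note it uses drawnorm rather than corr2data: drawnorm draws random samples from a multivariate normal with C as the population correlation, whereas corr2data reproduces the specified correlations exactly in every sample, which makes it unsuitable for repeated simulation.

capture program drop selsim
program define selsim, rclass
    // regenerate the data from scratch on each replication
    drop _all
    set obs 1000
    drawnorm x1 x2 x3 e, means(5 7 13 0) sds(3 1 2 1) corr(C)
    gen y = 0.5*x1 + 0.9*x2 + e
    // stepwise stands in for the selection method under study
    quietly stepwise, pr(0.1): regress y x1 x2 x3
    // record whether the superfluous x3 survived selection
    capture display _b[x3]
    return scalar kept3 = (_rc == 0)
end

simulate kept3 = r(kept3), reps(500): selsim
// share of replications in which x3 was retained
summarize kept3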
