Solved – Simulation from linear model with additional variables

linear modelregressionsimulation

I want to test the performance of a variable selection method in linear regression with normal errors using simulated data:

$${\bf y}= {\bf X}{\bf \beta} + \epsilon,$$

where, as usual, ${\bf y}$ is $n\times 1$, ${\bf\beta}$ is $p\times 1$, $n>p$, $\epsilon_j\stackrel{ind.}{\sim}N(0,\sigma^2)$, $j=1,\dots,n$.

How can I simulate additional superfluous variables? Is there a benchmark method for adding variables or is it an irrelevant part of the simulation? I was thinking of adding simulated columns, from some arbitrarily chosen distribution, in the design matrix and check whether or not the variable selection method detects the artificial additional variables.

Best Answer

One way would be to simulate all $x_1, x_2, ..., x_p$ together, assign each explanatory variable a coefficient, then simulate the error term $\epsilon$, and finally your dependent variable would just be the sum of the $X'\beta$ and $\epsilon$. Many statistical packages have functions where you can specify the correlation between the $x$ variables, too. In Stata, for instance, that could be achieved with the corr2data command.

Perhaps you are not using Stata but as long as you know the simulation commands in other languages the steps should be the same.

// set a certain number of observations
set obs 1000

// generate the explanatory variables (here we simulate 2 variables from a normal distribution)
gen x1 = rnormal(5,3)
gen x2 = rnormal(9,1)

// generate the error term (here is the simple case where the error is distributed as N(0,1) - for other distributions use the according sampling technique)
gen e = rnormal(0,1)

// generate the dependent variable and assign coefficients to the explanatory variables (0.5 for x1 and 0.9 for x2, for instance)
gen y = 0.5*x1 + 0.9*x2 + e

// run the linear regression of y on x1 and x2
reg y x1 x2

The corr2data command gives you much more options to specify correlations between the variables. So you can see what happens to your model if you have high collinearity between x1 and x2, you can simulate measurment error, correlations with the error, etc. It can also be used to generate a heteroscedastic relationship between one or more of the explanatory variables with the error.

Given the edit of the original question, here is also how to add superfluous variables. For this you would need to specify a correlation matrix before generating the data, for example:

    |       x1       x2       x3        e
----+------------------------------------
 x1 |   1.0000
 x2 |   0.3000   1.0000
 x3 |   0.0100  -0.0000   1.0000
  e |   0.0000  -0.0000  -0.0000   1.0000

Which can be achieved via

mat C = (1, 0.3, 0.01, 0 \ 0.3, 1, 0, 0 \ 0.01, 0, 1 , 0 \ 0, 0, 0, 1)
corr2data x1 x2 x3 e, n(1000) means(5 7 13 0) sds(3, 1, 2, 1) corr(C)
corr

where you then make x3 "superfluous" by simply not including it in the construction of y. It's not completely useless because it is correlated with x1, so through the correlation matrix C you can decide how superfluous x3 actually is. Then generating y in the same way as before

// generate the dependent variable
gen y = 0.5*x1 + 0.9*x2 + e

// run the regression with the useless variable
reg y x1 x2 x3

gives the result you wanted.

But this would be the generic set-up to generate such data which should work in any other statistical package in the same way. Presumably other packages have additional/different functions that you can use but the steps done here are a basic way to achieve this.

Related Solutions

Solved – Linear regression simulation

Here is a suggestion:

## Parameters
n     <- 20                         # number of samples
rho   <- 0.5                        # correlation between predictors
beta  <- c(0.8,3,1.5,0,0,2,0,0,0)   # regression coefficients
sigma <- 0.4                        # std error of noise (assumed Gaussian)
## Covariance matrix
d     <- length(beta)
Q     <- toeplitz(c(1, rep(rho, d - 2)))
## Design matrix
X <- cbind(1, matrix(rnorm((d-1) * n), ncol = d-1) %*% chol(Q))
## Dependent variable
Y <- X %*% beta + rnorm(n, 0, sigma)

Solved – How to simulate an unreplicated factorial design

The key to generating random data like yours in R is to use ?rnorm. You may also want to set the random seed to a fixed value, via ?set.seed, so that your simulation can be exactly replicated in the future. For convenience, you may prefer to use ?expand.grid to create your factor combinations, although you can do it manually as well (and could conceivably prefer that for the clarity it affords). Once you have set up your simulation, you will want to run it many times to get a sense of its long run behavior; this can be done using ?replicate, I believe, but I usually just nest it in a for loop because I specialize in writing comically inefficient code.

Here is an example:

> set.seed(1)
> A <- c(0,1)
> B <- c(0,1)
> C <- c(0,1)
> D <- c(0,1)
> Xmat <- expand.grid(A=A, B=B, C=C, D=D)
> Xmat
   A B C D
1  0 0 0 0
2  1 0 0 0
3  0 1 0 0
4  1 1 0 0
5  0 0 1 0
6  1 0 1 0
7  0 1 1 0
8  1 1 1 0
9  0 0 0 1
10 1 0 0 1
11 0 1 0 1
12 1 1 0 1
13 0 0 1 1
14 1 0 1 1
15 0 1 1 1
16 1 1 1 1
> y <- 3 + 1.5*Xmat$A + 1.5*Xmat$B + 1.5*Xmat$C + 1.5*Xmat$B*Xmat$C + 
+      rnorm(n=16, mean=0, sd=1)
> y
 [1] 2.373546 4.683643 3.664371 7.595281 4.829508 5.179532 7.987429 9.738325
 [9] 3.575781 4.194612 6.011781 6.389843 3.878759 3.785300 8.624931 8.955066

Best Answer

Related Solutions

Solved – Linear regression simulation

Solved – How to simulate an unreplicated factorial design

Related Question