Solved – Multiple regression with no origin and mix of directly entered and stepwise entered variables using R

r, stepwise regression

I am running a regression equation in which I want to enter 12 independent variables directly, then stepwise enter 7 more independent variables, and have no origin (intercept).

  • DV is shfl.

  • I want to enter the following 12 independent dummy variables directly:
    ajan
    bfeb
    cmar
    dapr
    emay
    fjun
    gjul
    haug
    isep
    joct
    knov
    ldec

  • And then I want to enter the following in a stepwise fashion:
    slag6
    slag7
    slag8
    slag9
    slag10
    slag11
    slag12

  • And finally, I want there to be no origin.

I've done simple regression, but nothing quite like this, where the primary variables are entered directly and several more are step-entered.

  • How can such a model be specified using R?

Best Answer

I think you can set up your base model (the one with your 12 IVs) and then use add1() with the remaining predictors. So, say you have a model mod1 defined as mod1 <- lm(y ~ 0+x1+x2+x3) (the 0+ means no intercept); then

add1(mod1, ~ .+x4+x5+x6, test="F")

will add and test one predictor after the other on top of the base model.
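As a self-contained sketch on simulated data (the variable names x1 through x6 and the data-generating model are made up for illustration):

```r
set.seed(42)
d <- data.frame(replicate(6, rnorm(50)))
names(d) <- paste0("x", 1:6)
d$y <- 2*d$x1 - d$x2 + 0.5*d$x4 + rnorm(50)

# base model with no intercept (the 0 + suppresses the origin)
mod1 <- lm(y ~ 0 + x1 + x2 + x3, data = d)

# F-test each remaining candidate predictor, added singly to the base model
add1(mod1, scope = ~ . + x4 + x5 + x6, test = "F")
```

Each row of the resulting table reports the fit when exactly one candidate is added on top of the base model, which is what makes add1() convenient for a manual forward step.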

More generally, if you know in advance that a set of variables should be included in the model (because of prior knowledge, for example), you can use step() or stepAIC() (in the MASS package) and look at the scope= argument.

Here is an illustration, where we specify a priori the functional relationship between the outcome, $y$, and the predictors, $x_1, x_2, \dots, x_{10}$. We want the model to include the first three predictors, but let the selection of other predictors be done by stepwise regression:

set.seed(101)
X <- replicate(10, rnorm(100))
colnames(X) <- paste("x", 1:10, sep="")
y <- 1.1*X[,1] + 0.8*X[,2] - 0.7*X[,5] + 1.4*X[,6] + rnorm(100)
df <- data.frame(y=y, X)

# say this is one of the base models we might consider
fm0 <- lm(y ~ 0+x1+x2+x3+x4, data=df)

# build a semi-constrained stepwise regression
fm.step <- step(fm0, scope=list(upper = ~ 0+x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, 
                                lower = ~ 0+x1+x2+x3), trace=FALSE)
summary(fm.step)

The results are shown below:

Coefficients:
   Estimate Std. Error t value Pr(>|t|)    
x1   1.0831     0.1095   9.888 2.87e-16 ***
x2   0.6704     0.1026   6.533 3.17e-09 ***
x3  -0.1844     0.1183  -1.558    0.123    
x6   1.6024     0.1142  14.035  < 2e-16 ***
x5  -0.6528     0.1029  -6.342 7.63e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.004 on 95 degrees of freedom
Multiple R-squared: 0.814,  Adjusted R-squared: 0.8042 
F-statistic: 83.17 on 5 and 95 DF,  p-value: < 2.2e-16 

You can see that $x_3$ has been retained in the model even though it is non-significant (the usual caveats about univariate tests in a multiple regression setting and about model selection apply here; at least its relationship with $y$ was not specified in the data-generating model).
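Translated to the question's setup, the call might look like the sketch below. The data frame name mydata is an assumption; the column names are taken from the question. The lower bound of the scope forces the 12 month dummies to stay in, the upper bound lets step() choose among slag6 through slag12, and 0+ removes the origin throughout.

```r
# assuming the data are in a data frame called mydata with the columns
# named in the question (shfl, the 12 month dummies, and slag6..slag12)
base <- lm(shfl ~ 0 + ajan + bfeb + cmar + dapr + emay + fjun +
                  gjul + haug + isep + joct + knov + ldec,
           data = mydata)

upper <- ~ 0 + ajan + bfeb + cmar + dapr + emay + fjun + gjul + haug +
           isep + joct + knov + ldec +
           slag6 + slag7 + slag8 + slag9 + slag10 + slag11 + slag12

fit <- step(base, scope = list(lower = formula(base), upper = upper),
            trace = FALSE)
summary(fit)
```

With lower = formula(base), the forced-in dummies can never be dropped, so only the slag predictors are subject to stepwise selection.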