Solved – Multiple regression with no origin and mix of directly entered and stepwise entered variables using R

r, stepwise regression

I am running a regression equation in which I want to enter 12 independent variables directly, then stepwise enter 7 more independent variables, and have no origin (intercept).

  • DV is shfl.

  • I want to enter the following 12 independent dummy variables directly:
    ajan
    bfeb
    cmar
    dapr
    emay
    fjun
    gjul
    haug
    isep
    joct
    knov
    ldec

  • And then I want to enter the following in a stepwise fashion:
    slag6
    slag7
    slag8
    slag9
    slag10
    slag11
    slag12

  • And finally, I want there to be no origin.

I've done simple regression, but nothing quite like this, where the primary variables are entered directly and several more are step-entered.

  • How can such a model be specified using R?

Best Answer

I think you can set up your base model (the one with your 12 IVs) and then use add1() with the remaining predictors. So, say you have a model mod1 defined as mod1 <- lm(y ~ 0+x1+x2+x3) (the 0+ means no intercept); then

add1(mod1, ~ .+x4+x5+x6, test="F")

will add and test one predictor after the other on top of the base model.
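As a self-contained sketch on simulated data (the variable names x1 through x6 and the data-generating model are made up for illustration):

```r
set.seed(42)
d <- data.frame(replicate(6, rnorm(50)))
names(d) <- paste0("x", 1:6)
d$y <- 2*d$x1 - d$x2 + 0.5*d$x4 + rnorm(50)

# base model with no intercept (the 0 + suppresses the origin)
mod1 <- lm(y ~ 0 + x1 + x2 + x3, data = d)

# F-test each remaining candidate predictor, added singly to the base model
add1(mod1, scope = ~ . + x4 + x5 + x6, test = "F")
```

Each row of the resulting table reports the fit when exactly one candidate is added on top of the base model, which is what makes add1() convenient for a manual forward step.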

More generally, if you know in advance that a set of variables should be included in the model (because of prior knowledge, for example), you can use step() or stepAIC() (in the MASS package) and look at the scope= argument.

Here is an illustration, where we specify a priori the functional relationship between the outcome, $y$, and the predictors, $x_1, x_2, \dots, x_{10}$. We want the model to include the first three predictors, but let the selection of other predictors be done by stepwise regression:

set.seed(101)
X <- replicate(10, rnorm(100))
colnames(X) <- paste("x", 1:10, sep="")
y <- 1.1*X[,1] + 0.8*X[,2] - 0.7*X[,5] + 1.4*X[,6] + rnorm(100)
df <- data.frame(y=y, X)

# say this is one of the base models we might consider
fm0 <- lm(y ~ 0+x1+x2+x3+x4, data=df)

# build a semi-constrained stepwise regression
fm.step <- step(fm0, scope=list(upper = ~ 0+x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, 
                                lower = ~ 0+x1+x2+x3), trace=FALSE)
summary(fm.step)

The results are shown below:

Coefficients:
   Estimate Std. Error t value Pr(>|t|)    
x1   1.0831     0.1095   9.888 2.87e-16 ***
x2   0.6704     0.1026   6.533 3.17e-09 ***
x3  -0.1844     0.1183  -1.558    0.123    
x6   1.6024     0.1142  14.035  < 2e-16 ***
x5  -0.6528     0.1029  -6.342 7.63e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.004 on 95 degrees of freedom
Multiple R-squared: 0.814,  Adjusted R-squared: 0.8042 
F-statistic: 83.17 on 5 and 95 DF,  p-value: < 2.2e-16 

You can see that $x_3$ has been retained in the model even though it is non-significant (the usual caveats about univariate tests in a multiple regression setting and about model selection apply here; at least its relationship with $y$ was not specified in the data-generating model).
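Translated to the question's setup, the call might look like the sketch below. The data frame name mydata is an assumption; the column names are taken from the question. The lower bound of the scope forces the 12 month dummies to stay in, the upper bound lets step() choose among slag6 through slag12, and 0+ removes the origin throughout.

```r
# assuming the data are in a data frame called mydata with the columns
# named in the question (shfl, the 12 month dummies, and slag6..slag12)
base <- lm(shfl ~ 0 + ajan + bfeb + cmar + dapr + emay + fjun +
                  gjul + haug + isep + joct + knov + ldec,
           data = mydata)

upper <- ~ 0 + ajan + bfeb + cmar + dapr + emay + fjun + gjul + haug +
           isep + joct + knov + ldec +
           slag6 + slag7 + slag8 + slag9 + slag10 + slag11 + slag12

fit <- step(base, scope = list(lower = formula(base), upper = upper),
            trace = FALSE)
summary(fit)
```

With lower = formula(base), the forced-in dummies can never be dropped, so only the slag predictors are subject to stepwise selection.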