You can definitely do that: introduce your categorical variable as a factor. If you are using R, the following code would work:
new_categ <- factor(categ, labels = c(0:2))
Then you can interact the new categorical variable with the other independent variables. You can also find examples centered on your problem in Modern Applied Statistics with S-PLUS by Venables and Ripley. Even if you are not willing to use R, its regression examples are still helpful for figuring out how to solve your problem.
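For instance, a minimal sketch with made-up data (the variable names `categ`, `x`, and `y` are hypothetical, not from your data):

```r
# Hypothetical three-level categorical, a continuous covariate, and an outcome
set.seed(42)
categ <- sample(c("low", "mid", "high"), 50, replace = TRUE)
x     <- rnorm(50)
y     <- rnorm(50)

new_categ <- factor(categ)    # R builds the dummy coding for you
fit <- lm(y ~ new_categ * x)  # main effects plus the interaction with x
summary(fit)
```

With three levels, R creates two dummies (the first level is the reference), so the fit has six coefficients: intercept, two level dummies, the slope for x, and two interaction terms.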
If you use the proper multiplicative notation, the model coefficients have to be interpreted relative to the intercept term. Assume WLOG that only color and pH are in the model (using the example you've provided). If "color==red" is the reference group, there is technically only one dummy in the model: 1 if color is white, 0 otherwise.
Then, fitting the pH interaction, the "colorwhite" parameter is interpreted as the expected difference in the outcome comparing white to red at a pH of exactly 0. The pH parameter is interpreted as the expected difference in the outcome comparing groups differing by one unit in pH within the red group. Lastly, the "colorwhite:pH" parameter is a difference in differences, i.e. the incremental change in the pH slope comparing white to red.
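Since a pH of exactly 0 is rarely meaningful, centering the covariate makes the "colorwhite" coefficient refer to the comparison at the average pH instead. A sketch with simulated data (the values of `color`, `pH`, and `y` are made up for illustration):

```r
set.seed(1)
color <- factor(sample(c("red", "white"), 100, replace = TRUE))
pH    <- rnorm(100, mean = 3.3, sd = 0.2)
y     <- rnorm(100)

fit  <- lm(y ~ color * pH)                # "colorwhite" = white - red at pH 0
fitc <- lm(y ~ color * I(pH - mean(pH)))  # "colorwhite" = white - red at mean pH
coef(fit)[4]                              # interaction slope, unchanged by centering
```

Centering only reparameterizes the intercept and the "colorwhite" main effect; the pH slope and the interaction coefficient are identical in both fits.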
I think you should rewrite your formula to replace ":" with "*": in R, a * b expands to a + b + a:b, i.e. both main effects plus the interaction, whereas a:b alone fits only the interaction.
> set.seed(1)
> a <- sample(letters[1:3], 100, replace=TRUE)
> b <- sample(LETTERS[1:3], 100, replace=TRUE)
> y <- rnorm(100)
> lm(y ~ a * b)
Call:
lm(formula = y ~ a * b)
Coefficients:
(Intercept)           ab           ac           bB           bC        ab:bB
     0.1684      -0.3894      -0.2614      -0.2807      -0.3981       0.8720
      ac:bB        ab:bC        ac:bC
     0.2099       0.6215       0.4547
> lm(y ~ a : b) ## wrong
Call:
lm(formula = y ~ a:b)
Coefficients:
(Intercept)       aa:bA       ab:bA       ac:bA       aa:bB       ab:bB
   -0.03642     0.20484    -0.18456    -0.05654    -0.07587     0.40675
      ac:bB       aa:bC       ab:bC       ac:bC
   -0.12738    -0.19331     0.03883          NA
> tapply(y, interaction(a, b), mean)
         a.A          b.A          c.A          a.B          b.B          c.B
 0.168415779 -0.220978904 -0.092958625 -0.112286696  0.370329364 -0.163796519
         a.C          b.C          c.C
-0.229732632  0.002406992 -0.036419652
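Those nine cell means are exactly what the saturated y ~ a * b fit reproduces. A sketch re-running the simulation above (the exact numbers may differ across R versions because the sampling RNG changed in R 3.6, but the equality holds either way):

```r
set.seed(1)
a <- sample(letters[1:3], 100, replace = TRUE)
b <- sample(LETTERS[1:3], 100, replace = TRUE)
y <- rnorm(100)

fit  <- lm(y ~ a * b)
grid <- expand.grid(a = letters[1:3], b = LETTERS[1:3])
# Fitted values for the 9 cells equal tapply(y, interaction(a, b), mean)
cbind(grid, fitted = predict(fit, grid))
```

This is why the a * b parameterization is just a re-expression of the cell means, while a:b without main effects produces a redundant, rank-deficient coding (hence the NA above).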
Best Answer
If by dummy variables you mean the multiple binary variables that make up one categorical predictor, all of them need to be in the model for any one of them to be meaningful. In stepwise regression they are either all in or all out, not piecemeal. Are you doing this by hand or something? All stats packages I'm familiar with treat multilevel categoricals properly in this respect and don't consider the dummy variables independently during model specification.
Again, you can't include interactions with some dummy variables of a single categorical predictor but not others: all in or all out. The test of whether the interaction is needed is a comparison between a model without any of the interaction dummies and a model with all of them. If the interaction is significant, keep all of it. Just be aware that the interpretation of the "main effects" changes drastically once interactions are included.
If doing backwards stepwise regression, start with the interaction terms included.
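The comparison described above, all interaction dummies in versus all out, is a single F-test in R via anova() on nested fits. A sketch with simulated data (`group`, `x`, and `y` are made-up names):

```r
set.seed(2)
group <- factor(sample(c("a", "b", "c"), 120, replace = TRUE))
x     <- rnorm(120)
y     <- rnorm(120)

fit0 <- lm(y ~ group + x)  # main effects only
fit1 <- lm(y ~ group * x)  # adds group:x, i.e. both interaction dummies at once
anova(fit0, fit1)          # one F-test on 2 df for the whole interaction
```

Because the two interaction dummies enter and leave together, the test has 2 degrees of freedom; looking at their individual t-tests instead would be exactly the piecemeal mistake warned against above.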