Solved – Stepwise regression and variable selection with categorical variables in R

categorical data, multiple regression, predictive-models, stepwise regression

I am new to statistics, and especially to stepwise regression with categorical variables. I have 4 categorical variables, each with a different number of levels (5 levels, 12 levels, 7 levels, and 78 levels).
I used the following to do stepwise selection in R:

null.lm <- lm(log(bid)~1, data=data)
full.lm <- lm(log(y)~log(x) + program + month+ region + code , data=data)
step(null.lm, scope=list(lowr=null.lm, upper=full.lm), directiom="both")

The resulting stepwise model produced the following output:

codeStructure             month02                  month03                  month04                  month05  
 0.150322                 0.103917                -0.065815                 0.007522                -0.004914

Then I used

predict(step(null.lm, scope=list(lowr=null.lm, upper=full.lm), directiom="both"))

to find the predicted values.

My question is: was I supposed to create a dummy variable for each level? I mean, for the month categorical variable, create 12 columns (1/0) and then use these in the stepwise procedure?
I used stepwise selection because, when fitting the linear model, not all p-values were significant, so I thought of doing variable selection, but I am not sure whether what I did is correct.

If I had to apply the step output by hand: if the observation belongs to month02, do I multiply its coefficient by 1, and otherwise by 0?

Best Answer

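To answer your first question: no, you were not supposed to create dummy variables for each level. R does that automatically for factor variables in regression functions such as lm(). In the output, each dummy is named by appending the level to the variable name: 'month' and '02' give the dummy variable month02, and so on. With R's default treatment coding, the first level of each factor (here presumably month 01) gets no dummy of its own; it is the reference level and is absorbed into the intercept.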
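To make the automatic dummy coding visible, here is a minimal sketch on made-up data (the data frame d and its three month levels are assumptions for illustration, not the asker's data). model.matrix() shows the design matrix that lm() would build from a factor:

```r
# Hypothetical toy data: 'month' is a factor with three levels
d <- data.frame(month = factor(c("01", "02", "03", "02", "01", "03")))

# model.matrix() builds the same kind of design matrix lm() uses internally
X <- model.matrix(~ month, data = d)

# One intercept column plus one 0/1 column per non-reference level
print(colnames(X))
print(X)
```

There is no month01 column: rows belonging to the reference level have 0 in every month dummy.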

If you were to apply the model and generate predictions by hand, then yes: you multiply the coefficient by 1 when the observation belongs to that category, and by 0 otherwise. Loosely put, (month == "02") is the same as (month02 == 1). However, it looks like you've used the predict() function, which applies the model to the data automatically (to the data used for fitting, unless you pass newdata).
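As a rough check of the "multiply by 1 or 0" rule, the sketch below fits a small model on simulated data (all values and the names d, fit, and new_obs are made up for illustration) and compares a hand-computed prediction with predict():

```r
# Simulated data, assumed for illustration only
set.seed(1)
d <- data.frame(
  month = factor(rep(c("01", "02", "03"), each = 10)),
  x     = runif(30, 1, 10)
)
d$y <- exp(0.5 * log(d$x) + 0.1 * (d$month == "02") + rnorm(30, sd = 0.1))

fit <- lm(log(y) ~ log(x) + month, data = d)
b   <- coef(fit)

# A new observation with x = 4 that belongs to month 02
new_obs <- data.frame(x = 4, month = factor("02", levels = levels(d$month)))

# By hand: the month02 coefficient is multiplied by 1, month03 by 0
by_hand <- b["(Intercept)"] + b["log(x)"] * log(4) +
  b["month02"] * 1 + b["month03"] * 0

# predict() does the same bookkeeping automatically
auto <- predict(fit, newdata = new_obs)

print(all.equal(unname(by_hand), unname(auto)))
```

Because the response is log(y), both values are on the log scale; exp() would be needed to get back to the original units.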

PS: there's a typo in direction= in the code you posted (it is written directiom=); likewise, lowr= should be lower=.
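For reference, a corrected version of the call might look like the sketch below. It runs on simulated data (the data frame dat and its values are assumptions), it assumes the null and full models share the same response log(y), and it leaves out the 78-level code variable for brevity:

```r
# Simulated stand-in for the asker's data (assumed, not the real data)
set.seed(42)
n <- 200
dat <- data.frame(
  x       = runif(n, 1, 100),
  program = factor(sample(paste0("p", 1:5), n, replace = TRUE)),
  month   = factor(sample(sprintf("%02d", 1:12), n, replace = TRUE)),
  region  = factor(sample(paste0("r", 1:7), n, replace = TRUE))
)
dat$y <- exp(1 + 0.5 * log(dat$x) + 0.3 * (dat$program == "p2") +
               rnorm(n, sd = 0.2))

null.lm <- lm(log(y) ~ 1, data = dat)
full.lm <- lm(log(y) ~ log(x) + program + month + region, data = dat)

# Note the spelling of 'lower' and 'direction'
sel <- step(null.lm,
            scope = list(lower = formula(null.lm), upper = formula(full.lm)),
            direction = "both", trace = 0)

# Store the selected model once and reuse it for predictions
fitted_log <- predict(sel)
print(formula(sel))
```

step() adds or drops a factor as a whole term, so all dummy columns of a variable such as month enter or leave the model together. Saving the result in sel also avoids rerunning the selection inside predict().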