Solved – Stepwise regression and variable selection with categorical variables in R

I am new to statistics, and especially to stepwise regression with categorical variables. I have 4 categorical variables, each with a different number of levels (5, 12, 7, and 78 levels).
I used the following to do stepwise selection in R:

null.lm <- lm(log(bid)~1, data=data)
full.lm <- lm(log(y)~log(x) + program + month+ region + code , data=data)
step(null.lm, scope=list(lowr=null.lm, upper=full.lm), directiom="both")

The resulting stepwise model gave the following output:

codeStructure       month02       month03       month04       month05
     0.150322      0.103917     -0.065815      0.007522     -0.004914

Then I used

predict(step(null.lm, scope=list(lowr=null.lm, upper=full.lm), directiom="both"))

to find the predicted values.

My question is: was I supposed to create a dummy variable for each level? I mean, for the month categorical variable, create 12 columns (1, 0) and then use these in the stepwise procedure?
I used stepwise because, when fitting the linear model, not all p-values were significant, so I thought of doing variable selection, but I am not sure if what I did is correct.

If I had to use the step output by hand: when an observation belongs to month02, do I multiply that coefficient by 1, and otherwise by 0?

Best Answer

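To answer your first question: no, you were not supposed to create dummy variables for each level. R does that automatically in certain regression functions, including lm(), whenever a predictor is a factor (or character) variable. If you look at the output, each coefficient name is the variable name with the level appended, for example 'month' and '02' giving you a dummy variable month02, and so on. With R's default treatment coding, one level of each factor (presumably month01 here) is the reference level absorbed into the intercept, which is why it has no coefficient of its own.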
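To see the automatic dummy coding directly, here is a minimal sketch using model.matrix(), which builds the design matrix that lm() works from. The three-level month factor is made-up toy data, not the data from the question.

```r
# Toy data (hypothetical): a factor with three levels
month <- factor(c("01", "02", "03", "02", "01", "03"))

# model.matrix() shows the 0/1 columns R creates for a factor.
# With the default treatment contrasts, level "01" is the reference
# level and gets no column of its own.
X <- model.matrix(~ month)

print(colnames(X))  # "(Intercept)" "month02" "month03"
print(X)
```

Each row has a 1 in the column matching its level and 0 elsewhere; rows at the reference level have 0 in every dummy column.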

If you were to use the model to generate predictions by hand, then yes: you multiply the coefficient by 1 when the observation belongs to that particular category, and by 0 otherwise. Loosely put, (month == "02") is the same as (month02 == 1). However, it looks like you've used the predict() function, which applies the model to the data automatically (by default, the data the model was fitted on).
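The equivalence between the by-hand calculation and predict() can be checked with a small sketch. The data frame d, its columns, and the simulated effect sizes are all assumptions made up for illustration.

```r
# Simulated toy data (hypothetical names and values)
set.seed(1)
d <- data.frame(
  x     = runif(60, 1, 10),
  month = factor(rep(c("01", "02", "03"), times = 20))
)
d$y <- exp(0.5 * log(d$x) + 0.1 * (d$month == "02") + rnorm(60, sd = 0.1))

fit <- lm(log(y) ~ log(x) + month, data = d)
b <- coef(fit)

# By hand: each dummy coefficient is multiplied by a 0/1 indicator
by_hand <- b["(Intercept)"] + b["log(x)"] * log(d$x) +
  b["month02"] * (d$month == "02") +
  b["month03"] * (d$month == "03")

# predict() performs the same calculation on the fitting data
print(all.equal(unname(by_hand), unname(predict(fit))))
```

Because the response in the model is log(y), both sets of values are on the log scale.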

PS: there's a typo in the code you posted: directiom= should be direction= (and lowr= in the scope list should be lower=).
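For reference, a sketch of the call with both argument names spelled correctly might look like the following. It runs on simulated data; the data frame toy, its columns, and the effect sizes are assumptions for illustration only, not the original data.

```r
# Simulated toy data (hypothetical)
set.seed(42)
n <- 120
toy <- data.frame(
  x       = runif(n, 1, 10),
  program = factor(rep(c("A", "B", "C"), each = 40)),
  month   = factor(rep(sprintf("%02d", 1:12), length.out = n))
)
toy$y <- exp(1 + 0.5 * log(toy$x) + 0.3 * (toy$program == "B") +
             rnorm(n, sd = 0.2))

null.lm <- lm(log(y) ~ 1, data = toy)
full.lm <- lm(log(y) ~ log(x) + program + month, data = toy)

# Note the spelling of lower= and direction=
sel <- step(null.lm,
            scope = list(lower = formula(null.lm), upper = formula(full.lm)),
            direction = "both", trace = 0)

print(formula(sel))
pred <- predict(sel)  # fitted values on the log scale
```

Storing the result of step() in sel means the selection runs only once and the chosen model can be reused for predict(), rather than repeating the search inside the predict() call.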
