I am new to statistics, and especially to stepwise regression with categorical variables. I have 4 categorical variables, each with a different number of levels (5, 12, 7, and 78 levels).
I used the following to do stepwise selection in R:
null.lm <- lm(log(bid)~1, data=data)
full.lm <- lm(log(y)~log(x) + program + month+ region + code , data=data)
step(null.lm, scope=list(lowr=null.lm, upper=full.lm), directiom="both")
The resulting stepwise model produced the following output:
codeStructure   month02    month03    month04    month05
     0.150322  0.103917  -0.065815   0.007522  -0.004914
Then I used
predict(step(null.lm, scope=list(lowr=null.lm, upper=full.lm), directiom="both"))
to find the predicted values.
My question is: was I supposed to create a dummy variable for each level? I mean, for the month categorical variable, should I create 12 columns (1/0) and then use these in the stepwise procedure?
I used stepwise because, when fitting the linear model, not all p-values were significant, so I thought of doing variable selection, but I am not sure if what I did is correct.
If I had to apply the step outcome by hand: if the observation belongs to month02, do I multiply its coefficient by 1, and otherwise by 0?
Best Answer
To answer your first question: no, you were not supposed to create dummy variables for each level. R does that automatically for certain regression functions, including lm(). If you look at the output, R has appended the level to the variable name: for example, 'month' and '02' give you the dummy variable month02, and so on.
If you were to use the model to generate forecasts by hand, then yes, you multiply the coefficient by 1 when the observation belongs to that particular category, and by 0 otherwise. Loosely put, (month==02) is the same as (month02==1). However, it looks like you've used the predict() function, which applies the model to (some) data automatically.
PS: there's a typo in direction= in the code you posted (it is spelled directiom=), and lowr= should be lower=.
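To make the dummy coding and the "multiply by 1 or 0" rule concrete, here is a minimal sketch on simulated data. The data frame dat, its columns, and the simulated effects are assumptions made up for illustration, not the data from the question; only lm(), model.matrix(), predict(), and step() are the functions discussed above. The scope is given as formulas, which is the form documented for step().

```r
# Simulated stand-in data; `dat` and its columns are hypothetical.
set.seed(1)
n <- 120
dat <- data.frame(
  month = factor(sprintf("%02d", rep(1:12, length.out = n))),
  x     = runif(n, 1, 10)
)
month_effect <- seq(-0.3, 0.3, length.out = 12)
dat$bid <- exp(1 + 0.5 * log(dat$x) +
               month_effect[as.integer(dat$month)] +
               rnorm(n, sd = 0.1))

# lm() expands the factor into 0/1 dummy columns on its own.
fit <- lm(log(bid) ~ log(x) + month, data = dat)
X <- model.matrix(fit)
colnames(X)
# 12 levels give 11 dummies (month02 ... month12); the first level, month01,
# is the baseline absorbed into the intercept under the default contrasts.

# The month02 column is exactly the indicator "this row is in month 02".
dummy_matches <- all(X[, "month02"] == as.numeric(dat$month == "02"))

# Prediction by hand: each coefficient times its column value, summed per row.
by_hand <- drop(X %*% coef(fit))
same_as_predict <- isTRUE(all.equal(unname(by_hand), unname(predict(fit))))

# Stepwise selection with the argument names spelled out in full.
null.lm <- lm(log(bid) ~ 1, data = dat)
sel <- step(null.lm,
            scope = list(lower = ~ 1, upper = ~ log(x) + month),
            direction = "both", trace = 0)
head(predict(sel))
```

Note that the predictions here are on the log scale, because the response in the model is log(bid).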