I have a categorical independent variable along with other continuous independent variables. Should I dummy code it or should I just treat it as a nominal variable?
Solved – Multiple Linear Regression – categorical variables
multiple regression
Related Solutions
One advantage (out of many) of using R is that it takes care of this problem for you: there is no need to create the dummies yourself, as long as the categorical variables are stored as character strings or factors rather than as numbers.
Some basics: Multiple regression in R
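As a rough illustration of the point above, here is a minimal sketch on made-up data. The data frame `dat` and its columns `group`, `x` and `y` are hypothetical names chosen for the example, not anything from the original question. It shows `lm()` turning a character column into dummy variables on its own.

```r
# Hypothetical toy data: 'group' is a character column, 'x' is continuous.
set.seed(1)
dat <- data.frame(
  group = rep(c("a", "b", "c"), each = 10),
  x     = rnorm(30),
  stringsAsFactors = FALSE
)
# Made-up response with a different offset per group.
offsets <- c(a = 0, b = 1, c = -1)
dat$y <- 1 + 2 * dat$x + unname(offsets[dat$group]) + rnorm(30, sd = 0.1)

# The character column is converted to a factor and dummy coded by lm().
fit <- lm(y ~ x + group, data = dat)

# One coefficient per non-reference level; "a" is the reference level.
print(names(coef(fit)))
```

With the default treatment contrasts, the first level in alphabetical order serves as the reference category, so the model reports coefficients for `groupb` and `groupc` only.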
Not in a meaningful way, unless you have additional information to add to your model.
If all you know is the values of these categorical predictors, then any data belonging to the same category (or combination of categories) will obviously be predicted by your regression model to have the exact same value. To change that, you'd need to have some way of additionally distinguishing points within the same category. Barring that, I don't see how you could say something like "actually this point should be a bit higher than the others" (other than choosing randomly).
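The claim that every point in the same category gets the same prediction can be checked with a small sketch. The data here are simulated and the names `dat`, `group` and `y` are hypothetical. With a single categorical predictor, the fitted values should coincide with the per-category means.

```r
# Hypothetical toy data with one categorical predictor and no other regressors.
set.seed(42)
dat <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 5)),
  y     = rnorm(15)
)

fit <- lm(y ~ group, data = dat)

# Spread of the fitted values within each category (zero up to rounding).
within_spread <- tapply(fitted(fit), dat$group, function(v) diff(range(v)))

# Compare the per-category fitted value with the per-category sample mean.
fitted_means <- tapply(fitted(fit), dat$group, mean)
group_means  <- tapply(dat$y, dat$group, mean)

print(all(within_spread < 1e-8))
print(isTRUE(all.equal(as.numeric(fitted_means), as.numeric(group_means))))
```

Both checks are expected to print `TRUE`, which is just the statement above in numeric form: without an extra regressor there is nothing to distinguish points inside a category.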
Smoothing already implies some additional dimension of interest in the data. That is, there has to be something to "smooth over". E.g. if you had time series data you might have reason to smooth your predictions over time, taking a weighted average of data from adjoining time points. (Although I'd say it generally makes more sense to add this dimension to your model explicitly at the fitting stage, rather than smooth predictions from a model without it.)
Update: Also, it's not necessarily a problem if your predictions are discrete. As Glen_b touched upon as well, a linear regression assumes the data follow a continuous (Normal) distribution around some mean (that is a function of your regressors), but that mean can be discrete, and in fact have any arbitrary distribution you fancy. So if you think your categorical regressors are a good (enough) model for your data, there's no reason to be concerned.
Best Answer
I assume you have more than two categories, because if there are only two categories, it's not an issue.
That depends on how your statistics software handles categorical variables. In R, they are called factors, and if you include a factor in a regression model, it will automatically be dummy coded. However, if the categorical variable is not a factor, but a numerical variable, R will handle it as such, and you will need to specify it as a factor:
factor(variable)
to use it as a categorical variable (and R will create the dummy variables for you).
In SPSS, which is the other statistics software that I'm familiar with, nominal variables will be treated as continuous unless you specify that they are categorical via the "categorical" button in the regression dialog box.
In neither R nor SPSS do you need to create the dummy variables yourself, and I imagine it's the same for most other statistics software today. So in my mind, there is no difference between dummy coding the variable and treating it as a nominal variable, because it's the same thing.
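To make the R behaviour described in this answer concrete, here is a minimal sketch on simulated data. The variable names `region`, `x` and `y` are hypothetical. It contrasts a numerically coded category entered as-is with the same variable wrapped in `factor()`.

```r
# Hypothetical toy data: 'region' holds three categories stored as integer codes.
set.seed(7)
dat <- data.frame(
  region = rep(1:3, each = 10),
  x      = rnorm(30)
)
dat$y <- rnorm(30)

# Entered as-is, 'region' is treated as continuous: a single slope.
fit_numeric <- lm(y ~ x + region, data = dat)

# Wrapped in factor(), it is dummy coded: one coefficient per non-reference level.
fit_factor <- lm(y ~ x + factor(region), data = dat)

print(length(coef(fit_numeric)))   # intercept, x, region
print(length(coef(fit_factor)))    # intercept, x, two dummies

# The design matrix shows the dummy columns R built automatically.
print(colnames(model.matrix(fit_factor)))
```

The numeric version forces the effect of moving from region 1 to 2 to equal the effect of moving from 2 to 3, which rarely makes sense for a nominal variable. The factor version estimates each level's offset from the reference level separately.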