Solved – Achieve continuous predictions in linear regression with all categorical independent variables

categorical data, multiple regression, regression

For the first time, I am trying to predict a continuous dependent variable in a problem where all independent variables are categorical. Say I need to predict a continuous variable from a single categorical independent variable that can take 4 values. After dummy-encoding the categorical variable, a simple linear regression would estimate the coefficients and predict only discrete values (in my case, four distinct values). Is there any way to smooth my predictions?

Best Answer

Not in a meaningful way, unless you have additional information to add to your model.

If all you know is the values of these categorical predictors, then any data belonging to the same category (or combination of categories) will be predicted by your regression model to have the exact same value. With a single dummy-encoded predictor and an intercept, that value is simply the sample mean of the category. To change that, you'd need some way of additionally distinguishing points within the same category. Barring that, I don't see how you could say something like "actually this point should be a bit higher than the others" (other than choosing randomly).
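To make this concrete, here is a minimal sketch using NumPy only. The data are simulated and the group means, sample sizes, and noise level are made up for illustration; they are assumptions, not anything from the question. It fits ordinary least squares on a dummy-encoded 4-level predictor and checks that every point in a category gets the same fitted value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one categorical predictor with 4 levels, 50 points each.
levels = np.repeat(np.arange(4), 50)
true_means = np.array([1.0, 3.0, 2.0, 5.0])  # made-up group means
y = true_means[levels] + rng.normal(scale=0.5, size=levels.size)

# Dummy-encode with level 0 as the reference: intercept + 3 indicator columns.
X = np.column_stack(
    [np.ones(levels.size)] + [(levels == k).astype(float) for k in (1, 2, 3)]
)

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# Per-category sample means of the response.
group_means = np.array([y[levels == k].mean() for k in range(4)])

# One fitted value per category, and the largest spread of fitted values
# within any single category (should be zero up to floating-point error).
pred_per_level = np.array([y_hat[levels == k][0] for k in range(4)])
within_spread = max(np.ptp(y_hat[levels == k]) for k in range(4))

print(pred_per_level)
print(group_means)
```

In this setup the four fitted values should coincide with the four category sample means, which is why there is nothing left within a category for the model to tell points apart.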

Smoothing already implies some additional dimension of interest in the data. That is, there has to be something to "smooth over". E.g. if you had time series data you might have reason to smooth your predictions over time, taking a weighted average of data from adjoining time points. (Although I'd say it generally makes more sense to add this dimension to your model explicitly at the fitting stage, rather than smooth predictions from a model without it.)
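As a rough illustration of adding the extra dimension at the fitting stage, the sketch below simulates a hypothetical continuous covariate `t` (think of it as time) alongside the 4-level category. The covariate, the slope of 0.4, and all other numbers are assumptions made up for the example. It compares a categories-only fit with a fit that also includes `t`.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200
levels = rng.integers(0, 4, size=n)   # hypothetical 4-level category
t = rng.uniform(0.0, 10.0, size=n)    # hypothetical extra dimension (e.g. time)
group_effect = np.array([1.0, 3.0, 2.0, 5.0])  # made-up group means
y = group_effect[levels] + 0.4 * t + rng.normal(scale=0.5, size=n)

dummies = np.column_stack([(levels == k).astype(float) for k in (1, 2, 3)])
X_cat = np.column_stack([np.ones(n), dummies])       # categories only
X_full = np.column_stack([np.ones(n), dummies, t])   # categories + t

beta_cat, *_ = np.linalg.lstsq(X_cat, y, rcond=None)
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

pred_cat = X_cat @ beta_cat
pred_full = X_full @ beta_full

# Range of predictions within each category: zero (up to rounding) for the
# categories-only model, clearly positive once t is in the model.
spread_cat = max(np.ptp(pred_cat[levels == k]) for k in range(4))
spread_full = min(np.ptp(pred_full[levels == k]) for k in range(4))

# Residual sums of squares for the two fits.
rss_cat = float(np.sum((y - pred_cat) ** 2))
rss_full = float(np.sum((y - pred_full) ** 2))

print(spread_cat, spread_full)
print(rss_cat, rss_full)
```

The predictions become continuous only because the model was given a continuous variable to work with; the categorical part still contributes one constant shift per category.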

Update: Also, it's not necessarily a problem if your predictions are discrete. As Glen_b touched upon as well, a linear regression assumes the data follow a continuous (Normal) distribution around some mean (which is a function of your regressors), but that mean can itself take only discrete values, and in fact can have any distribution you fancy. So if you think your categorical regressors are a good (enough) model for your data, there's no reason to be concerned.