Solved – Split dataset by categorical variable or use as a dummy/factor variable

categorical data, feature selection, machine learning, regression

I'm looking for best practices or general guidance on how to approach this situation.

Often I come across datasets with a categorical variable that I am tempted either to split the main dataset into subsets on, or to code as a dummy/factor variable.

For example, I might be trying to look into the price of a car depending upon where it is sold – Asia or Europe. If I am trying to run an OLS regression, random forest, gbm, lasso, etc., what are the best practices or things that should go through my head here?

If, say, the Age or MPG of a car is valued differently in Asia vs. Europe, will the factor variable account for that in the model and produce results similar to what I would get if I just split into two datasets?

Yes, I realize that splitting by the categorical variable removes the ability to 'see' that variable's impact directly, but beyond this I'm looking for guidance. This is a simple example, but I often get approached with situations like this where I need to decide how to come up with all the various groupings and training datasets.

Best Answer

First off, have a look at this question and answer that is close to what you are asking.

If you assume that Age or MPG are valued differently in Asia and Europe, then simply adding a dummy variable to the model does not solve this. The dummy only captures the level effect, not the slope effect. You can see this because the dummy does not show up in the derivative $\frac{\partial Price}{\partial Age}$.
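To make this concrete with the car example (the coefficient names below are chosen here purely for illustration), write out the dummy-only model and differentiate with respect to Age:

$$Price_i=\alpha+\pi D_i+\beta\,Age_i+u_i \quad\Rightarrow\quad \frac{\partial Price}{\partial Age}=\beta$$

The slope on Age is the same for both regions; $D$ only shifts the intercept by $\pi$. Only once you add the interaction $D_i \cdot Age_i$ (as in model (4) below) does the slope become group-specific:

$$Price_i=\alpha+\pi D_i+\beta\,Age_i+\beta_{D}\,D_i\,Age_i+u_i \quad\Rightarrow\quad \frac{\partial Price}{\partial Age}=\beta+\beta_{D}\,D_i$$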

Without loss of generality, assume that there are only two groups, so $K=2$, and one explanatory variable.

The model is thus $y_i=\alpha+\beta_x X_i+u_i$, where you create a dummy variable $D$ that codes the two groups as $D=0$ (group 1) and $D=1$ (group 2).

Essentially, you have several choices of models:

  • $(y_i|D=0)=\gamma+\delta_x X_i+u_i$ if $D=0$ (1)
  • $(y_i|D=1)=\kappa+\phi_x X_i+u_i$ if $D=1$ (2)
  • $y_i=\mu + \nu X_i + \pi D+u_i$ (3)

When splitting the dataset in two parts, you have the following:

  • Yes, you get a better fit of the data than if you simply add the dummy: both the intercept and the slope are group-specific.
  • Unfortunately, each regression uses fewer observations, which causes your estimates to be less precise.
  • Comparing the residual sums of squares (RSS) of models 1 to 3, you find $RSS_1+RSS_2<RSS_3$, so in terms of pure fit you are better off splitting your dataset (see the sketch after this list).
  • The $R^2$, however, is larger in model 3 than the weighted sum of the $R^2$ of the other two models.
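As a quick illustration of the RSS comparison, here is a minimal sketch on simulated data using numpy and statsmodels (my choice of tooling; the variable names price, age, region and all simulated coefficients are made up for this example):

```python
import numpy as np
import statsmodels.api as sm

# Simulate a toy dataset where the Age slope genuinely differs by region
rng = np.random.default_rng(0)
n = 200
region = rng.integers(0, 2, n)          # dummy: 0 = Asia, 1 = Europe (illustrative coding)
age = rng.uniform(1, 15, n)
price = 30 - 1.5 * age - 5 * region + 0.8 * region * age + rng.normal(0, 2, n)

def rss(y, X):
    """Fit OLS with an intercept and return the residual sum of squares."""
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    return fit.ssr

# Models (1) and (2): separate regressions per group
rss1 = rss(price[region == 0], age[region == 0])
rss2 = rss(price[region == 1], age[region == 1])

# Model (3): pooled regression with the dummy but no interaction
rss3 = rss(price, np.column_stack([age, region]))

print(f"RSS_1 + RSS_2 = {rss1 + rss2:.1f}")
print(f"RSS_3         = {rss3:.1f}   (always >= RSS_1 + RSS_2)")
```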

When fully interacting your model, it looks like this:

$y_i=\alpha+\beta_D D+\beta_x X_i+\beta_{Dx}\,D\,X_i+u_i$ (4)

You have the following:

  • Intercept and slope are group-specific.
  • $RSS_1+RSS_2=RSS_4$, which means that model (4) fits the data exactly as well as the two separate models (the sketch below verifies this numerically).
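To check this numerically, here is a small continuation of the simulated example above (again, names and numbers are purely illustrative): the fully interacted pooled regression reproduces the combined fit of the two split regressions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
region = rng.integers(0, 2, n)
age = rng.uniform(1, 15, n)
price = 30 - 1.5 * age - 5 * region + 0.8 * region * age + rng.normal(0, 2, n)

# Split regressions: models (1) and (2)
rss_split = 0.0
for g in (0, 1):
    m = sm.OLS(price[region == g], sm.add_constant(age[region == g])).fit()
    rss_split += m.ssr

# Fully interacted pooled model (4): intercept, D, X, D*X
X4 = sm.add_constant(np.column_stack([region, age, region * age]))
m4 = sm.OLS(price, X4).fit()

print(f"RSS_1 + RSS_2 = {rss_split:.6f}")
print(f"RSS_4         = {m4.ssr:.6f}   (identical up to rounding)")
```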

Notice that splitting and fully interacting still differ in how the variance-covariance matrix of the estimates is computed: the pooled interacted model imposes a single error variance across both groups, while the split regressions estimate a separate error variance for each group. When fully interacting, you also run into the problem that the number of regressors increases rapidly. These are issues you need to take into account, too.
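A rough sketch of that last point, again on simulated data (all names and numbers invented for illustration): when the two groups deliberately have different error variances, the conventional standard error of the Age slope from the pooled interacted model differs from the split-sample one, while a heteroskedasticity-robust covariance estimate relaxes the common-variance assumption.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 400
region = rng.integers(0, 2, n)
age = rng.uniform(1, 15, n)
# Give the two groups different error variances on purpose
sigma = np.where(region == 0, 1.0, 4.0)
price = 30 - 1.5 * age - 5 * region + 0.8 * region * age + rng.normal(0, sigma)

# Standard error of the Age slope in group 0, from the split regression
split0 = sm.OLS(price[region == 0], sm.add_constant(age[region == 0])).fit()
se_split = split0.bse[1]

# Same slope from the fully interacted pooled model
X4 = sm.add_constant(np.column_stack([region, age, region * age]))
pooled = sm.OLS(price, X4).fit()
se_pooled = pooled.bse[2]                                          # coefficient on Age
se_robust = pooled.get_robustcov_results(cov_type="HC1").bse[2]    # robust version

print(f"split-sample SE:           {se_split:.3f}")
print(f"pooled (homoskedastic) SE: {se_pooled:.3f}")
print(f"pooled (HC1 robust) SE:    {se_robust:.3f}")
```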