Solved – Split dataset by categorical variable or use as a dummy/factor variable

categorical data, feature selection, machine learning, regression

I'm looking for best practices or general guidance on how to approach this situation.

Often I come across datasets with a categorical variable that I am tempted either to split the main dataset into subsets on, or to code as a dummy/factor variable.

For example, I might be trying to look into the price of a car depending upon where it is sold – Asia or Europe. If I am trying to run an OLS regression, random forest, gbm, lasso, etc., what are the best practices or things that should go through my head here?

If, say, the Age or MPG of a car is valued differently in Asia vs. Europe, will the factor variable account for that in the model and produce results similar to what I would get if I just split into two datasets?

Yes, I realize that splitting by the categorical variable removes the ability to 'see' that variable's impact directly, but beyond this I'm looking for guidance. This is a simple example, but I often get approached with situations like this where I need to decide how to come up with all the various groupings and training datasets.

Best Answer

First off, have a look at this question and answer that is close to what you are asking.

If you assume that Age or MPG are valued differently in Asia and Europe, then simply adding a dummy variable to the model does not solve this. The dummy only captures the level effect, not the slope effect. You can see this because the dummy does not show up in the derivative $\frac{\partial Price}{\partial Age}$.
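To make this concrete with the car example (the coefficient names below are chosen here purely for illustration), write out the dummy-only model and differentiate with respect to Age:

$$Price_i=\alpha+\pi D_i+\beta\,Age_i+u_i \quad\Rightarrow\quad \frac{\partial Price}{\partial Age}=\beta$$

The slope on Age is the same for both regions; $D$ only shifts the intercept by $\pi$. Only once you add the interaction $D_i \cdot Age_i$ (as in model (4) below) does the slope become group-specific:

$$Price_i=\alpha+\pi D_i+\beta\,Age_i+\beta_{D}\,D_i\,Age_i+u_i \quad\Rightarrow\quad \frac{\partial Price}{\partial Age}=\beta+\beta_{D}\,D_i$$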

Without loss of generality, assume that there are only two groups, so $K=2$, and one explanatory variable.

The model is thus $y_i=\alpha+\beta_x X_i+u_i$, where you create a dummy variable $D$ that codes the two groups as $D=0$ (group 1) and $D=1$ (group 2).

Essentially, you have several choices of models:

  • $(y_i|D=0)=\gamma+\delta_x X_i+u_i$ if $D=0$ (1)
  • $(y_i|D=1)=\kappa+\phi_x X_i+u_i$ if $D=1$ (2)
  • $y_i=\mu + \nu X_i + \pi D+u_i$ (3)

When splitting the dataset in two parts, you have the following:

  • Yes, you get a better fit of the data than if you simply add the dummy: both the intercept and the slope are group-specific.
  • Unfortunately, each regression uses fewer observations, which causes your estimates to be less precise.
  • Comparing the residual sums of squares (RSS) of models 1 to 3, you find $RSS_1+RSS_2<RSS_3$, so in terms of pure fit you are better off splitting your dataset (see the sketch after this list).
  • The $R^2$, however, is larger in model 3 than the weighted sum of the $R^2$ of the other two models.
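As a quick illustration of the RSS comparison, here is a minimal sketch on simulated data using numpy and statsmodels (my choice of tooling; the variable names price, age, region and all simulated coefficients are made up for this example):

```python
import numpy as np
import statsmodels.api as sm

# Simulate a toy dataset where the Age slope genuinely differs by region
rng = np.random.default_rng(0)
n = 200
region = rng.integers(0, 2, n)          # dummy: 0 = Asia, 1 = Europe (illustrative coding)
age = rng.uniform(1, 15, n)
price = 30 - 1.5 * age - 5 * region + 0.8 * region * age + rng.normal(0, 2, n)

def rss(y, X):
    """Fit OLS with an intercept and return the residual sum of squares."""
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    return fit.ssr

# Models (1) and (2): separate regressions per group
rss1 = rss(price[region == 0], age[region == 0])
rss2 = rss(price[region == 1], age[region == 1])

# Model (3): pooled regression with the dummy but no interaction
rss3 = rss(price, np.column_stack([age, region]))

print(f"RSS_1 + RSS_2 = {rss1 + rss2:.1f}")
print(f"RSS_3         = {rss3:.1f}   (always >= RSS_1 + RSS_2)")
```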

When fully interacting your model, it looks like this:

$y_i=\alpha+\beta_D D+\beta_x X_i+\beta_{Dx}\,D\,X_i+u_i$ (4)

You have the following:

  • Intercept and slope are group-specific.
  • $RSS_1+RSS_2=RSS_4$, which means that model (4) fits the data exactly as well as the two separate models (the sketch below verifies this numerically).
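To check this numerically, here is a small continuation of the simulated example above (again, names and numbers are purely illustrative): the fully interacted pooled regression reproduces the combined fit of the two split regressions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
region = rng.integers(0, 2, n)
age = rng.uniform(1, 15, n)
price = 30 - 1.5 * age - 5 * region + 0.8 * region * age + rng.normal(0, 2, n)

# Split regressions: models (1) and (2)
rss_split = 0.0
for g in (0, 1):
    m = sm.OLS(price[region == g], sm.add_constant(age[region == g])).fit()
    rss_split += m.ssr

# Fully interacted pooled model (4): intercept, D, X, D*X
X4 = sm.add_constant(np.column_stack([region, age, region * age]))
m4 = sm.OLS(price, X4).fit()

print(f"RSS_1 + RSS_2 = {rss_split:.6f}")
print(f"RSS_4         = {m4.ssr:.6f}   (identical up to rounding)")
```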

Notice that splitting and fully interacting still differ in how the variance-covariance matrix of the estimates is computed: the pooled interacted model imposes a single error variance across both groups, while the split regressions estimate a separate error variance for each group. When fully interacting, you also run into the problem that the number of regressors increases rapidly. These are issues you need to take into account, too.
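A rough sketch of that last point, again on simulated data (all names and numbers invented for illustration): when the two groups deliberately have different error variances, the conventional standard error of the Age slope from the pooled interacted model differs from the split-sample one, while a heteroskedasticity-robust covariance estimate relaxes the common-variance assumption.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 400
region = rng.integers(0, 2, n)
age = rng.uniform(1, 15, n)
# Give the two groups different error variances on purpose
sigma = np.where(region == 0, 1.0, 4.0)
price = 30 - 1.5 * age - 5 * region + 0.8 * region * age + rng.normal(0, sigma)

# Standard error of the Age slope in group 0, from the split regression
split0 = sm.OLS(price[region == 0], sm.add_constant(age[region == 0])).fit()
se_split = split0.bse[1]

# Same slope from the fully interacted pooled model
X4 = sm.add_constant(np.column_stack([region, age, region * age]))
pooled = sm.OLS(price, X4).fit()
se_pooled = pooled.bse[2]                                          # coefficient on Age
se_robust = pooled.get_robustcov_results(cov_type="HC1").bse[2]    # robust version

print(f"split-sample SE:           {se_split:.3f}")
print(f"pooled (homoskedastic) SE: {se_pooled:.3f}")
print(f"pooled (HC1 robust) SE:    {se_robust:.3f}")
```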