I think your questions are very interesting. I have spent some of my time looking at the effective mapping curvature of random forest (RF) model fits. RF can capture some orders of interactions, depending on the situation: x1 * x2 is a two-way interaction, and so on. You did not write how many levels your categorical predictors have, and it matters a lot. For continuous variables (many levels), often no more than multiple local two-way interactions can be captured. The problem is that the RF model itself only splits and does not transform the data. RF is therefore stuck with local univariate splits, which is not optimal for capturing interactions, and so it is fairly shallow compared to deep learning. At the opposite end of the spectrum are binary features.
I did not know how deep RF can go, so I ran a grid-search simulation. RF seems to capture up to some 4-8 orders of interactions for binary features. I used 12 binary variables and 100 to 15000 observations. E.g., for the 4th-order interaction, the target vector y is:
orderOfInteraction = 4
y = factor(apply(X[,1:orderOfInteraction],1,prod))
where any element of X is either -1 or 1, and the target is the product of the first four variable columns of X. The four variables are completely complementary: there are no main effects and no 2nd- or 3rd-order effects. The OOB prediction error will therefore reflect only how well RF can capture an interaction of the Nth order.
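The "no lower-order effects" claim can be checked by enumerating the design directly. The sketch below is in Python rather than R, purely to verify the arithmetic: for y = x1*x2*x3*x4 with each xi in {-1, 1}, the product of any proper subset of the four variables is uncorrelated with y, so only the full 4-way interaction carries signal.

```python
from itertools import product, combinations
import math

rows = list(product([-1, 1], repeat=4))      # all 16 sign patterns of x1..x4
y = [a * b * c * d for a, b, c, d in rows]   # pure 4th-order target

def signal(subset):
    # mean of (product of the chosen x's) * y over the uniform design;
    # nonzero only if the subset carries information about y
    return sum(math.prod(r[j] for j in subset) * yi
               for r, yi in zip(rows, y)) / len(rows)

low = [signal(s) for k in (1, 2, 3) for s in combinations(range(4), k)]
print(low)                   # all exactly 0.0: no main, 2nd- or 3rd-order effects
print(signal((0, 1, 2, 3)))  # 1.0: only the full interaction predicts y
```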
Things which make RF capture higher orders of interaction: plenty of observations, few levels per variable, few variables.
Limiting factors for RF capturing higher orders: the opposite of the above, plus a limited sampsize, limited maxnodes, and redundant/sufficient lower-order information.
The last one means that if RF can find the same information in lower-order interactions there is, so to say, no need to go deeper. The information does not even have to be redundant; it just has to be sufficient for RF to make correct binary predictions.
Depth of random forest: OOB err.rate vs. observations vs. order of interaction
rm(list=ls())
library(randomForest)
library(parallel)
library(rgl)
simulate.a.forest = function(std.pars, ite.pars) {
  # merge standard parameters with iterated parameters
  run.pars = c(std.pars, ite.pars)
  # simulate data with a hidden interaction of the given order
  X = replicate(run.pars$vars, sample(c(-1, 1), run.pars$obs, replace = TRUE))
  y = factor(apply(X[, 1:run.pars$intOrder], 1, prod))
  # run the forest with run.pars; parameters randomForest does not know are ignored
  rfo = do.call(randomForest, run.pars)
  # fetch the final OOB error rate and return it, named by the iterated parameters
  out = rev(rfo$err.rate[, 1])[1]
  names(out) = paste(ite.pars, collapse = "-")[1]
  return(out)
}
## Let's try some situations (you can also pass arguments to randomForest here)
intOrders = c(2,3,4,5,6,12) #hidden signal is an N-way interaction of Nth order
obss = c(100,500,1000,3500,7000,15000) #available observations
## Produce list of all possible combinations of parameters
ite.pars.matrix = expand.grid(intOrder=intOrders,obs=obss)
n.runs = dim(ite.pars.matrix)[1]
ite.pars.list = lapply(1:n.runs, function(i) ite.pars.matrix[i,])
i=1 ##for test-purposes
out = mclapply(1:n.runs, function(i) {
  # the call below can be run on its own, without mclapply, to check for
  # errors before going multicore
  out = simulate.a.forest(std.pars = alist(x = X, y = y,
                                           ntree = 250,
                                           vars = 12),
                          # sampsize = min(run.pars$obs, 2000)),
                          ite.pars = ite.pars.list[[i]])
  return(out)
})
## View grid results
persp3d(x = intOrders, xlab = "Nth order interaction",
        y = log(obss, base = 10), ylab = "log10(observations)",
        z = matrix(unlist(out), nrow = length(intOrders)),
        zlab = "OOB prediction error, binary target",
        col = c("grey", "black"), alpha = .2)
rgl.snapshot(filename = "aweSomePlot.png", fmt = "png", top = TRUE)
It's spelled out in the docstring:
Binary encoding for categorical variables, similar to onehot,
but stores categories as binary bitstrings.
A bitstring is a binary integer, so all of its digits are zeros and ones. For an integer that is not an exact power of two, the base-two logarithm, rounded up, equals the number of digits in its base-two representation.
n = 3, binary = 11, log2 = 1.58
n = 7, binary = 111, log2 = 2.81
n = 11, binary = 1011, log2 = 3.46
n = 55, binary = 110111, log2 = 5.78
n = 78, binary = 1001110, log2 = 6.29
So the code is calculating the base-two logarithm and rounding it up to determine how many bits it needs to encode all of the categories.
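That arithmetic can be sketched in a few lines (`bits_needed` is just an illustrative name, not the library's own function):

```python
import math

def bits_needed(n_categories):
    # number of binary columns required so every category gets a unique code
    return max(1, math.ceil(math.log2(n_categories)))

for n in (3, 7, 11, 55, 78, 87):
    print(n, bits_needed(n))   # 3->2, 7->3, 11->4, 55->6, 78->7, 87->7
```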
I have a variable called "City" that contains 87 unique values. If I dummy-coded them, I would increase the number of columns by 87, so that each column represented whether that row included the city (1) or not (0). Using this encoder, I only get 7 columns; how do I know which cities are represented?
They are all represented, not just seven.
In a standard encoding, you would create 87 new columns. For each city, exactly one column would be one, and all 86 other columns would be zero. In this encoding, each city is represented by a unique combination of zeros and ones in the seven new columns.
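A minimal sketch of the idea (the city names are hypothetical, and real encoders assign the integer codes in fit order, but the principle is the same): each of the 87 categories maps to a distinct 7-bit pattern, and the mapping is invertible.

```python
import math

cities = [f"City_{i}" for i in range(87)]    # 87 hypothetical unique values
n_bits = math.ceil(math.log2(len(cities)))   # 7 columns

# each city gets a unique pattern of zeros and ones across the 7 columns
encode = {city: [int(b) for b in format(i, f"0{n_bits}b")]
          for i, city in enumerate(cities)}
decode = {tuple(bits): city for city, bits in encode.items()}

print(encode["City_5"])               # [0, 0, 0, 0, 1, 0, 1]
print(decode[(0, 0, 0, 0, 1, 0, 1)])  # City_5
```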
Best Answer
$R^2$ isn't a good measure of model quality; it is a value devoid of context. Imagine you had an $R^2$ of 0.11. Now let us assume that you are modeling human behavior. Think about how many behaviors you perform in a day, and how many distinct categories of behavior they fall into. Now consider the diversity of people on the planet. Add to that the differences in environment, education, and language, the various incentives in place, and the population size. Depending on what you are doing, an $R^2$ of 0.11 may be huge: you have accounted for 11% of observed behavior. On the other hand, for a well-understood process such as aircraft design, if your $R^2$ is 0.11 then you are probably an undergraduate engineering student learning how to design things.
A second issue is that categories overlap and are associated with one another. Certain things are unlikely to happen together: a firm with a huge operating margin and a high volume of sales is unlikely to have a low relative income. Even with millions of observations, it would be unsurprising to find that some joint categories have no observations at all in them. Imagine there is a category with a size of one or two: how is that impacting your parameter estimates?
Consider gender as a category alongside a variable counting the children each person gave birth to. What would the parameter for males imply, given that the regression will calculate one? What if there was an encoding error and some men appeared to have given birth?
Consider three possible solutions: use an information criterion (though not necessarily AIC or BIC; look for one appropriate to your problem), use Bayes factors, or use Bayesian model selection. I would suggest an information criterion over the other two. Bayesian methods do not appear to be what you are doing, because Bayesian hypotheses are combinatoric; they are also calculation-intensive. The information criteria map to the supremum of a set of stylized Bayesian posteriors.
If the set of independent variates is very large and a combinatoric solution isn't feasible, then consider a step-wise regression solution. There are good reasons not to use step-wise regression, but a problem too large for an exhaustive search is a good reason to use it.
Finally, just create a multi-dimensional list of the combinations of variables and look at them. Are any of the counts so small that they are probably creating a poor model? Leave out one variable at a time: is there any set whose removal improves the worst-case cell counts markedly? Are any variables so logically related that they are almost the same variable? Honestly, logic would be your greatest friend here.
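That inspection needs nothing more than a frequency table over the joint categories. A toy sketch (the column names and records are made up):

```python
from collections import Counter

# toy records: (gender, region, segment) -- hypothetical categories
rows = [("F", "north", "retail"), ("F", "north", "retail"),
        ("M", "south", "retail"), ("F", "south", "b2b"),
        ("M", "north", "b2b"),    ("M", "south", "b2b"),
        ("M", "south", "b2b"),    ("F", "north", "b2b")]

# joint cells over all three variables: look for dangerously small counts
cells = Counter(rows)
tiny = {cell: n for cell, n in cells.items() if n <= 2}
print(tiny)

# leave out one variable at a time and see whether the worst-case
# cell count improves markedly
for drop in range(3):
    reduced = Counter(tuple(v for j, v in enumerate(r) if j != drop)
                      for r in rows)
    print("drop column", drop, "-> smallest cell:", min(reduced.values()))
```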
If the set is too large to construct the categories visually, such as if you had millions of joint category intersections, then consider measures of association to reduce your set. You want as much variability in your model as possible, so you want to toss categorical or ordinal variables that are strongly associated with one another.
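One common association measure for a pair of categorical variables is Cramér's V, which runs from 0 (no association) to 1 (one variable determines the other). A stdlib-only sketch with made-up data:

```python
from collections import Counter
import math

def cramers_v(xs, ys):
    # Cramér's V from the chi-square statistic of the contingency table
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    chi2 = sum((joint.get((a, b), 0) - px[a] * py[b] / n) ** 2
               / (px[a] * py[b] / n)
               for a in px for b in py)
    k = min(len(px), len(py)) - 1
    return math.sqrt(chi2 / (n * k))

a = ["x", "x", "y", "y"] * 25
b = ["p", "p", "q", "q"] * 25   # identical grouping to a
c = ["p", "q", "p", "q"] * 25   # balanced against a
print(cramers_v(a, b))   # 1.0 -> redundant with a, a candidate to drop
print(cramers_v(a, c))   # 0.0 -> carries independent variability
```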
Unfortunately, there isn't an automatic, clean answer to your problem. Some reduction tools such as principal components analysis or factor analysis probably will not work as intended, because overlaps in categorical variables can have the effect of forcing orthogonality, depending on the specifics of what you are doing.