Solved – Effect of categorical interaction terms with random forest machine learning algorithm

categorical data, classification, interaction, machine learning, random forest

Thanks in advance for the help.

I have a moderately large dataset (around 7000 samples) with numerous categorical predictors and a single binary response. All of the predictors are categorical. Through trial and error I have found that a random forest model produces the best predictive accuracy for the response.

In order to increase the accuracy of my model I decided to include pairwise interaction terms. However, doing so led to a decrease in accuracy. I initially assumed this was because including interaction terms adds a significant number of predictors to my training set. Specifically, before adding interaction terms I have 12 predictors, and after adding them around 60; with dummy encoding that becomes approximately 60 predictors and a few hundred, respectively.
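To make the setup concrete, here is a minimal sketch (on made-up factor data, not my real dataset) of how the dummy-encoded design can be built with and without the pairwise interaction columns. The column names and the a:c signal are purely illustrative.

  # Illustrative only (made-up factors, not my real data): build the design
  # matrix with main effects only, and with all pairwise interactions added.
  library(randomForest)
  set.seed(1)

  n   = 7000
  dat = data.frame(a = factor(sample(letters[1:3], n, replace = TRUE)),
                   b = factor(sample(letters[1:4], n, replace = TRUE)),
                   c = factor(sample(letters[1:2], n, replace = TRUE)))
  # made-up binary response driven by an a:c interaction
  response = factor(as.integer(xor(dat$a == "a", dat$c == "a")))

  X.main = model.matrix(~ .,   data = dat)[, -1]  # dummy-encoded main effects
  X.int  = model.matrix(~ .^2, data = dat)[, -1]  # main effects + pairwise interactions
  dim(X.main); dim(X.int)                         # many more columns once interactions are added

  rf.main = randomForest(x = X.main, y = response, ntree = 250)
  rf.int  = randomForest(x = X.int,  y = response, ntree = 250)
  rev(rf.main$err.rate[, 1])[1]; rev(rf.int$err.rate[, 1])[1]  # compare OOB error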

With this in mind I performed feature selection and picked the 5, 10, 15, etc. best predictors to train my model. This had little effect on the accuracy of my model; in particular, it was still not as good as the model with no interaction terms. Note that I tuned both the number of trees in the forest and the number of features used in the random feature selection.
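For reference, here is one hypothetical way the selection and tuning step could look, continuing from the made-up X.int and response objects in the sketch above (importance-based top-k selection; this is not necessarily the exact procedure I used):

  # Hypothetical selection/tuning sketch (continues the made-up objects above):
  # rank the dummy/interaction columns by Gini importance, keep the top 15,
  # then tune mtry on the reduced design via OOB error.
  rf.all = randomForest(x = X.int, y = response, ntree = 250, importance = TRUE)
  imp    = importance(rf.all, type = 2)                       # mean decrease in Gini
  top15  = rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:15]

  oob.by.mtry = sapply(c(2, 4, 8, 12), function(m) {
    fit = randomForest(x = X.int[, top15], y = response, ntree = 500, mtry = m)
    rev(fit$err.rate[, 1])[1]                                 # final OOB error
  })
  oob.by.mtry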

At this point one might conclude that interaction terms don't add anything to my model. However, I have good evidence that they should be very strong predictors of the response. Ideally I would include interactions more complicated than just pairwise, but I don't have enough samples to do this.

I am wondering now if this behavior is a result of the method I am using, namely that I am using a forest of trees. Suppose for the moment that I am using only one tree. Is it possible that using categorical data to train a tree-type model inherently encodes interactions? If so, this might explain why including the interaction terms leads to a decrease in the predictive performance of my model.

Performing feature selection results in picking almost exclusively interaction predictors. When I produce a random forest using these predictors I am then limiting my "path". I'll motivate this with an example. Suppose I have predictors $w$, $x$, $y$, $z$ and using feature selection I settle on the interaction predictors $wx$ and $yz$. I can no longer produce a tree that branches on $x$ and then on $y$. I can, however, produce such a tree using the original predictors without interactions.
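As a quick check of this intuition on simulated data (made-up ±1-coded binary predictors, not my real dataset), a forest trained on the raw columns alone can learn a response that is a pure two-way interaction, even though no $wx$ column is ever supplied:

  # Made-up check: the response is the pure two-way interaction w*x (no main
  # effects), yet a forest given only the raw columns w, x, y, z learns it
  # through successive splits.
  library(randomForest)
  set.seed(42)

  n   = 7000
  dat = data.frame(replicate(4, sample(c(-1, 1), n, replace = TRUE)))
  names(dat) = c("w", "x", "y", "z")
  resp = factor(dat$w * dat$x)                 # depends only on the w:x interaction

  rf.raw = randomForest(x = dat, y = resp, ntree = 250)
  rev(rf.raw$err.rate[, 1])[1]                 # OOB error should be low despite no wx column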

Here is my hypothesis. Producing a random forest with just the original predictors allows me to "encode" interactions through the branching behavior of trees. Using only the results of feature selection does not perform as well, since the features selected are predominantly interactions; this limits the branching behavior of the generated trees. Finally, including everything doesn't perform as well because it results in far too many predictors, and I don't have enough samples for the model to train properly. I should also note that I tried using the original predictors plus a handful of the best interactions, as determined by feature selection. This also did not perform as well as using only the original predictors.

Here is my question. Am I looking at this the right way? More specifically, does generating a tree using only categorical predictors inherently encode the interactions between those predictors? Also, any insight or advice on how to tackle this problem differently (predicting the binary response) would be much appreciated.

Thank you.

Best Answer

I think your questions are very interesting. I have spent some of my time looking at the effective mapping curvature of random forest (RF) model fits. RF can capture some orders of interactions depending on the situation (x1 * x2 is a two-way interaction, and so on). You did not write how many levels your categorical predictors have; it matters a lot. For continuous variables (many levels), often no more than multiple local two-way interactions can be captured. The problem is that the RF model itself only splits and does not transform the data. RF is therefore stuck with local univariate splits, which is not optimal for capturing interactions, and this makes RF fairly shallow compared to deep learning. At the complete other end of the spectrum are binary features. I did not know how deep RF can go, so I ran a grid-search simulation. RF seems to capture up to some 4-8 orders of interaction for binary features. I used 12 binary variables and 100 to 15000 observations. E.g. for the 4th-order interaction, the prediction target y is:

  orderOfInteraction = 4
  # the class label is the product (parity) of the first 'orderOfInteraction' +/-1 columns of X
  y = factor(apply(X[,1:orderOfInteraction],1,prod))

where every element of X is either -1 or 1, and the target is the product of the first four columns of X. All four variables are completely complementary: there are no main effects and no 2nd- or 3rd-order effects. The OOB prediction error will therefore reflect only how well RF can capture an interaction of the Nth order.

Things which make RF capture higher-order interactions: plenty of observations, few levels per variable, and few variables.

Limiting factors for RF capturing higher orders: the opposite of the above, plus a limited sampsize, a limited maxnodes, and redundant/sufficient lower-order information.

The last one means that if RF can find the same information in lower-order interactions there is, so to speak, no need to go deeper. The information does not even have to be redundant; it just has to be sufficient for RF to make correct binary predictions.
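To illustrate the maxnodes point with a made-up side example (not part of the grid search below): capping the number of terminal nodes keeps the trees too shallow to isolate a 4th-order interaction, so the OOB error should be clearly worse than with fully grown trees.

  # Made-up side example for the maxnodes limitation: a pure 4th-order
  # interaction of binary features, fitted with fully grown trees and with
  # trees capped at 8 terminal nodes.
  library(randomForest)
  set.seed(1)

  n = 5000
  X = replicate(12, sample(c(-1, 1), n, replace = TRUE))
  y = factor(apply(X[, 1:4], 1, prod))                        # no lower-order effects

  rf.deep    = randomForest(X, y, ntree = 250)                # trees grow fully
  rf.shallow = randomForest(X, y, ntree = 250, maxnodes = 8)  # capped tree size

  c(deep    = rev(rf.deep$err.rate[, 1])[1],
    shallow = rev(rf.shallow$err.rate[, 1])[1])               # shallow should be clearly worse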

[Figure: Depth of random forest: OOB err.rate vs. observations vs. order of interaction]

  rm(list=ls())
  library(randomForest)
  library(parallel)
  library(rgl)

  simulate.a.forest = function(std.pars,ite.pars) {
    #Merge standard parameters with iterated parameters
    run.pars = c(std.pars,ite.pars)

    #simulate data of a given order
    X = replicate(run.pars$vars,sample(c(-1,1),run.pars$obs,replace=T))
    y = factor(apply(X[,1:run.pars$intOrder],1,prod))

    #run forest with run.pars; parameters randomForest does not recognize are ignored
    rfo = do.call(randomForest, run.pars)

    #Fetch OOB error.rate and return
    out = rev(rfo$err.rate[,1])[1] #fetch error rate from object
    names(out) = paste(ite.pars,collapse="-")[1]
    return(out)
  }

  ## Let's try some situations (you can also pass arguments to randomForest here)
  intOrders = c(2,3,4,5,6,12) #hidden signal is a pure interaction of the Nth order
  obss = c(100,500,1000,3500,7000,15000) #available observations

  ## Produce list of all possible combinations of parameters
  ite.pars.matrix = expand.grid(intOrder=intOrders,obs=obss)
  n.runs = dim(ite.pars.matrix)[1]
  ite.pars.list   = lapply(1:n.runs, function(i) ite.pars.matrix[i,])

  i=1 ##for test-purposes
  out = mclapply(1:n.runs, function(i){
    #line below can be run alone without mclapply to check for errors before going multicore
    out = simulate.a.forest(std.pars=alist(x=X,y=y,
                                           ntree=250,
                                           vars=12),
                                           #sampsize=min(run.pars$obs,2000)),
                            ite.pars=ite.pars.list[[i]])
    return(out)
  })

  ##view grid results
  persp3d(x = intOrders,xlab="Nth order interaction",
          y = log(obss,base=10),ylab="10log(observations)",
          z = matrix(unlist(out),nrow=length(intOrders)),zlab="OOB prediction error, binary target",
          col=c("grey","black"),alpha=.2)

  rgl.snapshot(filename = "aweSomePlot.png", fmt = "png", top = TRUE)