RandomForestClassifier Parameter Optimization

binary-data · cross-validation · random-forest

I'm an ML novice and I'm wondering if someone can critique what I'm doing (this is a bit open-ended).

  • I have a very small corpus of text documents (n = 122).
  • There is a binary decision associated with each document.
  • I have created a "bag of words" representation of each document, and I'm using Python's RandomForestClassifier to build models that classify the data (a minimal sketch of this setup follows below).

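For context, here is a minimal sketch of what that setup typically looks like in scikit-learn. The toy corpus, labels, and variable names are stand-ins, not the original code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the real corpus of 122 documents and their binary labels
documents = ["the cat sat on the mat", "dogs bark at night",
             "the mat was red", "night trains are loud"]
labels = [0, 1, 0, 1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)   # bag-of-words count matrix

clf = RandomForestClassifier(n_estimators=600)
clf.fit(X, labels)
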
I'm tinkering with the parameters in the RandomForestClassifier in the following way:

  • Run the RandomForestClassifier on 200 random subsets of the data (n = 112 for each of the 200 runs; these numbers were chosen arbitrarily).
  • Rank the importance of each word in my bag-of-words matrix by its average importance across the 200 runs (a sketch of this step follows the list).
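
A sketch of that averaging step, with random stand-in data in place of the real bag-of-words matrix (feature_importances_ is the scikit-learn attribute being averaged):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(122, 500))   # stand-in bag-of-words matrix
y = rng.integers(0, 2, size=122)          # stand-in binary labels

n_runs, subset_size = 200, 112
avg_importance = np.zeros(X.shape[1])
for _ in range(n_runs):
    idx = rng.choice(len(y), size=subset_size, replace=False)
    clf = RandomForestClassifier(n_estimators=600)
    clf.fit(X[idx], y[idx])
    avg_importance += clf.feature_importances_
avg_importance /= n_runs

ranking = np.argsort(avg_importance)[::-1]   # word indices, most important first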

Now I want to see if there is an "optimal" number of features/words for my data set using the RandomForestClassifier.
This is done as follows:

  • Generate 500 random forests (600 trees per forest). Each of the 500 forests uses 112 randomly chosen documents as a training set and the remaining 10 as a test set.
  • Measure the average accuracy of these 500 forests as a function of the number of words/features used to build the models (see the sketch after this list).
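
Under the same stand-in assumptions (ranking is the importance ordering from the previous sketch), the sweep over feature counts might look like this:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(122, 500))   # stand-in data, as above
y = rng.integers(0, 2, size=122)
ranking = np.arange(X.shape[1])           # stand-in importance ranking

n_forests = 500   # as in the question; reduce for a quicker test run
for k in (10, 20, 40, 80, 160, 320):      # number of top-ranked words kept
    cols = ranking[:k]
    accs = []
    for _ in range(n_forests):
        idx = rng.permutation(len(y))
        train, test = idx[:112], idx[112:]
        clf = RandomForestClassifier(n_estimators=600)
        clf.fit(X[np.ix_(train, cols)], y[train])
        accs.append(clf.score(X[np.ix_(test, cols)], y[test]))
    print(k, np.mean(accs))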

Here is what I see: average accuracy peaks at around n = 80 words/features.
[Figure: average accuracy vs. number of words/features, peaking near n = 80]

Questions:

  • I'm sure my approach is unorthodox. Is there a better way to optimize the RandomForest parameters?
  • Is there any "intuitive" explanation for why average accuracy peaks at around 80 words and then tails off? Is it simply that when the number of features gets too large, my forests don't incorporate enough of the good features, so accuracy suffers?
  • Any other parameters that are worth modifying here?
  • Any other classification models worth looking at?

Thank you for any thoughts.

Best Answer

To answer your second question (why accuracy tails off), I put together an example in R that should resemble your problem. I generated ~50 good predictors and ~1000 bad predictors (randomly assigned dummy variables). I start by increasing the number of good predictors, and after maxing those out I incrementally add in all of the bad predictors.

This illustrates what you observe in your data: up to a point the added predictors carry signal and improve accuracy, but past that point you are adding noise features, and they start to drown out the good ones.

[Figure: accuracy vs. number of features in the simulation]

The (admittedly messy) code is below:

library(randomForest)
set.seed(343)

# Binary outcome, ~20% positive class
y <- sample(c(0,1), size=1500, replace=TRUE, prob=c(.8,.2))

# "Good" predictors: the rate of 1s differs between classes
# (prob i when y==1 vs. i/5 when y==0), so each column carries signal
pct_seq <- seq(.2,.1,by=-.002)

good.x <- sample(c(1,0), size=1500, replace=TRUE, prob=c(.21,.79))
for(i in pct_seq) {
    samp1 <- sample(c(1,0), size=1500, replace=TRUE, prob=c(i,1-i))
    samp0 <- sample(c(1,0), size=1500, replace=TRUE, prob=c(i/5,1-i/5))
    good.x <- cbind(good.x,ifelse(y==1,samp1, samp0))
}

# "Bad" predictors: identical distribution in both classes, i.e. pure noise
pct_seq <- rep(.02,1000)

bad.x <- sample(c(1,0), size=1500, replace=TRUE, prob=c(.01,.99))
for(i in pct_seq) {
    samp1 <- sample(c(1,0), size=1500, replace=TRUE, prob=c(i,1-i))
    samp0 <- sample(c(1,0), size=1500, replace=TRUE, prob=c(i,1-i))
    bad.x <- cbind(bad.x,ifelse(y==1,samp1, samp0))
}

# Good predictors come first, noise columns after
x <- cbind(good.x,bad.x)
y.fac <- as.factor(y)

# Fit forests on increasing numbers of columns (good features first, then noise)
var.seq <- c(seq(11,51, by=10), seq(151,951, by=100))
model.results <- data.frame(n.features=integer(), accuracy=numeric())

for (j in var.seq) {
    rf <- randomForest(x[,1:j], y.fac, ntree=1000)
    oob.err <- rf$err.rate[nrow(rf$err.rate), "OOB"]  # OOB error after all trees
    model.results <- rbind(model.results, data.frame(n.features=j, accuracy=1-oob.err))
    print(tail(model.results, 1))
}

plot(model.results$n.features, model.results$accuracy, type="b",
     xlab="Number of features", ylab="OOB accuracy")