Solved – unbalanced samples random Forests


I am trying to predict species presence or absence using randomForest in R (classification). In fact, I am trying to do it for several species, in separate models.

For a couple of the species, the training data are quite unbalanced, e.g., 70 observations of species presence and 6500 observations of species absence.

This is my code:

library(randomForest)

#read in data frame containing observations of species presence/absence and predictor variables
mydata <- read.csv('mydata.csv')

#fit random forests model
fitmodelA <- randomForest(SPECIESA ~ var1 + var2 + var3 + var4 + var5 + var6 + var7 + var8 + var9 + var10, data=mydata, mtry=3, ntree=500, replace=TRUE, importance=TRUE, keep.forest=TRUE)

#predict to new data
predictmodel <- predict(fitmodelA, newdata, type="prob")

In the output prediction, almost the entire study area is predicted with prob > 0.7. Is this the probability of species occurrence, or the probability of species absence?

I want to try to balance the data by forcing the model to select equal sample sizes from observations of presence and absence, e.g., adding the argument

sampsize(70,70)

But I get the error message "Error in if (ncol(x) != ncol(xtest)) stop("x and xtest must have same number of columns")"

What am I doing wrong here?

Best Answer

You probably want to pass it as a named argument, with the two sizes combined into a vector:

sampsize = c(70, 70)
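To make this concrete, here is a minimal sketch of a balanced fit. It assumes the randomForest package is installed; the data frame `toy`, its class labels "absent"/"present", and the three predictors are made up for illustration (the question's mydata.csv is not available), and the class sizes are smaller than in the question to keep it quick. The entries of sampsize follow the order of the factor levels of the response, and the column names of the type="prob" output show which class each probability refers to.

```r
library(randomForest)

set.seed(42)
n_abs  <- 650   # hypothetical number of absences
n_pres <- 70    # hypothetical number of presences

# Synthetic stand-in for the question's data (assumption, not the real data)
toy <- data.frame(
  SPECIESA = factor(c(rep("absent", n_abs), rep("present", n_pres))),
  var1 = c(rnorm(n_abs, 0), rnorm(n_pres, 1.5)),
  var2 = c(rnorm(n_abs, 0), rnorm(n_pres, -1)),
  var3 = rnorm(n_abs + n_pres)
)

# Draw the same number of cases from each class for every tree.
# sampsize is ordered like levels(toy$SPECIESA): "absent", then "present".
n_min <- min(table(toy$SPECIESA))
fit_bal <- randomForest(SPECIESA ~ var1 + var2 + var3, data = toy,
                        mtry = 2, ntree = 500, replace = TRUE,
                        sampsize = c(n_min, n_min))

# One probability column per class; the column names identify the class.
prob <- predict(fit_bal, newdata = toy[1:5, ], type = "prob")
print(colnames(prob))
print(fit_bal$confusion)
```

If the column you were mapping turns out to be the "absent" one, values above 0.7 across most of the study area would simply reflect the majority class.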

You can also experiment with class weights (the classwt argument), which influence the Gini impurity criterion used for picking splits. Check out this paper
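As a rough sketch of the class-weight route, the example below passes classwt to randomForest. It assumes the randomForest package is installed; the data frame `toy_w` and the weights c(1, 10) are arbitrary illustrations rather than recommended values, and the weights are given in the order of the factor levels of the response.

```r
library(randomForest)

set.seed(1)
n_abs  <- 650   # hypothetical number of absences
n_pres <- 70    # hypothetical number of presences

# Synthetic stand-in data (assumption, not the question's data)
toy_w <- data.frame(
  SPECIESA = factor(c(rep("absent", n_abs), rep("present", n_pres))),
  var1 = c(rnorm(n_abs, 0), rnorm(n_pres, 1.5)),
  var2 = c(rnorm(n_abs, 0), rnorm(n_pres, -1))
)

# Up-weight the rare class; order follows levels(toy_w$SPECIESA).
fit_wt <- randomForest(SPECIESA ~ var1 + var2, data = toy_w,
                       ntree = 500, classwt = c(1, 10))

# Out-of-bag confusion matrix, with per-class error in the last column
print(fit_wt$confusion)
```

Comparing the per-class error here against an unweighted fit is one way to judge whether the weights help for the rare class.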