Solved – random forest classification in R – no separation in training set

Tags: classification, r, random forest

Originally posted on Stack Overflow, but suggested to move here…

I'm new to machine learning, and I've tried to perform a random forest classification (randomForest package in R) on some metabolomics data, with poor results. My normal approach for this kind of data would be a PLS-DA strategy, but I decided to also try RF and SVM, since several publications highly recommend these machine learning approaches for omics data.

In my case, 'X' is a 16×100 data frame (16 individuals with 100 recorded features/predictors) read from a CSV file, and 'Y' is a factor vector (length 16) with 8 'high' and 8 'low'. With PLS-DA and with SVM (linear and radial kernels) I get excellent separation, but the RF model misclassifies 3 out of 16.

The RF model looks like: RFA1=randomForest(X,Y)

## read file and fix data frame
in.data = read.csv2(file='Indata.csv', header = FALSE, skip=5)[,-4] # Col 1-3 are identifiers. Features/predictors from col 4
names(in.data)=names(read.csv2(file='Indata.csv',header=T)[,-4])
# str(in.data)
#  $ ID         : Factor w/ 27 levels "2","3","4","5",..: 2 3 4 6 8 10 20 23 5 11 ...
#  $ period     : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
#  $ consumption: Factor w/ 2 levels "high","low": 1 1 1 1 1 1 1 1 2 2 ...
#  $ FEATURES...

## Sort DF into X (features) and Y (classifier based on consumption)
y = in.data$consumption                   # Classifier based on high vs low consumption
x = in.data[,-1:-3]                       # 100 features/predictors into X NB Contains many NAs
nr=nrow(x)
nc=ncol(x)
x.na = as.data.frame(is.na(x))            # Find NAs in X
col.min=apply(x,2,min,na.rm=T)            # Find min value per feature (omitting NAs)
## Deal with zero/missing data-situation
x2=x                                      # Compute new x2 matrix without NA
for (i in 1:nc) {
    x2[x.na[,i],i]=col.min[i]             # Substitute missing data with col.min
}
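
## The same column-minimum imputation can also be written without the loop
## (a sketch; assumes every feature column of x is numeric):
# x2 = as.data.frame(lapply(x, function(col) replace(col, is.na(col), min(col, na.rm=T))))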

## Make classifiers according to period (A vs B)
a.ind = in.data$period=='A'
b.ind = in.data$period=='B'

## Choose data from period A only & transform/scale X
x2a=x2[a.ind,]                 # Original data
x2a.scale=scale(x2a)           # Scaled
x2a.log=log(x2a)               # Log-transformed
x2a.logscale=scale(log(x2a))   # Log-transformed and scaled
ya=y[a.ind]

## Perform analysis for period A
library(randomForest)
(rfa1=randomForest(x2a,ya))
(rfa2=randomForest(x2a.scale,ya))
(rfa3=randomForest(x2a.log,ya))
(rfa4=randomForest(x2a.logscale,ya))

This generates output like:

Call:
 randomForest(x = x2a, y = ya) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 10

        OOB estimate of  error rate: 18.75%
Confusion matrix:
     high low class.error
high    6   2       0.250
low     1   7       0.125

I have played around with both mtry (5-50) and ntree (500-2000) with no apparent improvement. I've also tried combinations of transforming and scaling 'X'. But as I understand it, RF is non-parametric and the trees only use the ordering of each predictor, so monotonic transformations and scaling shouldn't change the results anyway.
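
As a quick sanity check of that (on the built-in iris data rather than my own, since the idea is the same): with an identical random seed, a forest fitted on the raw predictors and one fitted on log-transformed, scaled predictors should produce the same OOB confusion matrix, because the splits only depend on the ordering of each feature.

library(randomForest)

set.seed(1)
rf.raw = randomForest(iris[, 1:4], iris$Species)

set.seed(1)
rf.log = randomForest(scale(log(iris[, 1:4])), iris$Species)

rf.raw$confusion
rf.log$confusion   # expected to match rf.raw$confusion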

For comparison, using the exact same data, PLS-DA using SIMCA13 provides excellent separation already in the 1st component. SVM using the kernlab package in R provides 0 training error. At this stage I'm not looking at validation or using test sets. I want to first make sure I get good classification on my training set.
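
A sketch of how to pull the training-set (resubstitution) predictions out of the fitted forest, as opposed to the OOB figures in the printout above (assuming rfa1, x2a and ya from the code above); the resubstitution confusion matrix is the quantity directly comparable to the SVM's training error:

oob.pred   = predict(rfa1)                   # OOB predictions: each observation predicted only by trees that did not see it
train.pred = predict(rfa1, newdata = x2a)    # resubstitution predictions: the whole forest applied to its own training data
table(observed = ya, predicted = train.pred) # training-set confusion matrix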

I'm sure I'm missing something, but I don't know what. I hope to have supplied sufficient information to describe the problem.

Thanks in advance for any help!

Sincerely,

Calle

Best Answer

Because of your small sample size, I would recommend using leave-one-out cross-validation (LOOCV) to estimate model fit. The algorithm goes like this:

  1. Take out an observation from the data set.
  2. Fit the model.
  3. Estimate the class label of the held out observation.
  4. Repeat steps 1-3 until all observations have been held out.

To see the algorithm's performance, look at how well it fits the held-out data.
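
A minimal sketch of that loop with randomForest, using the x2a and ya objects from the question (purely illustrative; no tuning):

library(randomForest)

n = nrow(x2a)
loo.pred = factor(rep(NA, n), levels = levels(ya))

set.seed(1)
for (i in 1:n) {
  fit = randomForest(x2a[-i, ], ya[-i])                          # steps 1-2: hold out observation i, fit on the rest
  loo.pred[i] = predict(fit, newdata = x2a[i, , drop = FALSE])   # step 3: predict the held-out observation
}

table(observed = ya, predicted = loo.pred)   # step 4: how well were the held-out cases classified?
mean(loo.pred != ya)                         # LOOCV misclassification rate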

You may find, as Simone suggested, that your algorithm is overfitting the data.

One thing that I have found is that you need many more than 16 observations to fit a good classification model.

Check out the cvTools package to perform the cross validation.
