Originally posted on Stack Overflow, but it was suggested that I move the question here…
I'm new to machine learning, and I've tried to perform a Random Forest classification (the randomForest package in R) on some metabolomics data, with poor results. My normal approach in this case would be a PLS-DA strategy, but I decided to try both RF and SVM, since several publications highly recommend these machine-learning approaches for omics data.
In my case, 'X' is a 16×100 data frame (16 individuals with 100 recorded features/predictors) read from a CSV file, and 'Y' is a factor vector (length 16) with 8 'high' and 8 'low'. With both PLS-DA and SVM (linear and radial kernels) I get excellent separation, but the RF model misclassifies 3 out of 16.
The RF model call looks like: RFA1 = randomForest(X, Y)
## read file and fix data frame
in.data = read.csv2(file='Indata.csv', header = FALSE, skip=5)[,-4] # Col 1-3 are identifiers. Features/predictors from col 4
names(in.data)=names(read.csv2(file='Indata.csv',header=T)[,-4])
# str(in.data)
# $ ID : Factor w/ 27 levels "2","3","4","5",..: 2 3 4 6 8 10 20 23 5 11 ...
# $ period : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
# $ consumption : Factor w/ 2 levels "high","low": 1 1 1 1 1 1 1 1 2 2 ...
# $ FEATURES...
## Sort DF into X (features) and Y (classifier based on consumption)
y = in.data$consumption # Classifier based on high vs low consumption
x = in.data[,-1:-3] # 100 features/predictors into X NB Contains many NAs
nr=nrow(x)
nc=ncol(x)
x.na = as.data.frame(is.na(x)) # Find NAs in X
col.min=apply(x,2,min,na.rm=T) # Find min value per feature (omitting NAs)
## Deal with zero/missing data-situation
x2=x # Compute new x2 matrix without NA
for (i in 1:nc) {
  x2[x.na[,i],i] = col.min[i] # Substitute missing data with col.min
}
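As an aside, the min-imputation loop above can also be written without an explicit loop, using only base R's matrix indexing. A small sketch; `dat` is a tiny hypothetical stand-in for the real feature table:

```r
## A loop-free version of the per-column min-imputation (base R only).
## 'dat' is a hypothetical stand-in for the real feature data frame.
dat <- data.frame(f1 = c(1, NA, 3), f2 = c(NA, 5, 2))

dat.min <- apply(dat, 2, min, na.rm = TRUE)    # per-column minimum, ignoring NAs
dat.m   <- as.matrix(dat)
na.idx  <- which(is.na(dat.m), arr.ind = TRUE) # row/col position of every NA
dat.m[na.idx] <- dat.min[na.idx[, "col"]]      # fill each NA with its column minimum
dat2 <- as.data.frame(dat.m)

dat2 # no NAs remain
```

The `which(..., arr.ind = TRUE)` trick returns a two-column row/column index matrix, which can be used directly for element-wise replacement.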
## Make classifiers according to period (A vs B)
a.ind = in.data$period=='A'
b.ind = in.data$period=='B'
## Choose data from period A only & transform/scale X
x2a=x2[a.ind,] # Original data
x2a.scale=scale(x2a) # Scaled
x2a.log=log(x2a) # Log-transformed
x2a.logscale=scale(log(x2a)) # Log-transformed and scaled
ya=y[a.ind]
## Perform analysis for period A
library(randomForest)
(rfa1=randomForest(x2a,ya))
(rfa2=randomForest(x2a.scale,ya))
(rfa3=randomForest(x2a.log,ya))
(rfa4=randomForest(x2a.logscale,ya))
This generates output like:
Call:
randomForest(x = x2a, y = ya)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 10
OOB estimate of error rate: 18.75%
Confusion matrix:
     high low class.error
high    6   2       0.250
low     1   7       0.125
I have played around with both mtry (5-50) and ntree (500-2000) with no apparent success. I've also tried combinations of transforming and scaling 'X'. But as I understand it, RF is tree-based and therefore invariant to monotone transformations of individual features, so transforming and scaling shouldn't change the results anyway.
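For a more systematic search over mtry than manual trial, the randomForest package includes tuneRF, which steps mtry up and down from the default while tracking the OOB error. A minimal sketch on synthetic data of the same shape as mine (16 × 100, 8 'high' / 8 'low'; x.demo and y.demo are made-up stand-ins for the real x2a and ya):

```r
library(randomForest)

set.seed(42)
## Synthetic stand-ins shaped like the real data: 16 samples, 100 features
x.demo <- as.data.frame(matrix(rnorm(16 * 100), nrow = 16))
y.demo <- factor(rep(c("high", "low"), each = 8))

## Double/halve mtry from the default; continue while OOB error improves by >1%
res <- tuneRF(x.demo, y.demo, ntreeTry = 500, stepFactor = 2,
              improve = 0.01, trace = FALSE, plot = FALSE)
res # one row per mtry tried, with its OOB error estimate
```

With pure-noise data like this the OOB error will hover around 50%, so the output itself is uninformative; the point is the search procedure.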
For comparison, using the exact same data, PLS-DA in SIMCA 13 provides excellent separation already in the first component, and SVM using the kernlab package in R gives 0 training error. At this stage I'm not looking at validation or test sets; I first want to make sure I get good classification on the training set.
I'm sure I'm missing something, but I don't know what. I hope I have supplied sufficient information to describe the problem.
Thanks in advance for any help!
Sincerely,
Calle
Best Answer
Because of your sample size, I would recommend using leave-one-out cross-validation (LOOCV) to estimate model fit. The algorithm goes like this: for each of your 16 observations, hold that one observation out, fit the model on the remaining 15, and predict the held-out observation. To see the algorithm's performance, look at how well it predicted the held-out data.
You may find, as Simone suggested, that your algorithm is overfitting the data.
One thing that I have found is that you typically need many more than 16 observations to fit a good classification model.
Check out the cvTools package to perform the cross validation.
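The LOOCV steps above can be sketched as a plain loop around randomForest; here on synthetic stand-in data (x.demo and y.demo are made-up; substitute your x2a and ya), while cvTools wraps the same idea more generally:

```r
library(randomForest)

set.seed(1)
## Synthetic stand-ins: 16 samples, 100 features, balanced two-class response
x.demo <- as.data.frame(matrix(rnorm(16 * 100), nrow = 16))
y.demo <- factor(rep(c("high", "low"), each = 8))

## Leave-one-out CV: hold out sample i, train on the rest, predict sample i
pred <- factor(rep(NA, nrow(x.demo)), levels = levels(y.demo))
for (i in seq_len(nrow(x.demo))) {
  fit <- randomForest(x.demo[-i, ], y.demo[-i])
  pred[i] <- predict(fit, x.demo[i, , drop = FALSE])
}

## Held-out error rate: an estimate of generalization performance
loocv.error <- mean(pred != y.demo)
loocv.error
```

If the training error is near zero but the LOOCV error is much higher, that is the overfitting Simone described.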