Solved – Random Forest in R – How to perform feature extraction and reach the best Accuracy result

accuracy, importance, machine learning, r, random forest

I'm working on a university project where I need to build a Random Forest model in R to predict whether patients have depressive tendencies based on their EEG data. I have already preprocessed the data and built a general model for the forest. Currently, I'm fine-tuning it to achieve the best possible prediction accuracy. If I understand correctly, I need to do some feature extraction (at the moment I have 1584 features) and tune the hyperparameters. But I'm not sure how to perform the feature extraction in R. Right now, I'm doing this:

library(dplyr)
library(caret)
library(randomForest)
library(eegkit)
library(rlist)
library(edfReader)
library(eegUtils)
library(e1071)
library(ggplot2)

yourdata_neu <- data.frame(df_test)
rownames(yourdata_neu) <- NULL

set.seed(42)
###############################Random Forest
Importance_Table <- NULL   # collects the variable importances of the 10 runs

for (t in 1:10) {
seed <- sample.int(10000, 1)   # draw one integer seed (sample.int(10) returns a whole vector, which set.seed cannot use)
set.seed(seed)
  seeds <- vector(mode = "list", length = 50)
  for(i in 1:50){
    seeds[[i]] <- sample.int(1000, 12)}

  ## For the last model:
  seeds[[50]] <- sample.int(1000, 1)

yourdata_neu$Depressiv <- as.factor(yourdata_neu$Depressiv)

inTraining <- createDataPartition(yourdata_neu$Depressiv, p = 0.70, list = FALSE)
training <- yourdata_neu[inTraining,] 
testing <- yourdata_neu[-inTraining,]

train_control <- trainControl(method="cv", number=10, verboseIter = TRUE, seeds = seeds) 

# Predictors are all columns except the last; the outcome (Depressiv) is the last column.
# Classification is inferred from the factor outcome, so no "type" argument is needed.
model <- train(training[, 1:(ncol(yourdata_neu) - 1)], as.factor(training[, ncol(yourdata_neu)]),
               method = "rf", metric = "Accuracy", maximize = TRUE,
               trControl = train_control, importance = TRUE)

prediction2 <- predict(model, testing[, 1:(ncol(yourdata_neu) - 1)])

confusionMatrix(prediction2, as.factor(testing[, ncol(yourdata_neu)]), positive = "1")

model1 <- randomForest(training[, 1:(ncol(yourdata_neu) - 1)], as.factor(training[, ncol(yourdata_neu)]),
                       importance = TRUE, proximity = TRUE)

prediction1 <- predict(model1, testing[, 1:(ncol(yourdata_neu) - 1)])
print(confusionMatrix(prediction1, as.factor(testing[, ncol(yourdata_neu)]), positive = "1"))

# Collect the (scaled) variable importances of this run
varImp2 <- varImp(model, scale = TRUE)
Adding_columns <- t(varImp2$importance)
rownames(Adding_columns) <- paste0(rownames(Adding_columns), ".", t)
Importance_Table <- rbind(Importance_Table, Adding_columns)
}

# Mean importance of every feature over the 10 runs
Importance_Table_Mean <- t(apply(Importance_Table, MARGIN = 2, function(x) mean(x, na.rm = TRUE)))
Importance_Table_Filter <- as.data.frame(Importance_Table_Mean)
# Keep the features whose mean importance is below 50 (threshold chosen by me) ...
Importance_Table_Filter <- Importance_Table_Filter[, as.vector(Importance_Table_Filter < 50), drop = FALSE]
Importance_Table_Filter <- colnames(Importance_Table_Filter)
# ... and drop exactly those features from the data set
Excluding_Channels <- names(yourdata_neu) %in% Importance_Table_Filter
yourdata_neu <- yourdata_neu[!Excluding_Channels]

In short, the code above does the following:

  1. Run the random forest 10 times and, each time, save the importance of every feature in the data frame "Importance_Table".
  2. Calculate the mean of the 10 trials for each feature in "Importance_Table_Mean".
  3. Use the data frame "Importance_Table_Filter" to save all features with an importance value under 50 (50 was chosen by me).
  4. Get the colnames (feature names) of the data frame "Importance_Table_Filter" and save them in the variable "Excluding_Channels".
  5. Drop all channels with an importance value under 50 from my data set (yourdata_neu) and keep the ones with 50 or more.

But I get the feeling that this isn't a good approach and is quite subjective.
Does anybody have an idea how to improve my model? My idea was to perform the feature extraction first and then optimize the hyperparameters. Is it common or advisable to do it like this? I'm grateful for every input 🙂
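
For the hyperparameter part, this is roughly what I was planning to try afterwards (just a rough sketch; the mtry values in the grid are placeholders I picked, nothing tested yet):

# Sketch: tune mtry (the main hyperparameter caret exposes for method = "rf") on the training set
tune_grid <- expand.grid(mtry = c(10, round(sqrt(ncol(training) - 1)), 100, 400))
tuned_model <- train(training[, 1:(ncol(training) - 1)], training$Depressiv,
                     method = "rf", metric = "Accuracy",
                     trControl = trainControl(method = "cv", number = 10),
                     tuneGrid = tune_grid, importance = TRUE)
tuned_model$bestTune   # mtry value with the best cross-validated accuracy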

EDIT: Here are the values contained in the variable "yourdata_neu" (it holds the df_test values):
https://drive.google.com/file/d/1k02hyqU51cAy5ka1gs5Ydr5_gOM5vTew/view?usp=sharing

Best Answer

You may be interested in using the recursive feature elimination (RFE) function in the caret package:

1. Eliminate highly correlated variables

# You can use any threshold you want to deem a correlation too high. Here we use .80.
# Correlations are computed on the predictors only, since the outcome "Depressiv" is a factor.
predictorData = select(yourdata_neu, -Depressiv)
nonColinearData = cbind(predictorData[, -findCorrelation(cor(predictorData), cutoff = .8)], Depressiv = yourdata_neu$Depressiv)

2. Use recursive feature elimination

# Set RFE control
ctrl = rfeControl(functions = rfFuncs, # "rfFuncs" are built-in to caret
                  method = "repeatedcv", repeats = 10,
                  saveDetails = TRUE)
# By using rfFuncs, caret will use a random forest to evaluate the usefulness of a feature.

# Set a sequence of feature-space sizes to search over:
sizes = seq(round(sqrt(ncol(nonColinearData)) * .5), ncol(nonColinearData) - 1, by = 5)
# note, this will fit hundreds of forests (not trees), so it may take a while.

# Use caret's rfe function to fit RF models to these different feature spaces
rfeResults = rfe(x = select(nonColinearData, -Depressiv), y = nonColinearData$Depressiv,
                 sizes = sizes,
                 rfeControl = ctrl)

3. Evaluate results:

rfeResults$results
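
From there, the rfe object tells you which subset size did best and which features it kept; you can then refit on just those columns. A sketch (the final train() call and its settings are only an example, not a prescription):

rfeResults$optsize                      # best number of features found by RFE
predictors(rfeResults)                  # names of the selected features
plot(rfeResults, type = c("g", "o"))    # CV accuracy vs. number of features

# Example: retune a random forest on the reduced feature set only
reducedData = nonColinearData[, c(predictors(rfeResults), "Depressiv")]
finalModel = train(Depressiv ~ ., data = reducedData, method = "rf",
                   metric = "Accuracy", tuneLength = 5,
                   trControl = trainControl(method = "cv", number = 10))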