Overfitting in a randomForest model in R: why?

classification, machine-learning, natural-language, r, random-forest

I am trying to train a Random Forest model in R for sentiment analysis. The model works with a tf-idf matrix and learns from it how to classify a review as positive or negative.

Positive reviews are labeled 1 and negative reviews are labeled 0.

I wrote code in R that converts the two labels to factors, then splits the dataset into training and test sets before training the model.

The tf-idf matrix has dimensions 502 × 5477.

The model also ends up selecting mtry = 5477, i.e. every predictor is considered at each split.

I get accuracy = 1 on the test set, which is strange. Is the problem in the logic of the code, or should I add parameters to limit mtry, etc.?

# Load necessary libraries
library(randomForest)
library(readxl)
library(caret)
library(e1071)

# Load the original DataFrame with labels
df <- read_excel('~/Downloads/tfidf_r.xlsx')
df$label...2 <- as.factor(df$label...2)

# Split the data into training and testing sets,
# stratifying on the label rather than on the review id
set.seed(42)
train_indices <- createDataPartition(df$label...2, p = 0.7, list = FALSE)
train_data <- df[train_indices, ]
test_data <- df[-train_indices, ]

# Train the Random Forest classifier with 10-fold cross-validation
# (review_id is an identifier, so it is excluded from the predictors)
random_forest_model <- train(label...2 ~ . - review_id, data = train_data, method = "rf",
                             trControl = trainControl(method = "cv", number = 10))

# Print the model
print(random_forest_model)

# Make predictions on the test data and evaluate them
y_pred <- predict(random_forest_model, newdata = test_data)
confusionMatrix(y_pred, test_data$label...2)

Here you can see the output:

Accuracy : 1          
                 95% CI : (0.9757, 1)
    No Information Rate : 0.6067     
    P-Value [Acc > NIR] : < 2.2e-16  

Best Answer

I bet you have data leakage: apparently you generated a single tf-idf matrix from the whole dataset, instead of training your model on a tf-idf matrix generated only from the train set. If you think about it, your test set is data you are not supposed to have seen before, so it is illogical to let words and document frequencies from the test set influence the tf-idf matrix used for training.

To solve this problem, randomly split the very first version of your original, untouched dataset into train and test sets, and then create a tf-idf matrix from the train set only. Train your model on it. Then transform your test set using the same vocabulary and idf weights fitted on the train set, so you can evaluate the trained model on it.

The link I gave above explains how to do that in Python, so you can get an idea of the process. I don't know the details of how to do the same thing in R, though, and it's probably more of a question for Stack Overflow, as this part of the problem is really about programming.
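As a rough sketch, the train-only fitting described above can be done in R with the text2vec package. Everything here is illustrative: the `reviews` and `labels` vectors are toy data standing in for the original spreadsheet, which is not shown in the post.

```r
library(text2vec)

# Toy data standing in for the real reviews (hypothetical)
reviews <- c("great movie, loved it", "terrible film",
             "loved the acting", "awful plot, terrible",
             "great fun overall", "boring and awful")
labels <- factor(c(1, 0, 1, 0, 1, 0))

set.seed(42)
train_idx <- sample(length(reviews), size = 0.7 * length(reviews))

# Fit the vocabulary and idf weights on the TRAIN texts only
it_train <- itoken(reviews[train_idx], preprocessor = tolower,
                   tokenizer = word_tokenizer)
vocab <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)
tfidf <- TfIdf$new()
dtm_train <- fit_transform(create_dtm(it_train, vectorizer), tfidf)

# Apply the SAME vectorizer and idf weights to the test texts;
# unseen test words are simply dropped, as they should be
it_test <- itoken(reviews[-train_idx], preprocessor = tolower,
                  tokenizer = word_tokenizer)
dtm_test <- transform(create_dtm(it_test, vectorizer), tfidf)
```

The key point is that `vectorizer` and `tfidf` are fitted once, on the train split, and then merely applied to the test split, so both matrices share the same columns and no test-set information leaks into training.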
