Best Model Selection for Data Prediction Using R and Machine Learning with Caret

Tags: caret · machine learning · r

I'm exploring some ML strategies using the caret package. My goal is to select the best predictors and obtain an optimal model for further predictions. My dataset is:

  • 75 observations (39 S and 36 F for the dependent variable named 'group') – the dataset is well balanced

  • 13 independent variables (predictors) with continuous values from 0 to 1 and no NAs, named:

    A_1, A_2, A_3, A_4, A_5, B_1, B_2, C_1, C_2, C_3, D_1, D_2, E_1

Moreover, the values of each predictor differ significantly between groups (F vs. S, Wilcoxon test).
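For reference, such a per-predictor comparison can be run roughly like this (a minimal sketch assuming the full data frame is called data with the grouping factor in 'group'; not necessarily the exact code I used):

# Two-sample Wilcoxon test of F vs S for every predictor column
pvals <- sapply(setdiff(names(data), "group"),
                function(v) wilcox.test(data[[v]] ~ factor(data$group))$p.value)
sort(pvals)  # smallest p-values first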

I started by dividing the data and setting up repeated 10-fold cross-validation:

set.seed(355)
trainIndex <- createDataPartition(data$group, p = 0.7, list = FALSE)
trainingSet <- data[trainIndex,]
testSet <- data[-trainIndex,]

methodCtrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  savePredictions = "final",
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

Then, based on several articles and tutorials, I selected six ML methods and fit models with all predictor variables:

rff <- train(group ~., data = trainingSet, method = "rf", metric = "ROC", trControl = methodCtrl)
nbb <- train(group ~., data = trainingSet, method = "nb", metric = "ROC", trControl = methodCtrl)
glmm <- train(group ~., data = trainingSet, method = "glm", metric = "ROC", trControl = methodCtrl)
nnett <- train(group ~., data = trainingSet, method = "nnet", metric = "ROC", trControl = methodCtrl)
glmnett <- train(group ~., data = trainingSet, method = "glmnet", metric = "ROC", trControl = methodCtrl)
svmRadiall <- train(group ~., data = trainingSet, method = "svmRadial", metric = "ROC", trControl = methodCtrl)

How accurate are the models?

fitted <- predict(rff, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")  # 61% / 61%
fitted <- predict(nbb, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")  # 66% / 66%
fitted <- predict(glmm, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")  # 57% / 66%
fitted <- predict(nnett, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")  # 42% / 66%
fitted <- predict(glmnett, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")  # 61% / 57%
fitted <- predict(svmRadiall, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted, mode = "everything", positive = "F")  # 66% / 66%

The comments give the test-set accuracy of each model: the first percentage is with all 13 predictors, the second with the 4 predictors selected below. I also drew a simple ROC comparison of all models:
[ROC comparison plot of all six models]
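For reference, one way such a comparison can be produced (a minimal sketch using caret's resamples() and the pROC package; not necessarily the exact code I used):

library(pROC)

# Compare cross-validated ROC across the six fitted models
resamps <- resamples(list(rf = rff, nb = nbb, glm = glmm,
                          nnet = nnett, glmnet = glmnett, svm = svmRadiall))
summary(resamps)
bwplot(resamps, metric = "ROC")

# Test-set ROC curve for one model, using class probabilities of the positive class "F"
probs <- predict(rff, testSet, type = "prob")
rocRf <- roc(response = testSet$group, predictor = probs$F, levels = c("S", "F"))
plot(rocRf)
auc(rocRf)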

Now I'd like to improve my model, so I used glmStepAIC to keep only the best (most important) predictors. Here's what I got:

aic <- train(group ~., data = trainingSet,  method = "glmStepAIC", trControl = methodCtrl, metric = "ROC", trace = FALSE)
summary(aic$finalModel)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0191  -0.6077   0.3584   0.6991   2.5416  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.39809    1.01733   0.391  0.69557   
A_5          0.11726    0.04701   2.494  0.01263 * 
C_2          0.17789    0.11084   1.605  0.10852   
C_3         -0.18231    0.11027  -1.653  0.09828 . 
E_1         -0.14176    0.05260  -2.695  0.00704 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 74.786  on 53  degrees of freedom
Residual deviance: 48.326  on 49  degrees of freedom
AIC: 58.326

Number of Fisher Scoring iterations: 5

Based on this result I chose these 4 predictor variables:

data <- read.table("data.txt", sep ='\t',header = T, dec = ',')
data <- data[,c('group','A_5','C_2','C_3','E_1')]

And I repeated everything – data division into train/test, model fitting, model testing, etc. – only with these 4 predictors instead of all 13. Unfortunately, the accuracy is still low; see the second percentage in the comments of the confusionMatrix calls above. Moreover, the ROC comparison is even worse:

[ROC comparison plot with the 4 selected predictors]

I'm new to this kind of analysis, so could you please tell me whether I'm making some serious mistake in my analysis, or whether my dataset is too small / my data is rubbish?
How can I choose the optimal predictors to get the best model? Which ML methods should I pick?

Best Regards,
Adam

Best Answer

1.) We do not know much about your data; we only see the number of features and rows. Keep in mind that data cleaning and preprocessing can vastly impact all of your prediction results.

2.) David is right to the extent that 13 predictors with 75 observations is a little bit gruesome, so the Lasso is an obvious preprocessing step. The other question is whether you are interested only in prediction, or also in inference. If you are interested in prediction, I would look specifically at my last hint. If you are interested in inference, keep in mind that multicollinearity will distort the feature/permutation importance of your predictors. In that case the Lasso is strongly recommended, since it limits the number of model solutions that arise from random seeds, and this greatly limits the distortion of the order and impact of feature/permutation importances, which is good. A feature-importance selection like yours can already be distorted by multicollinearity, so I would apply a regularization method first.
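A minimal sketch of how the Lasso could be plugged into your existing caret setup (the lambda grid below is just an assumption; tune it to your data):

# Lasso = glmnet with alpha fixed at 1; the lambda grid is an arbitrary example
lassoGrid <- expand.grid(alpha = 1, lambda = 10^seq(-4, 0, length.out = 50))

lassoFit <- train(group ~ ., data = trainingSet, method = "glmnet",
                  metric = "ROC", trControl = methodCtrl, tuneGrid = lassoGrid)

# Non-zero coefficients at the selected lambda = the predictors the Lasso keeps
coef(lassoFit$finalModel, s = lassoFit$bestTune$lambda)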

3.) Getting the feature importance as you did is ok (even without using regularization methods), since it limits the complexity of the model. You have fewer features and a slightly lower prediction quality; that is not unusual and still within the range of possible outcomes, because you are no longer feeding every bit of trash (sometimes pure noise) into the prediction. Your models can still capture a lot of the variance although you have only 4 predictors instead of 13. So this step is ok, but:

4.) What about the quality insights from your confusion matrix? We cannot judge whether the model you develop should have, e.g., high precision and/or high recall. To put it more practically: is it acceptable for your application to have more false positives than false negatives? Only you can answer that.
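As a sketch, the class-wise metrics are already available in the confusionMatrix object, and caret can also tune directly against precision/recall via prSummary (the control object below is just an assumed setup; prSummary needs the MLmetrics package):

# Precision / recall / F1 for the positive class "F" from an existing confusion matrix
cm <- confusionMatrix(reference = factor(testSet$group), data = fitted,
                      mode = "everything", positive = "F")
cm$byClass[c("Precision", "Recall", "F1")]

# Optional: tune against precision/recall instead of ROC (assumed setup)
prCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                       classProbs = TRUE, summaryFunction = prSummary)
# rfPR <- train(group ~ ., data = trainingSet, method = "rf",
#               metric = "F", trControl = prCtrl)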

5.) Looking at your classifiers, I would recommend trying something from the gradient boosting family; see the sketch at the end of this answer. Most of the time they outperform your methods, at least in terms of prediction.

The paper below shows how the EBM (Explainable Boosting Machine) outperforms several other boosting techniques on ROC/AUC.

Paper: https://arxiv.org/abs/1909.09223

However, none of these methods can deal with multicollinearity when it comes to inference; the models are only immune to multicollinearity when it comes to prediction: https://datascience.stackexchange.com/questions/12554/does-xgboost-handle-multicollinearity-by-itself If you only care about prediction and you use the Lasso to filter out the best predictors first, then you are on the right track.
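A minimal sketch of plugging a gradient-boosting learner into your existing caret pipeline (assuming the xgboost package is installed; caret tunes its parameters over a default grid):

# Gradient boosting via caret's xgbTree method, same resampling setup as your other models
set.seed(355)
xgbFit <- train(group ~ ., data = trainingSet, method = "xgbTree",
                metric = "ROC", trControl = methodCtrl)

fitted <- predict(xgbFit, testSet)
confusionMatrix(reference = factor(testSet$group), data = fitted,
                mode = "everything", positive = "F")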
