I have a highly imbalanced data set, and I have used the SMOTE algorithm (from the R package DMwR) to balance its binary class. I then used the R ada package to train an AdaBoost model on this data set to predict the binary class, with very good results.
In the same data set, I have another class variable which has multiple values (6 in total). In this case I realise that I can't use the AdaBoost algorithm as implemented in the ada package as it only deals with the binary case.
I therefore have 2 problems:
- I'd like to use the SMOTE algorithm on the second class variable, but this also only works with binary classes. Is there an algorithm or package I can use in R to "rebalance" a data set based on a class with multiple values, in a similar way to SMOTE?
- I'd like to use a classifier to predict the multi-valued class variable. I have tried the one-vs-all approach with AdaBoost, but I cannot get it to work well (my approach is below). Boosting seems to work well with this data set. Are there any other boosting algorithms or other approaches I could use in R that handle classes with multiple values? I have tried using Random Forest, but one of my nominal inputs has too many discrete values for it.
Approach for AdaBoost one-vs-all
- Build a vector with a binary variable for each discrete class value
- Train one AdaBoost model against each binary class vector
- Generate probability prediction for each AdaBoost model
- Select the class with the highest probability
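For concreteness, the four steps above can be sketched in base R roughly as follows. This is only an illustration, not my actual code: the helper names (`one_vs_all_fit`, `one_vs_all_predict`, `glm_fit`, `glm_prob`) are made up for the example, `iris` stands in for my data, and a logistic regression via `glm` stands in for the AdaBoost learner so that the sketch runs without extra packages. An ada fit and its probability prediction could be swapped in for the two stand-in functions.

```r
# One-vs-all wrapper around any binary learner (hypothetical helper names).
# fit_fun(x, y01) returns a fitted model; prob_fun(model, newx) returns P(y = 1).
one_vs_all_fit <- function(x, y, fit_fun) {
  y <- factor(y)
  # Steps 1 and 2: build a 0/1 target per class value and train one model on each
  models <- lapply(levels(y), function(cl) fit_fun(x, as.integer(y == cl)))
  names(models) <- levels(y)
  models
}

one_vs_all_predict <- function(models, newx, prob_fun) {
  # Step 3: one column of "this class vs the rest" probabilities per model
  probs <- sapply(models, function(m) prob_fun(m, newx))
  probs <- matrix(probs, nrow = nrow(newx),
                  dimnames = list(NULL, names(models)))
  # Step 4: pick the class whose model gives the highest probability
  best <- max.col(probs, ties.method = "first")
  pred <- factor(colnames(probs)[best], levels = colnames(probs))
  list(probs = probs, class = pred)
}

# Stand-in binary learner (assumption: logistic regression instead of AdaBoost).
glm_fit <- function(x, y01) {
  # Warnings about perfectly separated classes are suppressed for the demo
  suppressWarnings(
    glm(y01 ~ ., data = cbind(x, y01 = y01), family = binomial())
  )
}
glm_prob <- function(model, newx) {
  as.numeric(predict(model, newdata = newx, type = "response"))
}

x <- iris[, 1:4]
models <- one_vs_all_fit(x, iris$Species, glm_fit)
result <- one_vs_all_predict(models, x, glm_prob)
print(table(predicted = result$class, actual = iris$Species))
```

Note that the per-class probabilities come from separately trained models, so they are not calibrated against each other and need not sum to 1 across classes; that may be part of why picking the maximum does not work well for me.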
Many thanks
Best Answer
You can use the maboost package in R, which implements multiclass boosting. Its algorithm is, in a sense, a generalization of AdaBoost.MM. It solves the multiclass problem directly, without reducing it to a set of binary classification problems, so it may be appropriate for your application.
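A minimal usage sketch follows, assuming the package is installed and that its formula interface and `predict` method behave as I recall from its documentation (`maboost(formula, data, iter = ...)` and `predict(..., type = "class")`); please check the argument names against the package manual. The `iris` data is only a stand-in for your six-class variable.

```r
# Guarded so the script still runs where maboost is not installed.
if (requireNamespace("maboost", quietly = TRUE)) {
  library(maboost)

  # Fit a multiclass boosted model directly on the 3-level factor
  # (iter = 50 is an arbitrary choice for the illustration)
  fit <- maboost(Species ~ ., data = iris, iter = 50)

  # Predicted class labels on the training data
  pred <- predict(fit, iris, type = "class")
  print(table(predicted = pred, actual = iris$Species))
} else {
  message("maboost is not installed; skipping the example")
}
```

In practice you would evaluate on held-out data rather than the training set, as the confusion table above is optimistic.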