Solved – Cross-validation with unbalanced-classes

cross-validation, unbalanced-classes, weka

I'm a little confused about how to manage my data set with WEKA (for data mining).

I have a data set of 11377 records, classified as follows:

  • 11111 records have class YES
  • 266 records have class NO

The classes are unbalanced, and if I start the classification process with WEKA as is, the results will be poor.
I want to use 10-fold cross-validation to classify the data
with the J48 tree algorithm, but do I first need to oversample my minority class? How can I prevent overfitting?

I would like to know how I should handle this situation to get a good analysis.
Thanks in advance!

Best Answer

Start by making sure that you perform the oversampling in the correct portion of your analysis. First, create a validation set that is not involved in training at all by randomly pulling out some percentage of your data. Make sure that this random sample has the same proportion of YES and NO as the full data set. Then use only the remaining data (do NOT use the data in the validation set) to perform your oversampling and analysis.
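The order of operations above can be sketched in a few lines of plain Python. This is only an illustration, not WEKA code: the function names `stratified_holdout` and `oversample_minority`, the 20% holdout fraction, and the simple duplicate-with-replacement oversampling are all assumptions made for the example. The class counts are the ones from the question.

```python
import random
from collections import Counter


def stratified_holdout(labels, holdout_fraction=0.2, seed=0):
    """Split row indices into (train, validation), keeping the class ratio in both."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, valid_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_valid = round(len(idx) * holdout_fraction)  # same fraction from every class
        valid_idx.extend(idx[:n_valid])
        train_idx.extend(idx[n_valid:])
    return train_idx, valid_idx


def oversample_minority(train_idx, labels, seed=0):
    """Duplicate rows of the smaller classes (with replacement) up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    return balanced


# Class counts from the question: 11111 YES, 266 NO.
labels = ["YES"] * 11111 + ["NO"] * 266

# 1) Pull out the validation set first; 2) oversample only what remains.
train_idx, valid_idx = stratified_holdout(labels)
balanced_train_idx = oversample_minority(train_idx, labels)

valid_counts = Counter(labels[i] for i in valid_idx)
train_counts = Counter(labels[i] for i in balanced_train_idx)
print("validation:", dict(valid_counts))
print("training after oversampling:", dict(train_counts))
```

Because the validation indices are set aside before any duplication happens, no copy of a validation record can end up in the training data, and the validation set keeps the original YES/NO ratio.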

Your randomly selected validation set is then used to determine the performance of your model. Check the sensitivity and specificity of your results. If you want to get a general idea of how much overfitting there is, you can see how the model performs against the training set and then compare that to how it performs against the validation set. The validation set gives an estimate of your anticipated real-world performance.
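To see why per-class metrics matter more than plain accuracy here, consider a minimal Python sketch. The helper name `sensitivity_specificity` is made up for the example, and treating the rare NO class as the "positive" class is an assumption. The example scores a trivial model that always predicts YES on the class counts from the question.

```python
def sensitivity_specificity(actual, predicted, positive="NO"):
    """Return (sensitivity, specificity), treating `positive` as the class of interest."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Class counts from the question, scored against a model that always says YES.
actual = ["YES"] * 11111 + ["NO"] * 266
predicted = ["YES"] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
sens, spec = sensitivity_specificity(actual, predicted)
print(f"accuracy={accuracy:.4f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

The always-YES model is right on about 97.7% of the records yet never finds a single NO, which the sensitivity of 0 makes obvious. Calling the same function once on the training-set predictions and once on the validation-set predictions gives the two sets of numbers to compare when looking for overfitting.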

Essentially, I'm saying that you can check for overfitting by keeping a validation set separate and comparing the model's performance on it against its performance on your training set. To actually prevent overfitting, you'll have to review your data and do feature selection carefully.
