Decision Tree – How to Train a Decision Tree Against Unbalanced Data?

Tags: accuracy, cart, classification, unbalanced-classes

I'm new to data mining, and I'm trying to train a decision tree on a data set that is highly unbalanced. However, the model's predictive accuracy is poor.

The data consists of students studying courses, and the class variable is the course status, which has two values: Withdrawn or Current. The attributes are:

  • Age
  • Ethnicity
  • Gender
  • Course
  • Course Status

In the data set there are many more Current instances than Withdrawn ones; Withdrawn instances account for only 2% of the total.

I want to build a model that can predict the probability that a student will withdraw in the future. However, when I test the model against the training data, its accuracy is terrible.

I've had similar issues with decision trees where the data is dominated by one or two classes.

What approach can I use to solve this problem and build a more accurate classifier?

Best Answer

This is an interesting and very frequent problem in classification - not just in decision trees but in virtually all classification algorithms.

As you found empirically, a training set consisting of different numbers of representatives from either class may result in a classifier that is biased towards the majority class. When applied to a test set that is similarly imbalanced, this classifier yields an optimistic accuracy estimate. In an extreme case, the classifier might assign every single test case to the majority class, thereby achieving an accuracy equal to the proportion of test cases belonging to the majority class. This is a well-known phenomenon in binary classification (and it extends naturally to multi-class settings).

This is an important issue, because an imbalanced dataset may lead to inflated performance estimates. This in turn may lead to false conclusions about the significance with which the algorithm has performed better than chance.
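To make the extreme case above concrete, here is a minimal sketch. The 98/2 split mirrors the proportions in your question; the labels are otherwise made up:

```python
import numpy as np

# Hypothetical test set mirroring the question's class proportions:
# 980 "Current" (label 0) and 20 "Withdrawn" (label 1).
y_true = np.array([0] * 980 + [1] * 20)

# A degenerate classifier that assigns every case to the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
print(accuracy)  # 0.98 -- impressive-looking, yet the classifier is useless
```

The 98% accuracy here is entirely an artifact of the imbalance: the classifier has learned nothing about the minority class.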

The machine-learning literature on this topic has essentially developed three solution strategies.

  1. You can restore balance in the training set by undersampling the majority class or by oversampling the minority class, preventing the bias from arising in the first place.

  2. Alternatively, you can modify the costs of misclassification, as noted in a previous response, again to prevent bias.

  3. An additional safeguard is to replace the accuracy with the so-called balanced accuracy, defined as the arithmetic mean of the class-specific accuracies, $\phi := \frac{1}{2}\left(\pi^+ + \pi^-\right),$ where $\pi^+$ and $\pi^-$ denote the accuracy obtained on positive and negative examples, respectively. If the classifier performs equally well on both classes, this reduces to the conventional accuracy (the number of correct predictions divided by the total number of predictions). If, in contrast, the conventional accuracy is above chance only because the classifier exploits an imbalanced test set, then the balanced accuracy will drop to chance, as it should (see sketch below).

[Figure: accuracy vs. balanced accuracy]
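As an illustration, the balanced accuracy can be computed directly from the definition above. Applied to a majority-class-only classifier on a 98/2 split, it drops to chance level (the function and data below are my own, for illustration):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of the two class-specific accuracies, pi+ and pi-."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc_pos = (y_pred[y_true == 1] == 1).mean()  # pi^+ : accuracy on positives
    acc_neg = (y_pred[y_true == 0] == 0).mean()  # pi^- : accuracy on negatives
    return 0.5 * (acc_pos + acc_neg)

# A classifier that always predicts the majority class scores 98%
# conventional accuracy on a 98/2 split, but only chance-level
# balanced accuracy:
y_true = [0] * 980 + [1] * 20
y_pred = [0] * 1000
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

scikit-learn ships the same metric as `sklearn.metrics.balanced_accuracy_score`, if you prefer not to roll your own.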

I would recommend considering at least two of the above approaches in conjunction. For example, you could oversample your minority class to prevent your classifier from acquiring a bias in favour of the majority class. Then, when evaluating the performance of your classifier, you could replace the accuracy with the balanced accuracy. The two approaches are complementary: applied together, they should help you both prevent the original problem and avoid false conclusions following from it.
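A sketch of that combination, using scikit-learn's CART implementation on synthetic stand-in data (the features and the 2% positive rate are invented for illustration; your student attributes would replace them):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the student data: ~2% positive ("Withdrawn") class.
X = rng.normal(size=(2000, 4))
y = (rng.random(2000) < 0.02).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Strategy 1: oversample the minority class in the training set
# by resampling it with replacement until the classes are balanced.
minority = np.where(y_train == 1)[0]
majority = np.where(y_train == 0)[0]
upsampled = rng.choice(minority, size=len(majority), replace=True)
idx = np.concatenate([majority, upsampled])

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(X_train[idx], y_train[idx])

# Strategy 3: evaluate with balanced accuracy rather than plain accuracy.
print(balanced_accuracy_score(y_test, clf.predict(X_test)))
```

If you would rather pursue strategy 2 (misclassification costs) instead of resampling, `DecisionTreeClassifier(class_weight="balanced")` reweights the classes inversely to their frequencies and achieves a similar effect without duplicating any rows.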

I would be happy to post some additional references to the literature if you would like to follow up on this.
