Bayesian – How Are Artificially Balanced Datasets Corrected For Unbalanced Classes?

Tags: bayesian, classification, dataset, posterior, unbalanced-classes

I came across the following in *Pattern Recognition and Machine Learning* by Christopher Bishop:

> A balanced data set in which we have selected equal numbers of examples from each of the classes would allow us to find a more accurate model. **However, we then have to compensate for the effects of our modifications to the training data.** Suppose we have used such a modified data set and found models for the posterior probabilities. From Bayes’ theorem, we see that the posterior probabilities are proportional to the prior probabilities, which we can interpret as the fractions of points in each class. We can therefore simply take the posterior probabilities obtained from our artificially balanced data set and first divide by the class fractions in that data set and then multiply by the class fractions in the population to which we wish to apply the model. Finally, we need to normalize to ensure that the new posterior probabilities sum to one.

I don't understand what the author intends to convey in the bold text above. I understand the need for balancing, but not how the "compensation for the modifications to the training data" is actually carried out.

Could someone please explain the compensation process in detail, and why it is needed – preferably with a numerical example to make things clearer? Thanks a lot!


P.S.
For readers who want a background on why a balanced dataset might be necessary:

> Consider our medical X-ray problem again, and suppose that we have collected a large number of X-ray images from the general population for use as training data in order to build an automated screening system. Because cancer is rare amongst the general population, we might find that, say, only 1 in every 1,000 examples corresponds to the presence of cancer. If we used such a data set to train an adaptive model, we could run into severe difficulties due to the small proportion of the cancer class. For instance, a classifier that assigned every point to the normal class would already achieve 99.9% accuracy and it would be difficult to avoid this trivial solution. Also, even a large data set will contain very few examples of X-ray images corresponding to cancer, and so the learning algorithm will not be exposed to a broad range of examples of such images and hence is not likely to generalize well.

Best Answer

I have practical experience with training classifiers from imbalanced training sets, and there are problems with this. In particular, the variances of the parameter estimates associated with the less frequent classes grow large: the more uneven the prior distribution in the training set, the more volatile your classifier outcomes become.

My best-practice solution, which works well for probabilistic classifiers, is to train on a completely balanced training set, i.e. one with roughly equally many examples of each class or category. A classifier trained on a balanced training set must afterwards be calibrated to the correct class distribution in the application domain, in your case a clinical setting. That is, you need to incorporate the skewed real-world prior distribution into the outcome probabilities of your classifier.

The following formula does precisely this by correcting for the lack of skewness in the training set:

$ \begin{split} &P_{corrected}(class=j \mid {\bf x}) = \\ &\frac{\frac{P_{corrected}(class=j)}{P_{balanced}(class=j)}\; P_{balanced}(class=j \mid {\bf x})}{\frac{P_{corrected}(class=j)}{P_{balanced}(class=j)}\; P_{balanced}(class=j \mid {\bf x}) + \frac{1-P_{corrected}(class=j)}{1-P_{balanced}(class=j)}\; \left(1- P_{balanced}(class=j \mid {\bf x}) \right) } \end{split} $

In the above formula, the following terms are used:

- $P_{balanced}(class=j)$ is the prior probability that outcome $j$ occurs in your balanced training set, e.g. the probability of 'No-Tumor', which would be around $0.5$ in a two-class situation, around $0.33$ in a three-class classification domain, etc.
- $P_{corrected}(class=j)$ is the prior probability that outcome $j$ occurs in your real-world domain, e.g. the true probability of 'Tumor' in your clinical setting.
- $P_{balanced}(class=j \mid {\bf x})$ is the outcome probability (the posterior probability) of your classifier trained on the balanced training set.
- $P_{corrected}(class=j \mid {\bf x})$ is the outcome probability (the posterior probability) of your classifier, correctly adjusted to the clinical setting.
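As a minimal sketch in Python (function and argument names are my own, not part of the answer), the two-class correction formula can be implemented directly:

```python
def correct_posterior(p_balanced_posterior, p_corrected_prior, p_balanced_prior=0.5):
    """Re-weight a posterior from a classifier trained on a balanced training set
    to the class priors of the application domain (two-class case)."""
    # divide by the balanced-training-set prior, multiply by the real-world prior
    num = (p_corrected_prior / p_balanced_prior) * p_balanced_posterior
    # the same re-weighting for the grouped "other" class (class != j)
    other = ((1 - p_corrected_prior) / (1 - p_balanced_prior)) * (1 - p_balanced_posterior)
    # normalize so the two corrected posteriors sum to one
    return num / (num + other)
```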

Example
Correcting the posterior probability from a classifier trained on a balanced training set to a domain-applicable posterior probability. We convert to a situation where 'cancer' occurs in only 1% of the images presented to our classifier software, and where the balanced classifier outputs $P_{balanced}(cancer \mid {\bf x}) = 0.81$ for the image at hand:

$ P_{corrected}(cancer \mid {\bf x}) = \frac{\frac{0.01}{0.5}\; 0.81}{\frac{0.01}{0.5}\; 0.81 + \frac{1-0.01}{1-0.5}\; \left(1- 0.81 \right)} = 0.04128 $
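The hypothetical helper sketched above reproduces this number:

```python
# balanced-classifier posterior 0.81, real-world 'cancer' prior 0.01, balanced prior 0.5
print(correct_posterior(0.81, 0.01))   # prints ~0.0413 (the 0.04128 computed above)
```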

Derivation of correction formula

We use a capital $P$ to denote a probability (prior or posterior) and a small letter $p$ to indicate a probability density. In image processing, the pixel values are usually assumed to approximately follow a continuous distribution. Hence, the Bayes classifier is calculated using probability densities.

Bayes formula (for any probabilistic classifier)

$ P(class=j \mid {\bf x}) = \frac{P(class=j) \; p({\bf x} \; \mid \; class=j)} {P(class=j) \; p({\bf x} \; \mid \; class=j) + P(class \neq j) \; p({\bf x} \; \mid \; class \neq j)} $

where all classes other than $j$ are grouped together ($class \neq j$).

From the general Bayes formula it follows, after rearrangement, that

$ p({\bf x} \mid class=j) = \frac{P(class=j \; \mid \; {\bf x}) \; p({\bf x})} {P(class=j)} $

where $p({\bf x})$ is the marginal probability density of ${\bf x}$ over all classes (the sum of the class-conditional densities, each multiplied by the corresponding prior).

We now calculate the corrected posterior probability (with a prime) from Bayes formula

$ \begin{split} &P'(class=j \; \mid \; {\bf x}) = \\ &\; \; \; \; \frac{P'(class=j) \; \frac{P(class=j \; \mid \; {\bf x}) \; p({\bf x})} {P(class=j)} }{ P'(class=j) \; \frac{P(class=j \; \mid \; {\bf x})\; p({\bf x})} {P(class=j) } + P'(class \neq j) \; \frac{ P(class \neq j \; \mid \; {\bf x}) \; p({\bf x})} {P(class \neq j)}} \end{split} $

where $P'(class=j)$ is the prior in the skewed setting (i.e. corrected) and $P'(class=j \; \mid \; {\bf x})$ the corrected posterior. The inner fractions in the equation above are simply the conditional densities $p({\bf x} \mid class=j)$ and $p({\bf x} \mid class \neq j)$.

The common factor $p({\bf x})$ cancels, and the equation simplifies to the following:

$ \begin{split} &P'(class=j \mid {\bf x}) = \\ &\; \; \; \; \frac{\frac{P'(class=j)}{P(class=j)} \; P(class=j \; \mid \; {\bf x})} {\frac{P'(class=j)}{P(class=j)} \; P(class=j \; \mid \; {\bf x}) + \frac{P'(class \neq j)}{P(class \neq j)} \; P(class \neq j \; \mid \; {\bf x})} \end{split} $

Q.E.D.

This correction formula applies to $2, 3, \ldots, n$ classes.
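Although only the two-class form is written out above, the same reweight-and-normalize step (divide each posterior by its training-set prior, multiply by the domain prior, then renormalize, exactly as the Bishop excerpt describes) generalizes to $n$ classes as:

$ P'(class=j \mid {\bf x}) = \frac{\frac{P'(class=j)}{P(class=j)}\; P(class=j \mid {\bf x})}{\sum_{k=1}^{n} \frac{P'(class=k)}{P(class=k)}\; P(class=k \mid {\bf x})} $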

Application

You can apply this formula to probabilities from discriminant analysis, sigmoid feed-forward neural networks, and probabilistic random forest classifiers. Essentially, any type of classifier that produces posterior probability estimates can be adapted to an arbitrarily uneven prior distribution after successful training.
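As a sketch of what that adaptation might look like in practice, assuming the classifier exposes its class probabilities as a row-per-sample array (as, for example, scikit-learn's `predict_proba` does), the multi-class reweighting is a few lines of NumPy; the function and variable names here are mine:

```python
import numpy as np

def reweight_posteriors(proba, train_priors, domain_priors):
    """Adjust an (n_samples, n_classes) array of posterior probabilities from the
    training-set class priors to the application-domain class priors."""
    proba = np.asarray(proba, dtype=float)
    # per-class ratio: domain prior divided by training-set prior
    ratio = np.asarray(domain_priors, dtype=float) / np.asarray(train_priors, dtype=float)
    unnormalized = proba * ratio
    # renormalize each row so the corrected posteriors sum to one
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

# example: balanced two-class training (0.5/0.5), 1% cancer prevalence in the clinic
probs = [[0.19, 0.81]]                    # columns: ['no-cancer', 'cancer']
print(reweight_posteriors(probs, [0.5, 0.5], [0.99, 0.01]))   # cancer column ~0.0413
```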

A final word on training: many learning algorithms have difficulty training well from uneven training sets. This certainly holds for back-propagation applied to multi-layer perceptrons.