Solved – Does the prior distribution affect SVM classification results?

bayesian, machine-learning, prior, supervised-learning, svm

I trained an SVM (RBF kernel, with $C$ and $\gamma$ optimized) on a dataset with a balanced class distribution (i.e., 50% positive, 50% negative samples). Testing the model on a corpus with an unbalanced class distribution (i.e., 1% positive) shows that it heavily overgenerates the positive class: about half of the test instances receive a positive label.
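
For concreteness, here is a minimal sketch that reproduces the effect with scikit-learn (the synthetic data and the fixed $C$ and $\gamma$ are illustrative stand-ins for my actual corpus and tuned values):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# One pool with fixed class-conditionals P(x|y); resampling it changes
# only the class prior P(y) between the train and test splits.
X, y = make_classification(n_samples=10000, n_features=20,
                           weights=[0.5, 0.5], class_sep=0.5, random_state=0)
pos = np.where(y == 1)[0]
neg = np.where(y == 0)[0]
train = np.concatenate([pos[:1000], neg[:1000]])          # 50% positive
test  = np.concatenate([pos[1000:1020], neg[1000:2980]])  # 1% positive

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X[train], y[train])
print("fraction predicted positive:", (clf.predict(X[test]) == 1).mean())
# Typically far above the true 1% positive rate of the test split.
```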

Does this mean that the model takes the prior class distribution into account? Can someone explain what is going on?

Best Answer

What you are observing is the effect of dataset shift. It occurs when the joint distribution $P(x,y)$ of inputs ($x$) and labels ($y$) differs between the training and test datasets; in such cases, a model trained on the training data performs poorly on the test data. Prior probability shift is the special case where only the marginal distribution over $y$ changes and everything else stays the same. It is a common issue in simple generative models.

The fundamental assumption of supervised learning is that the joint probability distribution $P(x,y)$ remains unchanged between training and testing. Since $P(x,y) = P(y \mid x)P(x) = P(x \mid y)P(y)$, the joint distribution can differ between the training and test datasets if any of these four factor distributions changes. In your case, $P(x \mid y)$ is presumably the same, but $P(y)$ has shifted from 50/50 to 99/1.
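
If the test-time prior $P'(y)$ is known or can be estimated, a standard remedy for pure prior probability shift is to reweight the model's posterior estimates by the prior ratio, $P'(y \mid x) \propto P(y \mid x)\,P'(y)/P(y)$, and renormalize. Here is a minimal sketch, assuming scikit-learn and synthetic data; an SVM fit with `probability=True` yields approximate posteriors via Platt scaling:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy setup: fixed P(x|y), only P(y) differs between the splits.
X, y = make_classification(n_samples=10000, n_features=20,
                           weights=[0.5, 0.5], class_sep=0.5, random_state=0)
pos = np.where(y == 1)[0]
neg = np.where(y == 0)[0]
train = np.concatenate([pos[:1000], neg[:1000]])          # 50% positive
test  = np.concatenate([pos[1000:1020], neg[1000:2980]])  # 1% positive

# probability=True enables Platt scaling, giving approximate posteriors.
clf = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)
clf.fit(X[train], y[train])

p_train = np.array([0.5, 0.5])    # training priors, ordered as clf.classes_
p_test  = np.array([0.99, 0.01])  # known/estimated test-time priors

# Reweight posteriors by the prior ratio and renormalize:
#   P'(y|x) ∝ P(y|x) * P'(y) / P(y)
post = clf.predict_proba(X[test])
adj = post * (p_test / p_train)
adj /= adj.sum(axis=1, keepdims=True)

print("positive rate, raw:     ", (clf.predict(X[test]) == 1).mean())
print("positive rate, adjusted:", (clf.classes_[adj.argmax(axis=1)] == 1).mean())
```

Note that this correction is exact only when the model's posteriors are well calibrated; Platt-scaled SVM probabilities are an approximation, so treat the adjusted rates as indicative.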

You can refer to this link, which was mentioned in an answer about handling unbalanced datasets.

You can learn more about dataset shift from this book, where the term was first used; page 12 covers prior probability shift.
