Solved – Do discriminative models overfit more than generative models

generative-models, machine-learning

In an interview, the interviewer said that discriminative models tend to overfit more than generative models because they solve a more complex problem and hence consume more resources (or parameters) doing so. For definitions of discriminative and generative models, see this post: Generative vs. discriminative.

I thought more parameters would lead to more overfitting, but here the logic is reversed. Any ideas?

Best Answer

This is a fun question, as it provides good context for why the often-used heuristic that more parameters $\implies$ more risk of overfitting is just that: a heuristic. To ground the discussion, let's consider what is in some sense the simplest problem, binary classification. As a specific example we will take the canonical generative-discriminative pair: naive Bayes and logistic regression, respectively. It is important that we look at corresponding pairs in this way; otherwise anything we have to say is going to be useless. We could certainly come up with extraordinarily flexible generative models (imagine replacing the factored conditional $p(x|y)$ in naive Bayes with something like delta functions) which will essentially always overfit.
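
As a concrete (purely illustrative) setup for this pair, here is a minimal sketch that fits both models on synthetic binary-feature data using scikit-learn; the data-generating process and all settings are my own arbitrary choices, not something from the original discussion.

```python
# Minimal sketch: the canonical generative/discriminative pair on synthetic
# binary-feature data (illustrative only; data and settings are arbitrary).
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 20

# Generative story: pick a class, then draw each binary feature independently
# given the class -- exactly the factored conditional p(x|y) of naive Bayes.
y = rng.integers(0, 2, size=n)
theta = np.where(y[:, None] == 1, 0.7, 0.3)           # per-class Bernoulli means
X = (rng.random((n, d)) < theta).astype(float)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

nb = BernoulliNB().fit(X_tr, y_tr)                       # generative model
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # discriminative model

print("naive Bayes         test accuracy:", nb.score(X_te, y_te))
print("logistic regression test accuracy:", lr.score(X_te, y_te))
```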

First we should define what we mean by overfitting. One useful definition involves deriving tight probabilistic bounds on generalization error based on the training set error and the type of classifiers we're using. In this case a relevant parameter is the VC dimension of the hypothesis class $H$ (the set of classifiers being used). Put simply, the VC dimension of a hypothesis class is the size of the largest set of examples such that for any possible labeling of that set, there exists an $h \in H$ which labels them in exactly that way. Given $m$ examples in a binary classification setting, there are $2^m$ ways to label them. If there exists some set of $m$ examples such that for each of the $2^m$ possible labelings there is some $h \in H$ that labels them correctly, we can conclude $\text{VCdim}(H) \ge m$; in that case we say the set of $m$ points is shattered by $H$.
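
To make the definition concrete, here is a small brute-force sketch of my own (assuming NumPy and SciPy are available) that checks whether a given point set is shattered by hyperplanes: for each of the $2^m$ labelings it poses linear separability as a linear-programming feasibility problem.

```python
# Brute-force shattering check for linear classifiers (hyperplanes with bias).
# A set of points is shattered iff every one of the 2^m labelings is linearly
# separable; separability is posed as an LP feasibility problem.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Is there (w, b) with y_i * (w . x_i + b) >= 1 for all i?"""
    m, d = X.shape
    # Variables: [w_1 .. w_d, b].  Constraints: -y_i * (w . x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(m),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    """Can a hyperplane realize all 2^m labelings of the rows of X?"""
    m = X.shape[0]
    return all(linearly_separable(X, np.array(labels))
               for labels in product([-1, 1], repeat=m))

# Three points in general position in the plane are shattered, so the VC
# dimension of lines in 2D is at least 3.
print(shattered(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])))  # True
```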

It turns out that naive Bayes and logistic regression have the same VC dimension, since both classify examples in $n$ dimensions using a hyperplane. The VC dimension of hyperplane classifiers in $n$ dimensions is $n+1$. You can show this by upper-bounding the VC dimension using Radon's theorem and then exhibiting a set of $n+1$ points which is shattered. Furthermore, in this case our intuition from low-dimensional examples generalizes to higher dimensions, and we can make pretty pictures.
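
As a sanity check on the claim that naive Bayes classifies with a hyperplane, the following sketch (my own, assuming scikit-learn's BernoulliNB parameterization with its default smoothing) recovers the weight vector and bias implied by the fitted class-conditional parameters and confirms that they reproduce the model's predictions.

```python
# Sketch: naive Bayes with Bernoulli features is itself a linear classifier.
# We recover the hyperplane (w, b) implied by the fitted class-conditional
# parameters and check that sign(w . x + b) matches the model's predictions.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 10)).astype(float)
y = rng.integers(0, 2, size=500)

nb = BernoulliNB().fit(X, y)

log_theta = nb.feature_log_prob_               # log P(x_j = 1 | y)
log_1m_theta = np.log1p(-np.exp(log_theta))    # log P(x_j = 0 | y)

# The log-odds of class 1 vs class 0 is affine in x:
w = (log_theta[1] - log_1m_theta[1]) - (log_theta[0] - log_1m_theta[0])
b = (nb.class_log_prior_[1] - nb.class_log_prior_[0]
     + (log_1m_theta[1] - log_1m_theta[0]).sum())

assert np.array_equal((X @ w + b > 0).astype(int), nb.predict(X))
```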

[Figure: 4 points in 2 dimensions can't be shattered with a line.]
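
For instance, one labeling no line can realize is the XOR labeling of the four corners of the unit square, and a single impossible labeling is enough to show the set is not shattered. A quick self-contained check of my own, again via an LP feasibility test:

```python
# Sketch: the XOR labeling of the four corners of the unit square is not
# linearly separable, so this 4-point set is not shattered by lines.
import numpy as np
from scipy.optimize import linprog

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, -1])             # XOR labeling

# Feasibility LP: find (w, b) with y_i * (w . x_i + b) >= 1 for all i.
A_ub = -y[:, None] * np.hstack([X, np.ones((4, 1))])
res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(4),
              bounds=[(None, None)] * 3)
print("XOR labeling separable?", res.success)   # expected: False
```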

In this sense, both are equally likely to overfit, because we will generally get similar bounds on generalization error. This also explains why 1-nearest-neighbor is the king of overfitting: it has infinite VC dimension because it shatters every set of (distinct) examples. This isn't the end of the story, though.
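
A quick illustration of the 1-nearest-neighbor point (dataset and noise level are my own arbitrary choices): it memorizes the training set, so training accuracy is perfect, while test accuracy on noisy data lags well behind.

```python
# Sketch: 1-nearest-neighbor memorizes the training set (zero training error)
# but can generalize poorly on noisy data -- the "king of overfitting".
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("train accuracy:", knn.score(X_tr, y_tr))   # 1.0: the train set is memorized
print("test accuracy: ", knn.score(X_te, y_te))   # noticeably lower
```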

Another useful way to formally define overfitting is based on obtaining generalization bounds relative to the best predictor in the hypothesis class, i.e., the hypothesis we would all agree is best if we had an infinite number of examples. Here, things look different for the naive Bayes/logistic regression comparison. The first thing to note is that, because of the way naive Bayes is parameterized (its parameters must specify valid conditional distributions, sum to 1, etc.), it is not guaranteed to converge on the optimal linear classifier even given an infinite number of examples; logistic regression, on the other hand, will.

So we can conclude that, in fact, the set of all naive Bayes classifiers is a proper subset of the set of all logistic regression classifiers. This provides some evidence that naive Bayes classifiers may indeed be less prone to overfitting in this sense, simply because they are less powerful/more constrained. And this is the case: if we are performing classification in $n$-dimensional space, naive Bayes requires $O(\log n)$ samples to converge with high probability to the best naive Bayes classifier, whereas logistic regression requires $O(n)$ samples. For a reference for this result, see On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes by Ng and Jordan (NIPS 2001).
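
A rough sketch of the kind of experiment behind that result (my own illustrative setup; the data here are actually generated from a naive Bayes model, so the qualitative picture will vary on other data): train both models on increasing sample sizes and watch naive Bayes approach its asymptote with far fewer samples, with logistic regression catching up as the training set grows.

```python
# Sketch of a Ng & Jordan-style comparison: naive Bayes vs logistic regression
# as the training set grows. Typically the generative model nears its asymptote
# with fewer samples; the discriminative model needs more data to catch up.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n_test = 100, 5000

def sample(n):
    # Data drawn from a naive Bayes model with mildly informative features.
    y = rng.integers(0, 2, size=n)
    theta = np.where(y[:, None] == 1, 0.55, 0.45)
    return (rng.random((n, d)) < theta).astype(float), y

X_te, y_te = sample(n_test)
for n_train in [20, 50, 100, 500, 2000]:
    X_tr, y_tr = sample(n_train)
    nb = BernoulliNB().fit(X_tr, y_tr)
    lr = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    print(f"n_train={n_train:5d}  NB={nb.score(X_te, y_te):.3f}  "
          f"LR={lr.score(X_te, y_te):.3f}")
```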