Solved – maximizing KL divergence as the objective function

kullback-leibler · machine-learning · mathematical-statistics · neural-networks

As far as I know, the most common approach to training neural networks is to minimize the KL divergence between the data distribution and the model's output distribution, which amounts to minimizing the cross-entropy. Now suppose we have a binary classification task, and our goal is instead to optimize a network so that the KL divergence between the two class distributions is maximized. Is there any fundamental difference between these two approaches? And can the latter be optimized explicitly with respect to the parameters of a neural network?
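
For concreteness, here is a minimal numpy sketch of what I mean by the first approach (the two distributions below are arbitrary made-up values, not from any dataset): minimizing cross-entropy over the model distribution $q$ is the same as minimizing $KL(p||q)$, since the two differ by the entropy of $p$, which does not depend on the model.

```python
import numpy as np

# Illustrative discrete distributions: p plays the role of the data
# distribution, q the model's predicted distribution (arbitrary values).
p = np.array([0.7, 0.3])
q = np.array([0.6, 0.4])

cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
entropy = -np.sum(p * np.log(p))         # H(p), independent of the model
kl = np.sum(p * np.log(p / q))           # KL(p || q)

# H(p, q) = H(p) + KL(p || q), so minimizing cross-entropy over q
# is equivalent to minimizing KL(p || q).
print(cross_entropy, entropy + kl)       # both ≈ 0.632
```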

Best Answer

The difference is that the minimum of $KL(p||q)$ is attained (it is zero exactly when $p = q$), whereas the maximum may not exist (or make sense in the usual way).

KL divergence is jointly convex in its arguments. A convex function on a compact convex set attains its minimum, and there is a whole branch of mathematics, convex optimization, devoted to finding such minima.
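
In particular, the fact that the minimum value is zero exactly at $p = q$ is Gibbs' inequality, which follows from Jensen's inequality applied to the concave logarithm:
$$KL(p||q) = -\sum_x p(x)\log\frac{q(x)}{p(x)} \ge -\log\sum_x p(x)\frac{q(x)}{p(x)} = -\log\sum_x q(x) = 0,$$
with equality if and only if $q(x)/p(x)$ is constant, i.e. $p = q$.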

The same does not hold for maxima. For the KL divergence, let $$p \sim \mathrm{Bernoulli}(1), \qquad q_t \sim \mathrm{Bernoulli}(t).$$ Then $$KL(p||q_t) = 1 \cdot \log\frac{1}{t} + 0 \cdot \log\frac{0}{1-t} = \log\frac{1}{t},$$ so $$\lim_{t \to 0^+} KL(p||q_t) = \infty.$$
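
A quick numerical check of this example (a small Python sketch; the particular values of $t$ are arbitrary):

```python
import numpy as np

# KL(Bernoulli(1) || Bernoulli(t)) = 1*log(1/t) + 0*log(0/(1-t)) = -log(t),
# with the 0*log(0/...) term taken to be 0 by convention.
for t in [0.5, 0.1, 1e-3, 1e-6, 1e-12]:
    kl = -np.log(t)
    print(f"t = {t:g}\tKL = {kl:.3f}")

# The divergence grows without bound as t -> 0, so there is no finite
# maximizer: "maximize the KL divergence" is ill-posed without constraints.
```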

By the way, another way to look at it: KL divergence behaves like a kind of distance. Minimizing a distance usually makes sense, but maximizing it may not if you impose no constraints, because the space in which you measure the distance may be unbounded, so distances can be made arbitrarily large.