[Math] minimum kullback leibler estimator

estimation-theory, statistical-inference, statistics

Suppose that one has independent and identically distributed samples $x_i, i=1,\dots,n$ from some unknown density, and one wants to fit a probability distribution $f_\theta(x)$, where $\theta$ is a (finite-dimensional) parameter, e.g. $\theta \in \mathbb{R}$, to that data. It seems that one way to estimate $\theta$ could be to minimize the Kullback-Leibler divergence between an estimate of the pdf computed from the data $x_i$ (using e.g. some kernel estimator) and the model $f_\theta$. I know that KL is used in many problems as a distance measure between two probability distributions (though it is not actually a distance, since it is not symmetric), but I don't think I have seen any theory about an estimator that is defined as a minimizer of this divergence.

Also there is a connection to maximum likelihood estimation, since if $p(x)$ is the pdf computed from the data we have \begin{align} KL(p \,\|\, f_\theta) &= \int p(x) \log\frac{p(x)}{f_\theta(x)}\, dx = \int p(x) \log p(x)\, dx - \int p(x) \log f_\theta(x)\, dx \\ &= -H(p) - E_p(\log f_\theta(x)),\end{align}
where the first term ($-H(p)$, the negative entropy of $p$) does not depend on $\theta$, and the second term can be approximated by $\frac{1}{n}\sum_{i=1}^n \log f_\theta(x_i)$. So maximizing that average log-likelihood (i.e. the ML estimator) approximately minimizes the KL divergence.
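The approximation above can be checked numerically. The sketch below (a toy illustration, assuming a Gaussian model family $f_\theta$ with $\theta = (\mu, \sigma)$; the grid bounds and sample sizes are arbitrary choices) maximizes the Monte Carlo estimate of $E_p[\log f_\theta(X)]$ over a parameter grid and compares the result to the closed-form Gaussian MLE, the sample mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # i.i.d. samples from the unknown density

def avg_loglik(mu, sigma, x):
    """Monte Carlo approximation (1/n) sum log f_theta(x_i) of E_p[log f_theta(X)]
    for a Gaussian model f_theta with theta = (mu, sigma)."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2 * sigma**2))

# Crude grid search over theta, purely for illustration.
mus = np.linspace(0.0, 4.0, 81)
sigmas = np.linspace(0.5, 3.0, 51)
scores = np.array([[avg_loglik(m, s, x) for s in sigmas] for m in mus])
i, j = np.unravel_index(scores.argmax(), scores.shape)

# The maximizer should sit next to the sample mean and (biased) sample std,
# i.e. the Gaussian MLE -- illustrating the min-KL / ML equivalence.
print(mus[i], sigmas[j])
print(x.mean(), x.std())
```

Since the entropy term is constant in $\theta$, the grid maximizer of the average log-likelihood is (up to grid resolution) exactly the minimizer of the approximated KL divergence.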

So my question is: is there any general theory about such an estimator, and are there some (not overly uncommon) applications where this method is used? Or does such an estimator typically have not-so-nice properties, or is it simply superseded by other estimation methods (maximum likelihood, (generalized) method of moments, etc.) that are easier to apply? I also know there is some theory about minimum Hellinger distance estimation, which seems to have nice efficiency and robustness properties, but I would guess that is not the case with KL?

Best Answer

It is known that min-KL and ML coincide in the full exponential family. See, for example, here. In these cases, all ML theory applies.

In other cases, min-KL can still be seen as an M-estimator (a.k.a. empirical risk minimizer), in which case M-estimation theory applies.
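The plug-in version asked about in the question can be sketched directly: estimate $p$ with a kernel density estimator, then minimize the $\theta$-dependent part of $KL(\hat p \,\|\, f_\theta)$, the cross-entropy $-\int \hat p(t)\log f_\theta(t)\,dt$, numerically. This is a minimal illustration, assuming a Gaussian model family and using quadrature on an ad hoc grid; the integration bounds and starting point are arbitrary choices:

```python
import numpy as np
from scipy import stats, optimize, integrate

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=2_000)

# Plug-in density estimate p_hat from the data (Gaussian KDE).
p_hat = stats.gaussian_kde(x)

def neg_cross_entropy(theta):
    """-integral p_hat(t) log f_theta(t) dt by trapezoidal quadrature;
    equals KL(p_hat || f_theta) up to the theta-free entropy term."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)  # parametrize by log(sigma) so sigma stays positive
    grid = np.linspace(x.min() - 3.0, x.max() + 3.0, 2_000)
    logf = stats.norm.logpdf(grid, mu, sigma)
    return -integrate.trapezoid(p_hat(grid) * logf, grid)

res = optimize.minimize(neg_cross_entropy, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# Should land close to the ML estimates (sample mean / std); the KDE bandwidth
# inflates sigma_hat slightly, a small smoothing bias of the plug-in approach.
print(mu_hat, sigma_hat)
```

Note that this is literally an M-estimator: the estimate minimizes an empirical risk (the plug-in cross-entropy) over $\theta$, so consistency and asymptotic normality can be studied with standard M-estimation tools.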