Kullback–Leibler divergence

Tags: kullback-leibler, machine-learning, maximum-likelihood, self-study

I am watching this great lecture by Nando De Freitas.

He establishes the KL divergence using maximum likelihood estimation.

However, there is one step I don't really understand.
[Screenshot from the lecture: the step showing that maximizing the likelihood is equivalent to minimizing the KL divergence between $P(x|\theta_0)$ and $P(x|\theta)$.]

I do understand the steps from a math standpoint. I just wonder why he wants to measure the similarity between the distributions $P(x|\theta)$ and $P(x|\theta_0)$.

I also wonder how to picture the distribution $P(x|\theta_0)$.

As I understand it, $\theta_0$ is just the parameter of the bias term.

Why do we even need a distribution for this?

Best Answer

I just wonder why he wants to measure the similarity between the distributions $p(x|\theta)$ and $p(x|\theta_0)$.

You're kind of asking the wrong question. If we're in a setting where we're using MLE, the idea is that we estimate our model's parameters by whichever parameter values maximize the likelihood of the observed data. It may well be that the true likelihood ($p(x|\theta_0)$) isn't actually in the parametric family of likelihoods we're working with!
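To make that concrete, here is the setup as I understand it (a sketch in my own notation, assuming i.i.d. data; the lecture may use slightly different symbols): the data $x_1, \dots, x_n$ are drawn from the true distribution $p(x|\theta_0)$, and the MLE picks

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log p(x_i|\theta),$$

where the maximization runs over our chosen parametric family, which need not contain $\theta_0$.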

What he's doing here is showing that performing MLE is equivalent to minimizing the KL divergence between the true likelihood and the family of likelihoods we're using for the MLE. So while the true $p(x|\theta_0)$ might not actually be inside the family of likelihoods you're performing MLE over, what this tells us is that the MLE from our family will be the closest member of that family, in KL divergence, to the true distribution. This is nice because even if we're off in our model specification, the MLE will still be close to the truth in some sense.
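To spell out why the two are equivalent (this is the standard argument; I believe it is what the slide is doing, though the exact steps may differ): as $n \to \infty$, by the law of large numbers

$$\frac{1}{n}\sum_{i=1}^{n} \log p(x_i|\theta) \;\longrightarrow\; \mathbb{E}_{x \sim p(x|\theta_0)}\big[\log p(x|\theta)\big],$$

and the KL divergence decomposes as

$$\mathrm{KL}\big(p(x|\theta_0)\,\|\,p(x|\theta)\big) = \underbrace{\mathbb{E}_{x \sim p(x|\theta_0)}\big[\log p(x|\theta_0)\big]}_{\text{does not depend on }\theta} \;-\; \mathbb{E}_{x \sim p(x|\theta_0)}\big[\log p(x|\theta)\big].$$

Since the first term is a constant in $\theta$, maximizing the expected log-likelihood over $\theta$ is exactly the same as minimizing $\mathrm{KL}\big(p(x|\theta_0)\,\|\,p(x|\theta)\big)$, so for large $n$ the MLE approximately minimizes the KL divergence to the truth.

If you want to see the misspecified case numerically, here is a minimal sketch (not from the lecture): the truth is a Laplace distribution, the model family is Gaussian, and the Gaussian MLE lands on essentially the same parameters as directly minimizing $\mathrm{KL}(p\,\|\,q)$ over the Gaussian family.

```python
import numpy as np
from scipy import stats, optimize, integrate

rng = np.random.default_rng(0)

# True data-generating distribution p(x|theta_0): Laplace(0, 1).
# Model family fit by MLE: Gaussians N(mu, sigma^2) -- deliberately misspecified.
true_dist = stats.laplace(loc=0.0, scale=1.0)

# 1) MLE within the Gaussian family on a large i.i.d. sample.
x = true_dist.rvs(size=200_000, random_state=rng)
mu_mle, sigma_mle = x.mean(), x.std()  # closed-form Gaussian MLE

# 2) The Gaussian minimizing KL(p || N(mu, sigma^2)).
#    KL(p || q) = -H(p) - E_p[log q], and H(p) doesn't depend on (mu, sigma),
#    so it suffices to minimize the cross-entropy -E_p[log q] numerically.
def cross_entropy(params):
    mu, log_sigma = params
    q = stats.norm(loc=mu, scale=np.exp(log_sigma))
    integrand = lambda t: -true_dist.pdf(t) * q.logpdf(t)
    return integrate.quad(integrand, -30.0, 30.0)[0]

res = optimize.minimize(cross_entropy, x0=[0.5, 0.0])
mu_kl, sigma_kl = res.x[0], np.exp(res.x[1])

print(f"Gaussian MLE:        mu = {mu_mle:.3f}, sigma = {sigma_mle:.3f}")
print(f"KL-closest Gaussian: mu = {mu_kl:.3f}, sigma = {sigma_kl:.3f}")
# Both land near mu = 0, sigma = sqrt(2) ~ 1.414 (the Laplace's mean and std).
```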