Where do we get the prior for Maximum a Posteriori Estimation

machine-learning, probability-distributions

I'm reading the book "Mathematics for Machine Learning"; it's a free book that you can find here. Section 8.3.2 is about Maximum A Posteriori Estimation (MAP).

What I understood about MAP is that it is similar to maximum likelihood estimation (MLE). In the context of MLE we have the probability distribution $p(x|\theta)$, which tells us how likely the data $x$ is to be generated by a model with parameters $\theta$. In the context of MAP we invert the distribution $p(x|\theta)$ using Bayes' rule (which I have simplified):
$$p(\theta|x) \propto p(x|\theta)p(\theta)$$

The point is that instead of asking how likely the data $x$ is to be generated by a model with parameters $\theta$ (which is captured by $p(x|\theta)$), we're asking how plausible the parameters $\theta$ are given that we observed the data $x$ (which is captured by $p(\theta|x)$).

However, to find $p(\theta|x)$ we have to know $p(\theta)$, which is prior knowledge about the parameters $\theta$. What I don't understand is: in the context of machine learning, where do we get this prior knowledge about the model parameters? What does $p(\theta)$ even mean, a probability distribution of model parameters? What does it mean to have a probability distribution over parameters?
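
To make my question concrete, here is a toy sketch of what I imagine $p(\theta)$ could look like (my own example, not from the book): a single parameter $\theta$, the bias of a coin, with a Beta prior, evaluated on a grid. Is this the right way to think about a prior over parameters?

```python
# Toy sketch (my own, not from the book): theta is the bias of a coin,
# the prior p(theta) is a Beta(2, 2) density saying "probably roughly fair",
# and MAP maximizes p(x | theta) * p(theta) instead of just p(x | theta).
import numpy as np
from scipy import stats

data = np.array([1, 1, 1, 0, 1, 1])        # 5 heads, 1 tail
thetas = np.linspace(0.001, 0.999, 999)    # grid over the parameter

# Log-likelihood log p(x | theta): sum of Bernoulli log-probabilities.
log_lik = np.array([stats.bernoulli.logpmf(data, t).sum() for t in thetas])

# Log-prior log p(theta): Beta(2, 2), a belief that the coin is roughly fair.
log_prior = stats.beta.logpdf(thetas, 2, 2)

theta_mle = thetas[np.argmax(log_lik)]               # maximizes p(x | theta)
theta_map = thetas[np.argmax(log_lik + log_prior)]   # maximizes p(x | theta) p(theta)

print(f"MLE: {theta_mle:.3f}  MAP: {theta_map:.3f}")  # MAP is pulled toward 0.5
```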

Best Answer

An example can be found on pages 363-366 (Section 11.4) of your book.

Assume we have a Gaussian mixture model (GMM) with $K$ components, and that each data point is generated by exactly one mixture component. Introduce a binary indicator $z_k\in\{0,1\}$ for whether the $k$th mixture component generated the data point, so that $p(\textbf x|z_k=1)=\mathcal N(\textbf x|\mu_k, \Sigma _k)$.

Assume the indicators are unknown and place a prior distribution on them: $p(z_k=1)=\pi_k$, where $\pi=(\pi_1,\dots,\pi_K)$.

Then the posterior distribution can be obtained using Bayes' rule:

$$p(z_k=1|\textbf x)=\frac{p(z_k=1)p(\textbf x|z_k=1)}{p(\textbf x)}=\frac{p(z_k=1)p(\textbf x|z_k=1)}{\sum _{j=1}^Kp(z_j=1)p(\textbf x|z_j=1)}=\frac{\pi_k\mathcal N(\textbf x|\mu_k, \Sigma _k)}{\sum _{j=1}^K\pi_j\mathcal N(\textbf x|\mu_j, \Sigma _j)}$$
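
Here is a minimal sketch of this responsibility formula for one-dimensional components (my own Python, not from the book; `responsibilities` is just a throwaway helper name):

```python
# Sketch of p(z_k = 1 | x) for a 1-D Gaussian mixture (assumes variances,
# not standard deviations, are passed in).
import numpy as np
from scipy import stats

def responsibilities(x, pis, means, variances):
    """Posterior probability that each mixture component generated x."""
    pis = np.asarray(pis)
    # Numerators: pi_k * N(x | mu_k, sigma_k^2) for every component k.
    numerators = pis * stats.norm.pdf(x, loc=means, scale=np.sqrt(variances))
    # Denominator: p(x) = sum_j pi_j * N(x | mu_j, sigma_j^2).
    return numerators / numerators.sum()
```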

Example:

Assume we have a mixture of three normal distributions with prior probability vector $(\pi_1, \pi_2, \pi_3)=(.3, .4, .3)$. These are our prior probabilities that a data point is drawn from each mixture component. Assume the first mixture component is $\mathcal N(0, 1)$, the second is $\mathcal N(5, 2)$, and the third is $\mathcal N(-3, 1)$, each parameterized by its mean and variance. Assume we observe a data point $x=2$. (Note this is a one-dimensional example.)

What would our posterior distribution $(\pi_1,\pi_2,\pi_3|x=2)$ be in light of this observation? I calculate:

$p(x)=.3\mathcal N(2|0, 1)+.4\mathcal N(2|5, 2)+.3\mathcal N(2|-3, 1)=0.02809076$. This will be the denominator in the calculations.

$p(z_1=1|x=2)=\frac{.3\mathcal N(2|0, 1)}{0.02809076}=.58$

$p(z_2=1|x=2)=\frac{.4\mathcal N(2|5, 2)}{0.02809076}=.42$

$p(z_3=1|x=2)=\frac{.3\mathcal N(2|-3, 1)}{0.02809076}=.000016$

Thus, our posterior belief is $(\pi|x=2)=(.58, .42, .000016)$, and if we were to use something like MAP we would conclude that the data point came from mixture component 1.
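
Reusing the `responsibilities` sketch from above, the arithmetic checks out:

```python
post = responsibilities(
    x=2.0,
    pis=[0.3, 0.4, 0.3],          # the prior p(z)
    means=[0.0, 5.0, -3.0],
    variances=[1.0, 2.0, 1.0],
)
print(post)  # ~ [0.577, 0.423, 0.000016], i.e. the .58 / .42 / .000016 above
```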

Something similar to this is a Hidden Markov Model, where the hidden states are not independent of each other but each depends on the previous state.

