Where do we get the prior for Maximum a Posteriori Estimation

machine-learning, probability-distributions

I'm reading the book "Mathematics for Machine Learning"; it's a free book that you can find here. Section 8.3.2 is about Maximum A Posteriori Estimation (MAP).

What I understood about MAP is that it is similar to maximum likelihood estimation (MLE). In the context of MLE we have the probability distribution $p(x|\theta)$, which tells us how likely the data $x$ is to be generated by a model with parameters $\theta$. In the context of MAP we invert the distribution $p(x|\theta)$ using Bayes' rule (which I have simplified):
$$p(\theta|x) \propto p(x|\theta)p(\theta)$$

The point is that instead of asking how likely the data $x$ is to be generated by a model with parameters $\theta$ (which is captured by $p(x|\theta)$), we're asking how plausible the parameters $\theta$ are given that we observed the data $x$ (which is captured by $p(\theta|x)$).

However, to find $p(\theta|x)$ we have to know $p(\theta)$, which is prior knowledge about the parameters $\theta$. What I don't understand is: in the context of machine learning, where do we get this prior knowledge about the model parameters? What does $p(\theta)$ even mean, a probability distribution of model parameters? What does it mean to have a probability distribution over parameters?
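
To make my question concrete, here is a toy sketch of what I imagine $p(\theta)$ could look like (my own example, not from the book): a single parameter $\theta$, the bias of a coin, with a Beta prior, evaluated on a grid. Is this the right way to think about a prior over parameters?

```python
# Toy sketch (my own, not from the book): theta is the bias of a coin,
# the prior p(theta) is a Beta(2, 2) density saying "probably roughly fair",
# and MAP maximizes p(x | theta) * p(theta) instead of just p(x | theta).
import numpy as np
from scipy import stats

data = np.array([1, 1, 1, 0, 1, 1])        # 5 heads, 1 tail
thetas = np.linspace(0.001, 0.999, 999)    # grid over the parameter

# Log-likelihood log p(x | theta): sum of Bernoulli log-probabilities.
log_lik = np.array([stats.bernoulli.logpmf(data, t).sum() for t in thetas])

# Log-prior log p(theta): Beta(2, 2), a belief that the coin is roughly fair.
log_prior = stats.beta.logpdf(thetas, 2, 2)

theta_mle = thetas[np.argmax(log_lik)]               # maximizes p(x | theta)
theta_map = thetas[np.argmax(log_lik + log_prior)]   # maximizes p(x | theta) p(theta)

print(f"MLE: {theta_mle:.3f}  MAP: {theta_map:.3f}")  # MAP is pulled toward 0.5
```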

Best Answer

An example can be found on pages 363-366 (Section 11.4) of your book.

Assume we have a Gaussian mixture model (GMM) with $K$ components, and that each data point is generated by exactly one mixture component. Introduce a binary indicator $z_k\in\{0,1\}$ for whether the $k$th mixture component generated the data point, so that $p(\textbf x|z_k=1)=\mathcal N(\textbf x|\mu_k, \Sigma _k)$.

Assume the indicators are unknown and place a prior distribution on them: $p(z_k=1)=\pi_k$, where $\pi=(\pi_1,\dots,\pi_K)$.

Then the posterior distribution can be obtained using Bayes' rule:

$$p(z_k=1|\textbf x)=\frac{p(z_k=1)p(\textbf x|z_k=1)}{p(\textbf x)}=\frac{p(z_k=1)p(\textbf x|z_k=1)}{\sum _{j=1}^Kp(z_j=1)p(\textbf x|z_j=1)}=\frac{\pi_k\mathcal N(\textbf x|\mu_k, \Sigma _k)}{\sum _{j=1}^K\pi_j\mathcal N(\textbf x|\mu_j, \Sigma _j)}$$
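
Here is a minimal sketch of this responsibility formula for one-dimensional components (my own Python, not from the book; `responsibilities` is just a throwaway helper name):

```python
# Sketch of p(z_k = 1 | x) for a 1-D Gaussian mixture (assumes variances,
# not standard deviations, are passed in).
import numpy as np
from scipy import stats

def responsibilities(x, pis, means, variances):
    """Posterior probability that each mixture component generated x."""
    pis = np.asarray(pis)
    # Numerators: pi_k * N(x | mu_k, sigma_k^2) for every component k.
    numerators = pis * stats.norm.pdf(x, loc=means, scale=np.sqrt(variances))
    # Denominator: p(x) = sum_j pi_j * N(x | mu_j, sigma_j^2).
    return numerators / numerators.sum()
```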

Example:

Assume we have a mixture of three normal distributions with prior probability vector $(\pi_1, \pi_2, \pi_3)=(.3, .4, .3)$. These are our prior probabilities that a data point is drawn from each mixture component. Assume the first mixture component is $\mathcal N(0, 1)$, the second is $\mathcal N(5, 2)$, and the third is $\mathcal N(-3, 1)$, each parameterized by its mean and variance. Assume we observe a data point $x=2$. (Note this is a one-dimensional example.)

What would our posterior distribution $(\pi_1,\pi_2,\pi_3|x=2)$ be in light of this observation? I calculate:

$p(x)=.3\mathcal N(2|0, 1)+.4\mathcal N(2|5, 2)+.3\mathcal N(2|-3, 1)=0.02809076$. This will be the denominator in the calculations.

$p(z_1=1|x=2)=\frac{.3\mathcal N(2|0, 1)}{0.02809076}=.58$

$p(z_2=1|x=2)=\frac{.4\mathcal N(2|5, 2)}{0.02809076}=.42$

$p(z_3=1|x=2)=\frac{.3\mathcal N(2|-3, 1)}{0.02809076}=.000016$

Thus, our posterior belief is $(\pi|x=2)=(.58, .42, .000016)$, and if we were to use something like MAP we would conclude that the data point came from mixture component 1.
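
Reusing the `responsibilities` sketch from above, the arithmetic checks out:

```python
post = responsibilities(
    x=2.0,
    pis=[0.3, 0.4, 0.3],          # the prior p(z)
    means=[0.0, 5.0, -3.0],
    variances=[1.0, 2.0, 1.0],
)
print(post)  # ~ [0.577, 0.423, 0.000016], i.e. the .58 / .42 / .000016 above
```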

Something similar to this is a Hidden Markov Model, where the hidden states are not independent of each other but each depends on the previous state.

