Choosing the probability distribution and its parameters in maximum likelihood estimation

machine learning · maximum likelihood · probability distributions

I'm reading the book "Mathematics for Machine Learning" (a free book that you can find here), specifically section 8.3, which explains maximum likelihood estimation (MLE).
This is my understanding of how MLE works in machine learning:

Say we have a dataset of vectors $(x_1, x_2, \dots, x_N)$ with corresponding labels $(y_1, y_2, \dots, y_N)$, which are real numbers, and a model with parameters $\theta$. MLE is a way to find the best parameters $\theta$ for the model, so that the model maps $x_n$ to $\hat{y}_n$ with $\hat{y}_n$ as close to $y_n$ as possible.

For each pair $(x_n, y_n)$ we have a probability distribution $p(y_n|x_n,\theta)$. Basically, it measures how likely it is that our model with parameters $\theta$ outputs $y_n$ when we feed it $x_n$ (and the bigger this probability, the better).

We then take the logarithm of each of these probabilities and sum them up, like this:
$$\sum_{n=1}^N\log{p(y_n|x_n,\theta)}$$

The bigger this sum, the better our model with parameters $\theta$ explains the data, so we want to maximize it.
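Just to check my own understanding, here is a tiny sketch of how I would compute this sum numerically (the toy numbers, the choice of a Gaussian $p$, and $\sigma=0.5$ are placeholders I made up; the Gaussian anticipates the book's example below):

```python
import numpy as np
from scipy.stats import norm

# Toy data: each row of X is a vector x_n, y holds the corresponding labels y_n.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])
y = np.array([2.1, 2.9, 5.2])

theta = np.array([0.0, 1.0])   # candidate model parameters
sigma = 0.5                    # assumed noise scale (placeholder)

# sum_n log p(y_n | x_n, theta), here with p taken to be Gaussian
log_likelihood = np.sum(norm.logpdf(y, loc=X @ theta, scale=sigma))
print(log_likelihood)
```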

What I don't understand is how we choose the probability distribution $p(y_n|x_n,\theta)$ and its parameters. In the book there is Example 8.4, where they choose the probability distribution to be a Gaussian with zero mean, $\epsilon_n \sim \mathcal{N}(0,\,\sigma^{2})$. They then assume that the linear model $x_n^T\theta$ is used for prediction, so:
$$p(y_n|x_n,\theta) = \mathcal{N}(y_n|x_n^T\theta,\,\sigma^{2})$$
and I don't understand why they replaced the zero mean with $x_n^T\theta$, or where we get the variance $\sigma^{2}$ from.

So this is my question: how do we choose the probability distribution and its parameters? In the example above the distribution is Gaussian, but it could be any other distribution, and different distributions have different types and numbers of parameters. Also, as I understand it, each pair $(x_n, y_n)$ has its own probability distribution $p(y_n|x_n,\theta)$, which complicates the problem even more.

I would really appreciate your help. Note that I'm just learning the math for machine learning and am not very skilled yet. If you need any additional info, please ask in the comments.

Thanks!

Best Answer

$\epsilon_n\sim \mathcal N(0, \sigma^2)$ is not the probability distribution of the data; it is the distribution of the random noise/error.

They assume the data follow a perfectly linear relationship with additive noise $\epsilon_n$ for each data point (label pair) $(\textbf x_n, y_n)$, so each observation is the least squares regression line $y=\textbf x^T\theta$ with some noise added to it: $y_n = \textbf x_n^T\theta + \epsilon_n$. Putting these together, the author gets $p(y_n|\textbf x_n, \theta, \sigma^2)=\mathcal N(y_n \mid \textbf x_n^T\theta, \sigma^2)$. You are right that $\sigma^2$ is an extra parameter that would need to be estimated, and it should therefore be included in the conditional.
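Spelled out, the step you are asking about is just a shift of a Gaussian: since $\epsilon_n$ has mean $0$ and variance $\sigma^2$, adding the constant $\textbf x_n^T\theta$ shifts the mean and leaves the variance unchanged,
$$y_n = \textbf x_n^T\theta + \epsilon_n,\quad \epsilon_n\sim\mathcal N(0,\,\sigma^2) \;\Longrightarrow\; y_n\mid \textbf x_n,\theta \sim \mathcal N(\textbf x_n^T\theta,\,\sigma^2).$$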

Each $p(y_n|\textbf x_n, \theta)$ (omitting the $\sigma^2$ for brevity this time) has its own distribution, but under the random sample assumption the labels are assumed to be independent of each other. Hence the overall likelihood is

$$\prod_{n=1}^N p(y_n|\textbf x_n,\theta)$$

due to their coming from a random sample. Taking the logarithm gives the log-likelihood, which turns the product into a summation by moving the logarithm inside:

$$\log \prod_{n=1}^N p(y_n|\textbf x_n,\theta)=\sum_{n=1}^N \log p(y_n|\textbf x_n,\theta)$$
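For concreteness, here is a minimal numerical sketch (not from the book; the simulated data and variable names are mine): maximizing this Gaussian log-likelihood over $\theta$ and $\sigma$ recovers, up to numerical error, the same $\hat\theta$ as ordinary least squares, which is the least squares connection mentioned above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulate data from the assumed model y_n = x_n^T theta + eps_n, eps_n ~ N(0, sigma^2).
N, true_theta, true_sigma = 200, np.array([1.5, -2.0]), 0.7
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])  # each row is x_n = (1, feature)
y = X @ true_theta + rng.normal(0.0, true_sigma, N)

# Negative log-likelihood: -sum_n log N(y_n | x_n^T theta, sigma^2).
def neg_log_lik(params):
    theta, log_sigma = params[:-1], params[-1]   # optimize log(sigma) so that sigma stays positive
    return -np.sum(norm.logpdf(y, loc=X @ theta, scale=np.exp(log_sigma)))

result = minimize(neg_log_lik, x0=np.zeros(3))
theta_mle, sigma_mle = result.x[:-1], np.exp(result.x[-1])

# The MLE of theta coincides (up to numerical error) with the least squares solution.
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_mle, theta_ols, sigma_mle)
```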
