Bayesian Regression – What Is Meant by Bayesian Machine Learning?

bayesian classification posterior

Suppose I have a prediction task and I assume a Gaussian discriminative model:
$$
P(y|x,\theta)= N(y|\mu_x,\sigma_x)
$$

where $x\in \{0,1\}$ is the feature (1 for Company A, 0 for Company B) and $y\in \mathbb{R}$ is the delivery time.
The book "Probabilistic Machine Learning: An Introduction" (Murphy, 2022) says that there are two ways to estimate the parameters $\mu_x,\sigma_x$:

  1. Use MLE, which estimates the parameters as the empirical mean and variance, respectively.
  2. Take a Bayesian approach, using the posterior $P(\theta|y,x)$.

I fully understand the derivation and reasoning for using choice 1. However, I can't wrap my head around choice 2.

Suppose I take a fully Bayesian approach and place a Gaussian prior $N(\mu_x|\mu_0, \sigma_0)$ on $\mu_x$ (assuming that $\sigma_x$ is given, for simplicity). The posterior is then:
$$
P(\mu_x|y,x,\sigma_x)=N(\mu_x|\hat{\mu},\hat{\sigma})
$$

where $\hat{\mu}$ and $\hat{\sigma}$ are obtained by combining the prior parameters with the estimates from MLE (for example, $\hat{\mu}$ is a weighted average of $\mu_0$ and the empirical mean).


After computing, in a fully Bayesian manner, the parameters $\hat{\sigma},\hat{\mu}$ of $P(\mu_x|y,x,\sigma_x)$, how can I use them to solve the earlier prediction (regression) task?

Best Answer

A Bayesian computation provides not just point estimates of the unknown parameters (as in "standard" regression) but a full probability distribution of those parameters.

If your model is

$$ y|x,\theta \sim f(x,\theta) $$

where $\theta$ represents the unknown parameters of the model, with prior $\pi(\theta)$, then the Bayesian calculation gives the posterior probability distribution of $\theta$ as the likelihood of the observed data times the prior,

$$ \hat P(\theta) \equiv P(\theta | x,y) \propto P(y|x,\theta)\,\pi(\theta) $$

from which you can calculate the prediction for a new data point $x^*$, by integrating over all possible values of $\theta$ given its posterior distribution

$$ P(y^*|x^*) = \int d\theta P(y^*|x^*,\theta) \hat P(\theta) $$

which is again a probability distribution for $y^*$ (called the posterior predictive distribution).
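
For the model in the question, this integral has a closed form. The following is a sketch under stated assumptions: $\sigma_x$, $\sigma_0$ and $\hat{\sigma}$ are read as standard deviations (the question's notation leaves this open), and $n$ and $\bar{y}$ are symbols introduced here for the number of observations and the empirical mean of $y$ for the company $x$. The standard conjugate result for a Gaussian mean with known variance is

$$
\frac{1}{\hat{\sigma}^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma_x^2},
\qquad
\hat{\mu} = \hat{\sigma}^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{n\,\bar{y}}{\sigma_x^2}\right)
$$

and integrating the likelihood against this posterior gives

$$
P(y^*|x^*) = \int d\mu_x \, N(y^*|\mu_x,\sigma_x^2)\, N(\mu_x|\hat{\mu},\hat{\sigma}^2) = N(y^*|\hat{\mu},\, \sigma_x^2+\hat{\sigma}^2)
$$

So the prediction is centred on $\hat{\mu}$, and its variance is the noise variance plus the remaining uncertainty about $\mu_x$. As $n$ grows, $\hat{\sigma}^2 \to 0$ and $\hat{\mu} \to \bar{y}$, so the prediction approaches the plug-in MLE one.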

You can use this distribution to calculate, for example, the mean of $y^*$ as well as intervals that contain $y^*$ with a given probability (credible intervals), as demonstrated for example by this plot (taken from this blog).

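
As a rough illustration of the steps above for the delivery-time example, here is a minimal Python sketch. It assumes the noise standard deviation `sigma_x` is known, reads `sigma_0` as a standard deviation, and uses made-up delivery times. The function names and all numbers are hypothetical and not from the book or the question. It computes the posterior of $\mu_x$ for one company, the closed-form posterior predictive, a 95% interval, and then approximates the same integral by Monte Carlo sampling.

```python
import numpy as np
from statistics import NormalDist


def posterior_of_mean(y, sigma_x, mu_0, sigma_0):
    """Posterior N(mu_hat, sigma_hat^2) of a Gaussian mean with known noise std sigma_x."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Precisions (inverse variances) of prior and data add up.
    precision = 1.0 / sigma_0**2 + n / sigma_x**2
    var_hat = 1.0 / precision
    # Posterior mean: precision-weighted average of prior mean and empirical mean.
    mu_hat = var_hat * (mu_0 / sigma_0**2 + y.sum() / sigma_x**2)
    return float(mu_hat), float(np.sqrt(var_hat))


def posterior_predictive(mu_hat, sigma_hat, sigma_x):
    """Closed-form predictive N(mu_hat, sigma_x^2 + sigma_hat^2) for a new y*."""
    return NormalDist(mu=mu_hat, sigma=float(np.hypot(sigma_x, sigma_hat)))


# Hypothetical delivery times (in days) observed for Company A (x = 1).
y_company_a = [3.1, 2.7, 3.4, 2.9, 3.0]
sigma_x = 0.5               # assumed known noise std
mu_0, sigma_0 = 4.0, 1.0    # assumed prior on mu_x

mu_hat, sigma_hat = posterior_of_mean(y_company_a, sigma_x, mu_0, sigma_0)
pred = posterior_predictive(mu_hat, sigma_hat, sigma_x)

# Point prediction and central 95% interval for a new delivery time y*.
lower, upper = pred.inv_cdf(0.025), pred.inv_cdf(0.975)
print(f"posterior of mu_x: mean={mu_hat:.3f}, std={sigma_hat:.3f}")
print(f"predictive mean={pred.mean:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")

# Monte Carlo version of the integral: draw mu_x from the posterior,
# then draw y* from the likelihood given each sampled mu_x.
rng = np.random.default_rng(0)
mu_samples = rng.normal(mu_hat, sigma_hat, size=200_000)
y_samples = rng.normal(mu_samples, sigma_x)
print(f"Monte Carlo predictive mean={y_samples.mean():.3f}, std={y_samples.std():.3f}")
```

The sampling version is unnecessary here because the integral is available in closed form, but the same two-step recipe (sample $\theta$ from the posterior, then sample $y^*$ given $\theta$) carries over to models where it is not.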
