Pure Bayesian regression approach

Tags: bayesian, linear regression, regression

Based on my understanding, estimating the maximum a posteriori (MAP) value for the regression coefficients still relies heavily on a 'frequentist approach'. This is because I'm simply assuming a Gaussian prior $p(\textbf{w}|\alpha) = \mathcal{N}(\textbf{w}|\textbf{0},\alpha^{-1}I)$ and then using Bayes' theorem, which states that

$$p(\textbf{w}| \textbf{X,t}) \propto p(\textbf{t,X}|\textbf{w})p(\textbf{w}|\alpha)$$

where the right-hand side is the likelihood function multiplied by the prior.

Note: In many books I find the notation $p(\textbf{t}|\textbf{X,w})$ for the likelihood, but that sounds a bit confusing to me: the idea is that I'm considering a function of the parameters $\textbf{w}$ that is conditional on a dataset $\mathcal{D}$, so I also find the notation $p(\mathcal{D}|\textbf{w})$, which I like much more because for me $\mathcal{D} = \{\textbf{X,t}\}$. But then, when it comes to considering inputs and targets, the notation becomes the one I initially wrote on top, and this sometimes confuses me a bit. Why is this? END NOTE

Estimating the MAP still relies on a frequentist approach because I'm still going to do maximum likelihood, this time on the likelihood multiplied by the prior, and then maximize, so I still find a single 'point' value for the coefficients.

For a 'Pure Bayesian' treatment, I consider the parameters $\textbf{w}$ as random variables and I want to find the predictive distribution directly from the dataset. So basically I'm reading that:

\begin{align}
p(t|\textbf{x};\textbf{X,t}) &= \int p(t,\textbf{w}|\textbf{x};\textbf{X,t}) \, d\textbf{w}\\
&= \int p(t|\textbf{w,x};\textbf{X,t})p(\textbf{w}|\textbf{x};\textbf{X,t}) \, d\textbf{w}\\
&= \int p(t|\textbf{w,x})p(\textbf{w}|\textbf{X,t}) \, d\textbf{w}.
\end{align}

I would like to understand this last chain of equalities. First of all, what is the quantity $p(t|\textbf{x};\textbf{X,t})$ supposed to represent? It is not a likelihood function, because it does not depend on the parameters $\textbf{w}$. And why are we allowed to perform these manipulations inside the probability distributions?

Best Answer

I don't have enough reputation to comment on your post to request clarification, as there are some quantities you haven't defined. However, I am inferring from the notational conventions you have adopted that you are operating in the area of statistics/machine learning. On the queries you have outlined, it is perhaps best to clarify the likelihood function in this setting, then MAP estimation, and then the predictive distribution, highlighting aspects of your query as I proceed.

Context.

I am going to follow the notational conventions of the book Pattern Recognition and Machine Learning by Christopher Bishop. It should be remarked that the notation used in this book explicitly supports Bayesian interpretations of statistics, rather than frequentist interpretations.

Consider a training dataset $\{\mathbf{x}_n, t_n \}$ for $n = 1, \dots, N$, consisting of $N$ IID input vectors $\mathbf{x}_n$ and $N$ IID scalar target variables $t_n$. We stack the $N$ input vectors into a matrix $\mathbf{X}$ and the scalar target variables into a vector $\mathbf{t}$. Denote by $\mathbf{w}$ the parameter vector of regression coefficients.

We assume that we can model the target variable $t_n$ with the deterministic function $y(\mathbf{x}_n, \mathbf{w})$ plus additive Gaussian noise $\epsilon \sim \mathcal{N}(0, \beta^{-1})$, where $\beta$ is the noise precision (inverse variance):

$$t_n = y(\mathbf{x}_n, \mathbf{w}) + \epsilon = \mathbf{w}^T\mathbf{x}_n + \epsilon$$

Due to the assumption of additive Gaussian noise, the likelihood of a single $t_n$ is $p(t_n | \mathbf{x}_n, \mathbf{w}, \beta) = \mathcal{N}(t_n | y(\mathbf{x}_n, \mathbf{w}), \beta^{-1})$.

Emphasising that $\mathbf{t}$ collects $N$ independent observations of a scalar target variable, rather than a single observation of a multivariate target vector, the independence of the scalar targets $t_n$ gives the following likelihood function for $\mathbf{t}$:

$$p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) = \prod^N_{n=1} \mathcal{N}(t_n | \mathbf{w}^T \mathbf{x}_n, \beta^{-1})$$

Now we define a Gaussian prior on the parameter vector $\mathbf{w}$, so we have $p(\mathbf{w} | \alpha) = \mathcal{N}(\mathbf{w} | \mathbf{0}, \alpha^{-1}I)$.
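To make this setup concrete, here is a minimal numpy/scipy sketch, not from the original post: the dimensions, the simulated data and the values of $\alpha$ and $\beta$ are purely illustrative. It simulates data from the model above and evaluates the log-likelihood and the log-prior at a given $\mathbf{w}$.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)

N, D = 50, 3              # N observations, D-dimensional inputs (illustrative sizes)
alpha, beta = 2.0, 25.0   # prior precision alpha, noise precision beta (assumed values)

X = rng.normal(size=(N, D))            # stacked input vectors, one row per x_n
w_true = np.array([0.5, -1.0, 2.0])    # hypothetical "true" coefficients
t = X @ w_true + rng.normal(scale=beta**-0.5, size=N)   # t_n = w^T x_n + eps

def log_likelihood(w):
    # log p(t | X, w, beta) = sum_n log N(t_n | w^T x_n, 1/beta)
    return norm.logpdf(t, loc=X @ w, scale=beta**-0.5).sum()

def log_prior(w):
    # log p(w | alpha) = log N(w | 0, alpha^{-1} I)
    return multivariate_normal.logpdf(w, mean=np.zeros(D), cov=np.eye(D) / alpha)

print(log_likelihood(w_true), log_prior(w_true))
```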

The likelihood function.

Now to the query concerning whether the likelihood should be written $p(\mathcal{D} | \mathbf{w}) = p(\mathbf{t, X} | \mathbf{w})$ or $p(\mathbf{t} | \mathbf{X}, \mathbf{w})$.

I am fairly certain that it is the latter rather than the former, through consideration of the context of linear regression in statistics/supervised learning in machine learning. That is, we are interested in modelling the distribution of the target variables $\mathbf{t}$ using the information contained in the regressors $\mathbf{X}$ (i.e. conditioning on these variables) and a suitable value of the parameter $\mathbf{w}$. As we have assumed that $\mathbf{t}$ is a function of $\mathbf{X}$ that is linear in the parameter $\mathbf{w}$, plus additive Gaussian noise, the likelihood function $p(\mathbf{t} | \mathbf{X}, \mathbf{w})$ arises naturally from this assumption. So to clarify, the likelihood $p(\mathbf{t} | \mathbf{X}, \mathbf{w})$ is a function of both the input matrix $\mathbf{X}$ and the parameter $\mathbf{w}$, and the latter remains to be estimated according to a procedure with desirable statistical properties.

Now, I think the confusion possibly arises from the fact that you are using the notation $\mathcal{D} = \{\mathbf{t}, \mathbf{X} \}$ to refer to the "dataset". This is an acceptable form of notation, but it runs the risk of masking a distinction that is crucial to the linear regression/supervised learning context: between the part of the dataset $\mathcal{D}$ you are seeking to model, i.e. $\mathbf{t}$, and the information you are conditioning on, the input variables/regressors $\mathbf{X}$.

Hence your specification of the posterior distribution in this setting should read:

$$p(\mathbf{w} | \mathbf{t}, \mathbf{X}, \alpha, \beta) \propto p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) p(\mathbf{w} | \alpha)$$

And in Bishop you will find that he drops the $\alpha$ and $\beta$ for notational brevity, and also the input matrix $\mathbf{X}$. The latter is because in supervised learning/discriminative modelling in machine learning we are not interested in modelling the distribution of the input variables $\mathbf{X}$. This yields:

$$p(\mathbf{w} | \mathbf{t}) \propto p(\mathbf{t} | \mathbf{w}) p( \mathbf{w})$$
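Because both the prior and the likelihood are Gaussian in $\mathbf{w}$, this posterior is itself Gaussian, $p(\mathbf{w} | \mathbf{t}, \mathbf{X}, \alpha, \beta) = \mathcal{N}(\mathbf{w} | \mathbf{m}_N, \mathbf{S}_N)$ with $\mathbf{S}_N^{-1} = \alpha I + \beta \mathbf{X}^T\mathbf{X}$ and $\mathbf{m}_N = \beta \mathbf{S}_N \mathbf{X}^T \mathbf{t}$ (Bishop, eqs. 3.53-3.54, with the design matrix $\Phi = \mathbf{X}$ here). A short sketch, reusing the illustrative variables from the snippet above:

```python
# Closed-form Gaussian posterior p(w | t, X, alpha, beta) = N(w | m_N, S_N),
# following Bishop eqs. (3.53)-(3.54) with the design matrix Phi = X.
S_N_inv = alpha * np.eye(D) + beta * X.T @ X   # posterior precision
S_N = np.linalg.inv(S_N_inv)                   # posterior covariance
m_N = beta * S_N @ X.T @ t                     # posterior mean
```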

MAP estimation.

You are correct concerning MAP estimation being frequentist. Even though we take the Bayesian route, assume that the parameter $\mathbf{w}$ is a random variable and compute a posterior distribution over $\mathbf{w}$, the MAP estimator is a point estimator. That is, we select the value of $\mathbf{w}$ that maximises the posterior $p(\mathbf{w} | \mathbf{t}, \mathbf{X}, \alpha, \beta)$:

$$\mathbf{w}_{MAP} = \underset{\mathbf{w}}{\text{argmax}} \space p(\mathbf{w} | \mathbf{t}, \mathbf{X}, \alpha, \beta) = \underset{\mathbf{w}}{\text{argmax}} \space p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) p(\mathbf{w} | \alpha)$$

And yes, you are correct concerning the distinction between MAP estimation and maximum likelihood estimation. In the latter case you maximise only the likelihood $p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta)$. In MAP estimation, however, you are estimating the parameter $\mathbf{w}$ under a model with added $L_2$/Tikhonov regularisation, known as ridge regression. See the references at the bottom for further info.
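As a sketch of this equivalence (reusing the illustrative quantities from the snippets above): for this Gaussian model the posterior mode coincides with the posterior mean $\mathbf{m}_N$, and maximising $\log p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) + \log p(\mathbf{w}|\alpha)$ is the same as minimising a sum-of-squares error plus an $L_2$ penalty with $\lambda = \alpha/\beta$.

```python
# MAP estimate for the Gaussian model above: the posterior mode equals the
# posterior mean m_N, and coincides with the ridge solution for lambda = alpha/beta.
lam = alpha / beta
w_map = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ t)
assert np.allclose(w_map, m_N)
```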

Predictive distribution.

In order to understand what is going on here, you need to make a distinction between the notation for the test set and the training set. So, modifying your notation to denote a new input vector belonging to the test set by $\mathbf{x}_{0}$, we are interested in making a prediction of the corresponding target variable, which we denote $t_0$ (this is unknown).

Now via a marginalisation argument where we integrate out the parameter $\mathbf{w}$, the predictive distribution is:

$$\begin{align} p(t_0 | \mathbf{x}_0, \mathbf{t}, \mathbf{X}, \alpha, \beta) &= \int p(t_0, \mathbf{w} | \mathbf{t}, \mathbf{x}_0, \mathbf{X}, \alpha, \beta) d\mathbf{w} \\ &= \int p(t_0 | \mathbf{w}, \mathbf{t}, \mathbf{x}_0, \mathbf{X}, \beta) p(\mathbf{w} | \mathbf{x}_0, \mathbf{t}, \mathbf{X}, \alpha, \beta) d\mathbf{w} \\ &= \int \underbrace{p(t_0 | \mathbf{w}, \mathbf{x}_0, \beta)}_{\text{likelihood}} \underbrace{p(\mathbf{w} | \mathbf{t}, \mathbf{X}, \alpha, \beta)}_{\text{posterior}} d\mathbf{w} \end{align}$$

Similar to above and in Bishop, you can drop all the $\alpha$ and $\beta$ for notational brevity. In going from the 1st to the 2nd equality we just use the product rule of probability. In going from the 2nd to the 3rd equality we use conditional independence:

$$p(t_0 | \mathbf{w}, \mathbf{t}, \mathbf{x}_0, \mathbf{X}, \beta) = p(t_0 | \mathbf{w}, \mathbf{x}_0, \beta)$$

$$p(\mathbf{w} | \mathbf{x}_0, \mathbf{t}, \mathbf{X}, \alpha, \beta) = p(\mathbf{w} | \mathbf{t}, \mathbf{X}, \alpha, \beta)$$

If you aren't comfortable with why we can invoke conditional independence, just say so in the comments and I will reply, as this post is getting too long for my liking. Notice that the two terms on the right above are just the likelihood and posterior we computed previously.

Concerning your query on what the predictive distribution $p(t_0 | \mathbf{x}_0, \mathbf{t}, \mathbf{X}, \alpha, \beta)$ is supposed to represent.

It is just the product of the likelihood and the posterior, with the parameter $\mathbf{w}$ integrated out. Intuitively, you evaluate the likelihood of a value $t_0$ given $\mathbf{x}_0$ for a particular $\mathbf{w}$, weight that likelihood by your current belief about the parameter $\mathbf{w}$ given the data $\mathcal{D} = \{\mathbf{t}, \mathbf{X} \}$, and then integrate over all possible values $\mathbf{w}$ can take.

This is not a fully Bayesian approach, but it is more Bayesian than, say, predicting $t_0$ with the MAP point estimator $\mathbf{w}_{\text{MAP}}$ and computing $t_0 \approx \mathbf{w}_{\text{MAP}}^T\mathbf{x}_0$.
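To make the "weight by the posterior and integrate" intuition concrete, here is a sketch that approximates the integral by Monte Carlo, sampling $\mathbf{w}$ from the posterior and averaging the likelihood density, and compares it against the closed form available for this linear-Gaussian model, $p(t_0|\mathbf{x}_0,\mathbf{t},\mathbf{X},\alpha,\beta) = \mathcal{N}(t_0 \mid \mathbf{m}_N^T\mathbf{x}_0,\ \beta^{-1} + \mathbf{x}_0^T \mathbf{S}_N \mathbf{x}_0)$ (Bishop, eqs. 3.58-3.59). The test input $\mathbf{x}_0$ is hypothetical, and the snippet reuses the variables from the earlier sketches.

```python
x0 = np.array([1.0, 0.5, -0.3])        # hypothetical test input
t0_grid = np.linspace(-5.0, 5.0, 201)  # candidate values of the unknown target t_0

# Monte Carlo approximation of the predictive density:
# p(t_0 | x_0, t, X) ~ (1/S) * sum_s p(t_0 | w_s, x_0, beta),  with w_s ~ p(w | t, X)
w_samples = rng.multivariate_normal(m_N, S_N, size=5000)
pred_mc = norm.pdf(t0_grid[:, None], loc=w_samples @ x0, scale=beta**-0.5).mean(axis=1)

# Closed form for the linear-Gaussian model (Bishop eqs. 3.58-3.59)
pred_var = 1.0 / beta + x0 @ S_N @ x0
pred_exact = norm.pdf(t0_grid, loc=m_N @ x0, scale=np.sqrt(pred_var))

# Here the predictive mean m_N^T x0 equals the MAP point prediction w_MAP^T x0,
# but the predictive distribution also carries the variance 1/beta + x0^T S_N x0,
# which a point prediction throws away.
```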

References

For appropriate references on Bayesian linear regression, see Chapter 3 of "Pattern Recognition and Machine Learning" by Christopher Bishop (from which I have taken the notation, although some of the arguments I've made above aren't fully explained there). There is also a good table on p. 175 of "Machine Learning: A Probabilistic Perspective" by Kevin Murphy showing how Bayesian one can get. Finally, there is a very clear exposition from John Paisley, which is excellent for the link between MAP estimation, ridge regression, and Bayesian linear regression, here (see lecture 5 in particular). It's free to enrol, so if you sign up you can download all the course materials for free.
