Bayesian – How to Prove Posterior of Regression Coefficients is Gaussian in MAP Regularized Logistic Regression?

bayesian, map-estimation, normal-distribution, posterior, regression-coefficients

The logistic regression model is
$$
p(y=\pm 1 \mid \mathbf{x}, \mathbf{w})=\sigma\left(y \mathbf{w}^{\mathrm{T}} \mathbf{x}\right)=\frac{1}{1+\exp \left(-y \mathbf{w}^{\mathrm{T}} \mathbf{x}\right)}
$$
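
As a minimal numerical sketch (the function name `predict_proba` is mine, not the paper's), this is all the model does:

```python
import numpy as np

def predict_proba(w, x, y=1):
    """P(y | x, w) for a label y in {-1, +1} under the logistic model above."""
    return 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))
```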

It can be used for binary classification or for predicting the certainty of a binary outcome. See Cox & Snell (1970) for the use of this model in statistics. This note focuses only on computational issues related to maximum-likelihood or more generally maximum a-posteriori (MAP) estimation. A common prior to use with MAP is:
$$
p(\mathbf{w}) \sim \mathcal{N}\left(\mathbf{0}, \lambda^{-1} \mathbf{I}\right)
$$

Using $\lambda>0$ gives a "regularized" estimate of $\mathrm{w}$ which often has superior generalization performance, especially when the dimensionality is high (Nigam et al., 1999).

Given a data set $(\mathbf{X}, \mathbf{y})=\left[\left(\mathbf{x}_{1}, y_{1}\right), \ldots,\left(\mathbf{x}_{N}, y_{N}\right)\right]$, we want to find the parameter vector $\mathbf{w}$ which maximizes:
$$
l(\mathbf{w})=-\sum_{i=1}^{N} \log \left(1+\exp \left(-y_{i} \mathbf{w}^{\mathrm{T}} \mathbf{x}_{i}\right)\right)-\frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}
$$
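
Continuing the numpy sketch above, here is one way the objective could be evaluated in code. This is only a sketch: it assumes the rows of `X` are the examples $\mathbf{x}_i^{\mathrm{T}}$, that `y` holds the $\pm 1$ labels, and the name `log_posterior` is my own.

```python
def log_posterior(w, X, y, lam):
    """l(w): penalized log-likelihood. X is (N, d), y is (N,) with +/-1 entries."""
    margins = y * (X @ w)                            # y_i * w^T x_i for each example
    # log(1 + exp(-m)) evaluated stably as logaddexp(0, -m)
    return -np.sum(np.logaddexp(0.0, -margins)) - 0.5 * lam * (w @ w)
```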

The gradient of this objective is
$$
\mathbf{g}=\nabla_{\mathbf{w}} l(\mathbf{w})=\sum_{i}\left(1-\sigma\left(y_{i} \mathbf{w}^{\mathrm{T}} \mathbf{x}_{i}\right)\right) y_{i} \mathbf{x}_{i}-\lambda \mathbf{w}
$$
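
Under the same conventions the gradient translates almost literally (a sketch; it only needs numpy):

```python
def gradient(w, X, y, lam):
    """g = sum_i (1 - sigma(y_i w^T x_i)) y_i x_i - lam * w."""
    margins = y * (X @ w)
    coef = 1.0 - 1.0 / (1.0 + np.exp(-margins))      # 1 - sigma(y_i w^T x_i)
    return X.T @ (coef * y) - lam * w
```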

The Hessian of the objective is
$$
\mathbf{H}=\frac{\mathrm{d}^{2} l(\mathbf{w})}{\mathrm{d} \mathbf{w} \mathrm{d} \mathbf{w}^{\mathrm{T}}}=-\sum_{i} \sigma\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}_{i}\right)\left(1-\sigma\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}_{i}\right)\right) \mathbf{x}_{i} \mathbf{x}_{i}^{\mathrm{T}}-\lambda \mathbf{I}
$$

which in matrix form can be written
$$
\begin{aligned}
a_{i i} &=\sigma\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}_{i}\right)\left(1-\sigma\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}_{i}\right)\right) \\
\mathbf{H} &=-\mathbf{X} \mathbf{A} \mathbf{X}^{\mathrm{T}}-\lambda \mathbf{I}
\end{aligned}
$$
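
With rows of `X` as examples, the note's $\mathbf{X}\mathbf{A}\mathbf{X}^{\mathrm{T}}$ becomes $\mathbf{X}^{\mathrm{T}}\mathbf{A}\mathbf{X}$, and a bare-bones Newton iteration (no line search or convergence check; `newton_map` is my own name) gives the MAP estimate:

```python
def hessian(w, X, lam):
    """H = -X^T A X - lam * I, with A_ii = sigma(w^T x_i) (1 - sigma(w^T x_i))."""
    s = 1.0 / (1.0 + np.exp(-(X @ w)))               # sigma(w^T x_i)
    a = s * (1.0 - s)
    return -(X.T * a) @ X - lam * np.eye(X.shape[1])

def newton_map(X, y, lam, iters=25):
    """Newton's method for w_hat = argmax l(w), reusing gradient() from above."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - np.linalg.solve(hessian(w, X, lam), gradient(w, X, y, lam))
    return w
```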

Note that the Hessian does not depend on how the x's are labeled. It is nonpositive definite (negative definite when $\lambda>0$), which means $l(\mathbf{w})$ is concave.
For sufficiently large $\lambda$, the posterior for $\mathrm{w}$ will be approximately Gaussian:
$$
\begin{aligned}
p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) & \approx N\left(\mathbf{w} ; \hat{\mathbf{w}},-\mathbf{H}^{-1}\right) \\
\hat{\mathbf{w}} &=\operatorname{argmax} p(\mathbf{w}) \prod_{i} p\left(y_{i} \mid \mathbf{x}_{i}, \mathbf{w}\right)
\end{aligned}
$$
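
In code terms, once the Newton iteration above has converged, the approximation is just the mode together with the negative inverse Hessian evaluated there (`X`, `y` and `lam` stand for some data set and regularizer you have chosen):

```python
w_hat = newton_map(X, y, lam)                        # posterior mode (MAP estimate)
post_cov = np.linalg.inv(-hessian(w_hat, X, lam))    # posterior covariance = -H^{-1}
```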


Is the Gaussian form of the posterior over the weights a trivial consequence of the prior being normally distributed?

I think this proof might involve Bayes' rule, and the paper states that the marginal likelihood can be approximated by

$$
p(\mathbf{y} \mid \mathbf{X}) \approx p(\mathbf{y} \mid \mathbf{X}, \hat{\mathbf{w}}) p(\hat{\mathbf{w}})(2 \pi)^{d / 2}|-\mathbf{H}|^{-1 / 2}
$$

where $d$ is the dimensionality of $\mathrm{w}$.
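
Taking logs, this approximation can be computed from the pieces already sketched above (`log_evidence` is my own name, same conventions as before):

```python
def log_evidence(X, y, lam):
    """Laplace approximation to log p(y | X) from the formula above."""
    w_hat = newton_map(X, y, lam)
    d = X.shape[1]
    log_lik = -np.sum(np.logaddexp(0.0, -y * (X @ w_hat)))       # log p(y | X, w_hat)
    log_prior = 0.5 * d * np.log(lam / (2 * np.pi)) - 0.5 * lam * (w_hat @ w_hat)
    _, logdet = np.linalg.slogdet(-hessian(w_hat, X, lam))       # log |-H|
    return log_lik + log_prior + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet
```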


https://tminka.github.io/papers/logreg/minka-logreg.pdf

Best Answer

The article appears to be written for stats veterans rather than newcomers, which is why the authors did not explain everything in detail (I really hate when authors do that, especially when you're only getting into the field).


Let's start with a quick recap first.

The logistic regression model gives the probability of a binary label given a feature vector:

$$
p(y=\pm 1 \mid \mathbf{x}, \mathbf{w}) = \sigma(y\,\mathbf{w}^\top\mathbf{x}) = \frac{1}{1+e^{-y \mathbf{w}^\top \mathbf{x}}}.
$$

Sometimes a bias parameter $b$ is added to the model, replacing $\mathbf{w}^\top\mathbf{x}$ with $\mathbf{w}^\top\mathbf{x}+b$, but it is often dropped to ease the notation. It's not hard to put it back in, though.

The author does not say it explicitly, but as far as I can tell he is using L2 regularization. It's not super important, but it can help you put some things together later on.

I assume you are familiar with Bayes' rule, so I'm not going into much detail here.

It is also common to describe L2 regularized logistic regression as MAP (maximum a-posteriori) estimation with a Gaussian $\mathcal{N}\left(\mathbf{0}, \sigma^2_w \mathbb{I}\right)$ prior. The “most probable” weights coincide with an L2 regularized estimate.
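
To spell out the connection: the negative log of that prior is
$$
-\log p(\mathbf{w}) = \frac{1}{2\sigma^2_w}\,\mathbf{w}^\top\mathbf{w} + \text{const},
$$
so maximizing the posterior is the same as maximizing the log-likelihood minus an L2 penalty with $\lambda = 1/\sigma^2_w$, which is exactly the $\frac{\lambda}{2}\mathbf{w}^\top\mathbf{w}$ term in the objective quoted above.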

However, MAP estimation is not a “Bayesian” procedure. MAP can only be viewed as an approximation to the Bayesian procedure.

Laplace approximation.

In order to understand how the posterior of the weights

$$
p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) \approx N\left(\mathbf{w} ; \hat{\mathbf{w}},-\mathbf{H}^{-1}\right)
$$

has been derived, you can try the Laplace approximation.

The Laplace approximation is one of the possible ways to approximate a distribution with a Gaussian. The Laplace approximation sets the mode of the Gaussian approximation to the mode of the posterior distribution and matches the curvature of the log probability density at that location.
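
In symbols, for a generic posterior with mode $\hat{\boldsymbol{\theta}}$ the recipe gives
$$
p(\boldsymbol{\theta} \mid \mathcal{D}) \approx N\left(\boldsymbol{\theta} ; \hat{\boldsymbol{\theta}}, \mathbf{A}^{-1}\right), \qquad \mathbf{A} = -\nabla\nabla \log p(\boldsymbol{\theta} \mid \mathcal{D})\big|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}}.
$$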

It's described in greater detail in: MacKay, D. J. C. (2003), Information Theory, Inference, and Learning Algorithms, pp341–342 or Murphy, K. P. (2013), Machine learning: a probabilistic perspective., p255.

And so you can try to match the distributions. First of all, find the most probable setting of the parameters, that is,

$$ \mathbf{w}^* = \operatorname{argmax}_\mathbf{w} p(\mathbf{w}|\mathbf{X},\mathbf{y}) = \operatorname{argmax}_\mathbf{w} \log p(\mathbf{w},\mathbf{X},\mathbf{y}). $$

The conditional probability on the left is what we intuitively want to optimize. The maximization on the right gives the same answer but contains the term you can actually compute. Why?

Because $\mathrm{log}$ is a monotonic transformation and hence, maximizing the log of a function is equivalent to maximizing the original function.

Then, if you follow the textbooks and match the minimum and curvature of the negative log-probability to those of a Gaussian, you'll get the Laplace approximation to the posterior distribution as:

$$
p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) \approx N\left(\mathbf{w} ; \hat{\mathbf{w}},-\mathbf{H}^{-1}\right).
$$
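
To spell out that matching step: expand the log-posterior to second order around $\hat{\mathbf{w}}$, where the gradient vanishes,
$$
\log p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) \approx \log p(\hat{\mathbf{w}} \mid \mathbf{X}, \mathbf{y}) + \frac{1}{2}(\mathbf{w}-\hat{\mathbf{w}})^\top \mathbf{H}\,(\mathbf{w}-\hat{\mathbf{w}}),
$$
and exponentiate:
$$
p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) \propto \exp\left(-\frac{1}{2}(\mathbf{w}-\hat{\mathbf{w}})^\top(-\mathbf{H})(\mathbf{w}-\hat{\mathbf{w}})\right),
$$
which is the kernel of $N\left(\hat{\mathbf{w}}, -\mathbf{H}^{-1}\right)$, with $\mathbf{H}$ evaluated at $\hat{\mathbf{w}}$. So the Gaussian form is not a trivial consequence of the Gaussian prior; it comes from this quadratic expansion. What the Gaussian prior contributes is the $-\lambda\mathbf{I}$ term that keeps $-\mathbf{H}$ positive definite and the log-posterior strictly concave, so the mode is unique.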

PS. There may be other ways to do it, but I believe this approach should work.
