Solved – Bayesian inference: how iterative parameter updates work

bayesian, estimation, inference, optimization, posterior

I have been struggling with this for a while. A typical optimisation problem can be viewed as minimising some cost function that combines a data term with a penalty term encouraging certain solutions, and normally there is a weighting term between the two.
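To make the setup concrete, the kind of cost I have in mind (in generic notation, with $\theta$ the parameters) is $$ E(\theta) \;=\; \underbrace{E_{\text{data}}(\theta)}_{-\log p(\text{data}\,\mid\,\theta)} \;+\; \lambda\, \underbrace{E_{\text{penalty}}(\theta)}_{\propto\,-\log p(\theta)}, $$ so that minimising $E$ is the same as maximising the posterior $p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta)\, p(\theta)$.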

In the Bayesian setting, this can be interpreted with the usual prior and likelihood function. In the problem I am trying to understand, I model the prior as a multivariate normal with zero mean and precision matrix $\lambda \Lambda$, where $\lambda$ can be thought of as this regularisation weighting and $\Lambda$ is some appropriate precision matrix structure that encodes the plausible solutions. In my particular example, the precision matrix encodes smoothness constraints on the estimated parameters, i.e. the prior encourages smooth solutions, and $\lambda$ denotes the strength of this smoothness penalty. A $\lambda$ of zero would give the ML estimate, where we only optimise the data term, i.e. the likelihood function: as $\lambda$ decreases, the prior precision decreases and hence the variance of each parameter under the prior increases, so low values of $\lambda$ move us towards the unregularised solution.
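To make this concrete, here is a toy sketch (not my actual problem; just a noisy signal observed directly, with $\Lambda$ chosen as a second-difference roughness matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy smoothing problem: a smooth signal observed directly with Gaussian noise.
P = 50
x = np.linspace(0, 1, P)
truth = np.sin(2 * np.pi * x)
beta = 25.0                                    # noise precision (assumed known here)
t = truth + rng.normal(scale=1 / np.sqrt(beta), size=P)

# Smoothness-encoding structure: Lambda = D^T D with D the second-difference
# operator.  (It is only positive semi-definite, so the prior is improper on its
# own, but the posterior below is still perfectly well defined.)
D = np.diff(np.eye(P), n=2, axis=0)
Lam = D.T @ D

def posterior_mean(lam):
    """Posterior mean under prior N(0, (lam*Lambda)^-1) and likelihood t = w + noise:
    m = beta * (beta*I + lam*Lambda)^-1 t."""
    return np.linalg.solve(beta * np.eye(P) + lam * Lam, beta * t)

for lam in [0.0, 1.0, 100.0]:
    m = posterior_mean(lam)
    rms = np.sqrt(np.mean((m - truth) ** 2))
    print(f"lambda = {lam:6.1f}   rms error vs truth = {rms:.3f}")
# lambda = 0 reproduces the noisy observations (the unregularised ML fit);
# increasing lambda pulls the estimate towards smooth solutions.
```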

Now, a typical thing I have seen is some sort of iterative scheme, where we start with an approximation to $\lambda$, compute the distribution over the other parameters of interest using an approximate scheme such as variational Bayes or Expectation Propagation, and then use this approximation to update our estimate of $\lambda$ (assuming the prior over $\lambda$ is of conjugate form, usually a Gamma distribution, which also keeps it positive).
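A minimal sketch of the kind of loop I mean, under the conjugate Gaussian–Gamma assumptions (the toy setup and all numbers are mine, purely for illustration; the rank term below accounts for the fact that a difference-based $\Lambda$ is only positive semi-definite):

```python
import numpy as np

rng = np.random.default_rng(1)

# Same kind of toy smoothing problem: t = w + noise, prior w ~ N(0, (lam*Lambda)^-1),
# with a conjugate Gamma(a0, b0) prior on lam.  All numbers are illustrative only.
P = 50
x = np.linspace(0, 1, P)
truth = np.sin(2 * np.pi * x)
beta = 25.0                                    # noise precision, assumed known here
t = truth + rng.normal(scale=1 / np.sqrt(beta), size=P)

D = np.diff(np.eye(P), n=2, axis=0)            # second-difference operator
Lam = D.T @ D
r = np.linalg.matrix_rank(Lam)                 # Lam has rank P - 2 (positive semi-definite)

a0, b0 = 1e-3, 1e-3                            # vague Gamma prior on lam
E_lam = 1e-6                                   # deliberately tiny initial guess

for _ in range(20):
    # q(w): Gaussian posterior over the parameters given the current E[lam]
    S = np.linalg.inv(beta * np.eye(P) + E_lam * Lam)
    m = beta * S @ t
    # q(lam): Gamma posterior update, using E[w^T Lam w] = m^T Lam m + tr(Lam S)
    aN = a0 + 0.5 * r
    bN = b0 + 0.5 * (m @ Lam @ m + np.trace(Lam @ S))
    E_lam = aN / bN

print("E[lambda] after 20 iterations:", E_lam)
```

Even starting from a tiny value, $\mathbb{E}[\lambda]$ climbs within a few iterations to the fixed point $\mathbb{E}[\lambda] \approx r\,/\,\mathbb{E}[\mathbf{w}^{\top}\Lambda\mathbf{w}]$, and this data-driven behaviour is exactly what I am trying to understand.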

Now, my question is: if I start with a very low value of $\lambda$ as my initial approximation, the prior term will hardly have any effect. Would this not push the estimated distribution towards solutions that are less plausible, i.e. basically assign high probability to unregularised solutions? I am having a lot of trouble understanding how this update scheme can actually find a good value of $\lambda$, i.e. the value of $\lambda$ that is optimal with respect to the observed data. Basically, what I have trouble understanding is what stops the inference from driving $\lambda$ down to zero, or close to zero, so as to prefer the unregularised maximum likelihood estimate. I really do not see how this value of $\lambda$ is being driven by the data, or by the evidence term.

Best Answer

The problem of finding the hyperparameters is called the evidence approximation. It is nicely explained in Bishop's book (page 166), or, in great detail, in this paper.

The idea is that your problem has the canonical form (the predictive distribution for a new sample) $$ p(t|\mathbf{t}) = \int p(t|\mathbf{w},\beta)\, p(\mathbf{w}|\mathbf{t},\alpha,\beta)\, p(\alpha,\beta|\mathbf{t})\, d\mathbf{w}\, d\alpha\, d\beta, $$ where $\mathbf{t}$ is your training data, $\mathbf{w}$ are your weights, and $\alpha,\beta$ are the hyperparameters (the prior precision, which plays the role of your $\lambda$, and the noise precision).

Computing this integral is expensive, or maybe even intractable, and it has an additional difficulty: the term $p(\alpha,\beta|\mathbf{t})$. This term tells us that we need to integrate over the whole ensemble of interpolators. In practice this means training the ensemble, that is, fitting the model for each setting of $(\alpha,\beta)$ to obtain $p(\mathbf{t}|\alpha,\beta)$, converting each of these into $p(\alpha,\beta|\mathbf{t})$ via Bayes' theorem, $$ p(\alpha,\beta|\mathbf{t}) \propto p(\mathbf{t}|\alpha,\beta)\, p(\alpha,\beta), $$ and finally summing (integrating) over all of them.
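To see what each of these evidence terms looks like, here is a small sketch (toy data and a polynomial basis of my own choosing) that evaluates the closed-form log evidence $\ln p(\mathbf{t}|\alpha,\beta)$ of a Gaussian linear model on a grid of $\alpha$ values, with $\beta$ held fixed:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression: t = sin(2*pi*x) + noise, polynomial basis, prior w ~ N(0, alpha^-1 I).
N, M = 25, 10
x = np.linspace(0, 1, N)
Phi = np.vander(x, M, increasing=True)         # design matrix
beta = 100.0                                   # noise precision, held fixed here
t = np.sin(2 * np.pi * x) + rng.normal(scale=1 / np.sqrt(beta), size=N)

def log_evidence(alpha):
    """Closed-form log marginal likelihood ln p(t | alpha, beta) of the Gaussian
    linear model (the standard closed form from Bishop's Section 3.5)."""
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

for alpha in np.logspace(-6, 4, 11):
    print(f"alpha = {alpha:9.1e}   ln p(t|alpha,beta) = {log_evidence(alpha):10.2f}")
```

The evidence peaks at an interior value of $\alpha$: very small $\alpha$ spreads the prior over many wildly varying functions that, on average, explain the observed $\mathbf{t}$ poorly, while very large $\alpha$ forbids the functions that fit it well. This trade-off is what keeps the hyperparameter from being driven to zero.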

The evidence framework assumes (the referenced paper gives validity conditions for this assumption) that $p(\alpha,\beta|\mathbf{t})$ has a dominant peak at some values $\hat{\alpha},\hat{\beta}$. Under this assumption you replace the integral by a point estimate at the peak, namely $$ p(t|\mathbf{t}) \approx \int p(t|\mathbf{w},\hat{\beta})\, p(\mathbf{w}|\mathbf{t},\hat{\alpha},\hat{\beta})\, d\mathbf{w}. $$
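In code, the remaining point-estimate predictive looks like this (the values of $\hat{\alpha},\hat{\beta}$ below are made up, standing in for wherever the evidence actually peaks):

```python
import numpy as np

rng = np.random.default_rng(3)

# Predictive distribution with the hyperparameters fixed at their (assumed) peak
# values: p(t_new | t) ~= N( m_N^T phi(x_new),  1/beta_hat + phi(x_new)^T S_N phi(x_new) ).
N, M = 25, 10
x = np.linspace(0, 1, N)
Phi = np.vander(x, M, increasing=True)
alpha_hat, beta_hat = 0.1, 100.0               # stand-ins for the evidence peak
t = np.sin(2 * np.pi * x) + rng.normal(scale=1 / np.sqrt(beta_hat), size=N)

S_N = np.linalg.inv(alpha_hat * np.eye(M) + beta_hat * Phi.T @ Phi)   # posterior covariance
m_N = beta_hat * S_N @ Phi.T @ t                                      # posterior mean

x_new = 0.37
phi_new = np.vander(np.array([x_new]), M, increasing=True)[0]
pred_mean = m_N @ phi_new
pred_var = 1 / beta_hat + phi_new @ S_N @ phi_new
print(f"predictive mean {pred_mean:.3f}, predictive std {np.sqrt(pred_var):.3f}")
```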

If the prior $p(\alpha,\beta)$ is relatively flat, then the problem of finding $\hat{\alpha}$ and $\hat{\beta}$ reduces to maximizing the marginal likelihood (the evidence) $p(\mathbf{t}|\alpha,\beta)$. In your case this term has a closed-form solution (it is also Gaussian).
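Maximizing that evidence leads to the standard fixed-point re-estimation equations (MacKay's updates, as used in Bishop's treatment of the evidence approximation); here is a toy sketch with simulated data of my own:

```python
import numpy as np

rng = np.random.default_rng(4)

# Evidence maximization by the standard fixed-point re-estimation equations,
# starting from a deliberately tiny alpha.  Toy data, for illustration only.
N, M = 25, 10
x = np.linspace(0, 1, N)
Phi = np.vander(x, M, increasing=True)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=N)

alpha, beta = 1e-6, 1.0                        # poor starting values
eig = np.clip(np.linalg.eigvalsh(Phi.T @ Phi), 0.0, None)   # clip round-off negatives

for _ in range(100):
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    gamma = np.sum(beta * eig / (alpha + beta * eig))    # effective number of parameters
    alpha = gamma / (m_N @ m_N)                          # alpha <- gamma / ||m_N||^2
    beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)    # 1/beta <- residual SS / (N - gamma)

print(f"alpha = {alpha:.4g}, beta = {beta:.4g}, gamma = {gamma:.2f} of {M} parameters")
```

Note that $\alpha$ is re-estimated as $\gamma / \|\mathbf{m}_N\|^2$, where $\gamma \le M$ counts the well-determined parameters, so nothing in the update drives $\alpha$ towards zero unless the data genuinely demand very large weights.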

P.S. In statistics this method is known as empirical Bayes. If you google it, you will find plenty of references. I find this one very nice, since it works through simpler problems in detail and carefully introduces all the necessary terms.
