Solved – How can (L1 / L2) regularization be equivalent to using a prior when the prior can't be influenced by the data

bayesian, map-estimation, philosophical, regularization

I understand the argument for how training with an L1/L2 regularizer is the same thing as finding the MAP estimate when the prior is Laplace/Gaussian, respectively. But there's a crucial difference. In Bayes' theorem, the prior must not be influenced by the data, while in practice ML people tend to tune the regularization strength to maximize the validation score. This seems to contradict the Bayesian interpretation. In fact, it sounds closer to "Empirical Bayes". How would people in the ML and Bayesian communities respond to this?
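For concreteness (writing $w$ for the parameters, $x$ for the data, and $\tau^2$ for the prior variance), the argument I'm referring to is

$$\hat{w}_{\text{MAP}} = \arg\max_w \big[\log P(x \mid w) + \log P(w)\big] = \arg\min_w \Big[-\log P(x \mid w) + \tfrac{1}{2\tau^2}\lVert w \rVert_2^2\Big],$$

so MAP estimation with a Gaussian prior is L2-regularized training with penalty strength $1/(2\tau^2)$ (and a Laplace prior gives the L1 penalty in the same way). That penalty strength is exactly what gets tuned on the validation set.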

Best Answer

If you have a hyperprior $P(\alpha)$ (where $\alpha$ is the regularization parameter), then $P(\alpha \mid x) = \frac{P(x \mid \alpha)P(\alpha)}{P(x)}$. If we assume that $P(\alpha)$ is very broad, i.e. nearly constant over a wide range of values, then $P(\alpha \mid x) \propto P(x \mid \alpha)$, so maximizing the posterior over $\alpha$ reduces to maximizing the marginal likelihood $P(x \mid \alpha)$. That is exactly type-II maximum likelihood (empirical Bayes), and choosing $\alpha$ by held-out performance is a practical proxy for it, so we can select our regularization based on whatever works best on the validation set.
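As a rough numerical sketch of that point (mine, not part of the original answer, with an assumed ridge-regression setup: $y = Xw + \varepsilon$, prior $w \sim \mathcal{N}(0, \alpha^{-1}I)$, known noise scale $\sigma$), one can compare picking $\alpha$ by validation error with picking it by maximizing the marginal likelihood $P(y \mid \alpha)$, which is the flat-hyperprior / empirical-Bayes criterion described above:

```python
# Sketch: compare two ways of choosing the ridge strength alpha.
# (a) validation MSE, (b) marginal likelihood P(y | alpha) (empirical Bayes).
# Model assumed: y = X w + eps,  w ~ N(0, alpha^{-1} I),  eps ~ N(0, sigma^2 I).
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 10, 0.5
X = rng.normal(size=(n, d))
w_true = rng.normal(scale=0.3, size=d)
y = X @ w_true + sigma * rng.normal(size=n)

# Train/validation split for the "tune on the validation set" route.
X_tr, X_val, y_tr, y_val = X[:150], X[150:], y[:150], y[150:]

def ridge_fit(X, y, alpha):
    """Closed-form ridge / MAP estimate with Gaussian prior precision alpha."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * sigma**2 * np.eye(d), X.T @ y)

def log_marginal(X, y, alpha):
    """log P(y | alpha), using y ~ N(0, sigma^2 I + alpha^{-1} X X^T)."""
    n = X.shape[0]
    K = sigma**2 * np.eye(n) + (X @ X.T) / alpha
    sign, logdet = np.linalg.slogdet(K)
    return -0.5 * (logdet + y @ np.linalg.solve(K, y) + n * np.log(2 * np.pi))

alphas = np.logspace(-3, 3, 61)
val_mse = [np.mean((X_val @ ridge_fit(X_tr, y_tr, a) - y_val) ** 2) for a in alphas]
evidence = [log_marginal(X, y, a) for a in alphas]

print("alpha chosen by validation MSE     :", alphas[np.argmin(val_mse)])
print("alpha chosen by marginal likelihood:", alphas[np.argmax(evidence)])
```

In this toy setting the two criteria typically select $\alpha$ values in the same neighborhood, which is the sense in which tuning the regularizer on a validation set behaves like choosing the mode of $P(\alpha \mid x)$ under a nearly flat hyperprior.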
