Regression – Exploring MCP and SCAD Penalized Regression as Bayesian Regression with Specific Priors

machine learning, regression, regularization

Given that the coefficients of a ridge regression with a squared L2 norm penalty correspond to the maximum a posteriori (MAP) estimate of a Bayesian regression with Gaussian priors on the coefficients, and that LASSO coefficients with an L1 norm penalty correspond to the MAP estimate of a Bayesian regression with Laplace priors on the coefficients, are there any known prior distributions in Bayesian regression whose MAP estimates would be equivalent to MCP or SCAD penalized regression? Or could MCP and SCAD be approximated by ridge or LASSO regressions with a particular choice of adaptive penalties (in the same way that the LASSO can be approximated by ridge regression with adaptive penalties)? In that case they might still have some empirical Bayes interpretation.
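(To make the correspondence concrete: assuming a Gaussian likelihood with noise variance $\sigma^2$ and independent priors on the coefficients, the MAP estimate is

$$\hat\beta_{\mathrm{MAP}} = \arg\min_\beta \; \frac{1}{2\sigma^2}\lVert y - X\beta\rVert_2^2 \;-\; \sum_p \log p(\beta_p),$$

so Gaussian priors $p(\beta_p)\propto e^{-\beta_p^2/(2\tau^2)}$ give the ridge penalty $\frac{1}{2\tau^2}\sum_p \beta_p^2$, while Laplace priors $p(\beta_p)\propto e^{-|\beta_p|/b}$ give the LASSO penalty $\frac{1}{b}\sum_p |\beta_p|$.)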

Best Answer

I see two distinct questions here:

1) Can MCP, SCAD, or other folded concave penalties with strict debiasing properties be viewed as yielding MAP estimates with respect to some prior?

No, or at least, not with respect to a proper prior. Intuitively, this is because in order to get unbiasedness, we require that the penalty be constant for large parameter values, which translates to placing constant probability density arbitrarily far from the origin, and consequently to a non-integrable (i.e. improper) prior.

To state this more precisely, recall that to go from a penalty $r(\beta)$ to the prior that would induce it, we exponentiate its negation: $P(\beta) = e^{-r(\beta)}$. Let's consider the penalties which achieve the oracle rates described in [1]. Their unbiasedness condition on the penalty (bottom of page 3; I'll call it strict unbiasedness) is that $r'(\beta) = 0$ for $|\beta| \geq M$ for some $M$. Under suitable regularity conditions, this means that $r(\beta) = c$ for some constant $c$ whenever $|\beta|>M$. Thus:

$$\int_{\mathbb{R}} e^{-r(\beta)} d\beta\geq \int_{|\beta|>M} e^{-r(\beta)}d\beta = \int_{|\beta|>M} e^{-c}d\beta=\infty .$$
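To see the impropriety numerically, here is a minimal sketch in Python (the SCAD penalty of [1] with the conventional $a=3.7$; the function name `scad_penalty` is my own). Because $e^{-r(\beta)}$ is a positive constant beyond $a\lambda$, the "prior" mass on $[-T, T]$ grows roughly linearly in $T$ rather than converging:

```python
import numpy as np

def scad_penalty(beta, lam=1.0, a=3.7):
    """SCAD penalty r(beta) of [1]; constant for |beta| >= a*lam."""
    b = np.abs(beta)
    quad = (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))
    flat = lam**2 * (a + 1) / 2
    return np.where(b <= lam, lam * b, np.where(b <= a * lam, quad, flat))

# Riemann-sum approximation of the integral of exp(-r) over [-T, T]:
# it keeps growing with T because the tails of exp(-r) are flat.
for T in (10, 100, 1000):
    grid = np.linspace(-T, T, 200_001)
    mass = np.exp(-scad_penalty(grid)).sum() * (grid[1] - grid[0])
    print(f"T = {T:5d}   integral over [-T, T] ~ {mass:.1f}")
```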

Now, just because the prior is improper doesn't necessarily stop us from using it in practice. Indeed, [3] show the results of running MCMC with a SCAD prior. So far as I can tell, they do not comment on the impropriety of the prior, nor establish the propriety of the posterior. But the impropriety certainly does mean that we lose some nice generative interpretations that proper Bayesian analysis carries.

2) Can we approximate the cost functions induced by these folded concave penalties using reweighted Lassos?

Yes, algorithms have been proposed for fitting folded concave penalties via iteratively reweighted Lasso regressions, where each coefficient $\beta_p$ is given its own penalty coefficient $\lambda_p^t$ that changes at each iteration $t$. The working penalty at a given iteration is thus $\sum_{p=1}^P \lambda_p^t |\beta_p|$. If we want to match the gradient of the true penalty, we set $\lambda_p^t = r'(|\beta_p^{t-1}|)$; see the sketch after this paragraph. [2] proposes doing just one step of this and shows that the procedure has the right asymptotics. I'm sure there's an article describing an iterative version with more than one step, but I can't for the life of me find it at the moment.
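For what it's worth, here is a minimal sketch of such an iterative scheme (essentially the local linear approximation idea). scikit-learn's `Lasso` is assumed, and the helper names `scad_derivative` and `reweighted_lasso_scad` are my own, not from [2]:

```python
import numpy as np
from sklearn.linear_model import Lasso

def scad_derivative(beta, lam, a=3.7):
    """Derivative r'(|beta_p|) of the SCAD penalty of [1]."""
    b = np.abs(beta)
    return lam * ((b <= lam) + np.maximum(a * lam - b, 0.0) / ((a - 1) * lam) * (b > lam))

def reweighted_lasso_scad(X, y, lam, n_steps=5, a=3.7):
    """Iteratively reweighted Lasso for the SCAD penalty.

    Each coefficient gets its own working penalty lambda_p^t = r'(|beta_p^{t-1}|),
    implemented by column rescaling, since sklearn's Lasso takes a single alpha.
    """
    n, p = X.shape
    beta = np.zeros(p)  # lambda_p^0 = r'(0) = lam, so the first pass is a plain Lasso
    for _ in range(n_steps):
        w = np.maximum(scad_derivative(beta, lam, a), 1e-8 * lam)  # avoid division by zero
        # Substitute gamma_p = w_p * beta_p, so sum_p w_p |beta_p| = ||gamma||_1
        # and X beta = (X / w) gamma.  sklearn minimizes
        # (1/(2n))||y - Z gamma||^2 + alpha ||gamma||_1, so alpha = 1/n recovers
        # (1/2)||y - X beta||^2 + sum_p w_p |beta_p| up to the overall factor n.
        fit = Lasso(alpha=1.0 / n, fit_intercept=False, max_iter=50_000)
        fit.fit(X / w, y)
        beta = fit.coef_ / w
    return beta
```

Note that starting from $\beta = 0$ makes the first iteration an ordinary Lasso fit; the one-step estimator of [2] instead reweights around an initial consistent estimate (e.g. OLS), and a few further iterations push the working penalty toward the full SCAD solution.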

References

[1] https://www.jstor.org/stable/3085904

[2] https://projecteuclid.org/journals/annals-of-statistics/volume-36/issue-4/One-step-sparse-estimates-in-nonconcave-penalized-likelihood-models/10.1214/009053607000000802.full

[3] https://www.tandfonline.com/doi/abs/10.1080/03610918.2017.1280830