The short answer is that it's up to you, depending on your goal. In the past I have used AIC for the lasso.
However, it sounds like you are using this model for prediction, so the misclassification rate is a reasonable criterion. Note that misclassification can be measured in several ways. Are you interested in the absolute percentage classified correctly? Or do you only care about how many of those classified as 1 (or yes, etc.) were classified correctly? I would do some reading on positive predictive values, negative predictive values, etc.:
https://en.wikipedia.org/wiki/Positive_and_negative_predictive_values
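As a small sketch of the distinction, the positive and negative predictive values can be computed directly from confusion-matrix counts (the counts below are made-up illustration values):

```python
def ppv_npv(tp, fp, tn, fn):
    """Positive predictive value: of those predicted positive, the fraction truly positive.
    Negative predictive value: of those predicted negative, the fraction truly negative."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv

# Hypothetical confusion-matrix counts
ppv, npv = ppv_npv(tp=40, fp=10, tn=45, fn=5)
print(ppv)  # 0.8 -- of 50 predicted positives, 40 were correct
print(npv)  # 0.9 -- of 50 predicted negatives, 45 were correct
```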
In addition, when doing your cross-validation there is a plethora of criteria you could use to validate your model. A short list of other common criteria:
- $R^2$
- $MSE$
- Mallows's $C_p$
- $AIC$
Look them up and see which is most relevant to you!
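To make two of these concrete, here is a minimal sketch of $MSE$ and $R^2$ computed on hypothetical predictions (the data below are made up for illustration):

```python
def mse(y, yhat):
    # mean squared error: average squared residual
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def r_squared(y, yhat):
    # R^2: 1 - (residual sum of squares) / (total sum of squares)
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Hypothetical held-out targets and predictions
y = [1.0, 2.0, 3.0, 4.0]
yhat = [1.1, 1.9, 3.2, 3.8]
print(mse(y, yhat))        # 0.025
print(r_squared(y, yhat))  # 0.98
```

In a cross-validation loop you would average such a criterion over the held-out folds and pick the tuning parameter that optimizes it.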
We can copy the situation from the question "Why is the Lasso penalty equivalent to the double exponential (Laplace) prior?"
We minimize the loss
$$L(\beta \vert X, y) = \sum_{i=1}^n (y_i-f(\beta,X_i))^2 + \lambda \sum_{i=1}^p \vert \beta_i \vert $$
which is like maximizing
$$e^{-L(\beta \vert X, y)} = e^{- \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-f(\beta,X_i))^2} \cdot e^{-\frac{\lambda}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert } $$
Apart from a scaling constant (to normalize the function), this can be seen as the posterior distribution obtained as the product of a Gaussian likelihood and a prior distribution for the coefficients proportional to
$$f_{\text{prior}} \propto e^{-\frac{\lambda}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert }$$
With the non-negative lasso the cost function remains the same, except that negative coefficients $\beta_i$ are not allowed.
So we can keep the same Laplace form of the prior and only need to change it where some $\beta_i$ is negative:
$$f_{\text{prior}} \propto \begin{cases}
e^{-\frac{\lambda}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert } & \quad \text{if $\forall i: \beta_i \geq 0$} \\
0 & \quad \text{if $\exists i: \beta_i < 0$} \\
\end{cases}$$
This is indeed just the exponential distribution as prior (for $\beta_i \geq 0$ we have $\vert \beta_i \vert = \beta_i$).
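As a numerical sanity check, here is a sketch (with made-up toy data) showing that the negative log of this truncated posterior differs from the non-negative-lasso objective only by the positive factor $1/(2\sigma^2)$, so the two are minimized by the same $\beta$:

```python
import math

def lasso_loss(beta, X, y, lam):
    # non-negative-lasso objective: RSS + lambda * sum(beta_i), with beta_i >= 0 enforced
    if any(b < 0 for b in beta):
        return math.inf  # constraint violated
    rss = sum((yi - sum(b * xij for b, xij in zip(beta, xi))) ** 2
              for xi, yi in zip(X, y))
    return rss + lam * sum(beta)  # |beta_i| = beta_i since beta_i >= 0

def neg_log_posterior(beta, X, y, lam, sigma2):
    # -log of (Gaussian likelihood * truncated exponential prior),
    # dropping beta-independent normalizing constants
    if any(b < 0 for b in beta):
        return math.inf  # prior density is zero for negative coefficients
    return lasso_loss(beta, X, y, lam) / (2 * sigma2)

# Hypothetical data: two observations, two coefficients
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
lam, sigma2 = 0.5, 1.5

b1, b2 = [0.5, 1.0], [1.0, 2.0]
d_loss = lasso_loss(b1, X, y, lam) - lasso_loss(b2, X, y, lam)
d_post = (neg_log_posterior(b1, X, y, lam, sigma2)
          - neg_log_posterior(b2, X, y, lam, sigma2))
# d_post == d_loss / (2 * sigma2): the objectives differ only by a positive scale,
# so the MAP estimate coincides with the non-negative-lasso solution
```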
Constraints on smoothness will also end up in the prior distribution. For instance, if you add some extra penalty term that is a function of $\beta$,
$$L(\beta \vert X, y) = \sum_{i=1}^n (y_i-f(\beta,X_i))^2 + \lambda_1 \sum_{i=1}^p \vert \beta_i \vert + \lambda_2 g(\beta)$$
then this will end up as an exponential term in the prior distribution
$$f_{\text{prior}} \propto e^{-\frac{\lambda_1}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert -\frac{\lambda_2}{2\sigma^2} g(\beta) }$$
In this case you may not so easily get a simple prior distribution like $$e^{-\frac{\lambda}{2\sigma^2} \sum_{i=1}^p \vert \beta_i \vert} =e^{-\frac{\lambda}{2\sigma^2} \vert \beta_1 \vert} \cdot e^{-\frac{\lambda}{2\sigma^2} \vert \beta_2 \vert} \cdot e^{-\frac{\lambda}{2\sigma^2} \vert \beta_3 \vert} \cdot \dots $$ which factorizes into independent distributions for the individual $\beta_i$.
For instance, for the fused lasso, where the differences of neighbouring coefficients $\vert \beta_{i+1} -\beta_{i} \vert$ are included in the penalty, you would get something like the following (for only two coefficients):
$$f_{\text{prior}} \propto e^{-\frac{1}{2\sigma^2} \left( \lambda_1 \vert \beta_1 \vert +\lambda_1 \vert \beta_2 \vert + \lambda_2 \vert \beta_2 - \beta_1 \vert \right) }$$
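A minimal sketch of the fused-lasso penalty that appears in the exponent above (the coefficient values and $\lambda$'s are made-up illustrations):

```python
def fused_lasso_penalty(beta, lam1, lam2):
    # lam1 * sum |beta_i|  (sparsity term)
    # + lam2 * sum |beta_{i+1} - beta_i|  (fusion/smoothness term over neighbours)
    sparsity = lam1 * sum(abs(b) for b in beta)
    fusion = lam2 * sum(abs(b2 - b1) for b1, b2 in zip(beta, beta[1:]))
    return sparsity + fusion

print(fused_lasso_penalty([1.0, 3.0], 0.5, 2.0))  # 0.5*4 + 2.0*2 = 6.0
print(fused_lasso_penalty([2.0, 2.0], 1.0, 5.0))  # equal neighbours, no fusion cost: 4.0
```

Because the fusion term couples neighbouring coefficients, the corresponding prior no longer factorizes into independent pieces for each $\beta_i$.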
The L2 penalty (ridge) penalizes the sum of squared betas directly, not via a constraint such as $\sum_j \beta_j^2 < C$. The L1 penalty gives the lasso. For the Bayesian lasso see the 2008 JASA paper by Trevor Park and George Casella.