The function cv.glmnet from the R package glmnet does automatic cross-validation on a grid of $\lambda$ values for $\ell_1$-penalized regression problems, in particular for the lasso. The glmnet package also supports the more general elastic net penalty, which is a combination of $\ell_1$ and $\ell_2$ penalization. As of version 1.7.3 of the package, setting the $\alpha$ parameter equal to 0 gives ridge regression (though this functionality was not documented until recently).
Cross-validation gives an estimate of the expected generalization error for each $\lambda$, and $\lambda$ can sensibly be chosen as the minimizer of this estimate. The cv.glmnet function returns two values of $\lambda$: lambda.min, the minimizer, and the always larger lambda.1se, a heuristic choice of $\lambda$ producing a less complex model whose estimated expected generalization error is within one standard error of the minimum. Different loss functions for measuring the generalization error are available in the glmnet package; the type.measure argument specifies which one is used.
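For concreteness, here is a small sketch of what that looks like (with simulated x and y standing in for your data):

```r
# A minimal sketch, assuming a numeric response y and predictor matrix x
# (simulated here as stand-ins for real data).
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
y <- drop(x[, 1:3] %*% c(2, -1, 0.5) + rnorm(100))

# 10-fold cross-validation over an automatic lambda grid; alpha = 1 is the lasso,
# and type.measure picks the loss used to estimate generalization error.
cvfit <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")

cvfit$lambda.min  # lambda minimizing the cross-validated error estimate
cvfit$lambda.1se  # larger lambda within one standard error of that minimum

# Coefficients at either choice of lambda
coef(cvfit, s = "lambda.min")
coef(cvfit, s = "lambda.1se")
```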
Alternatively, the R package mgcv contains extensive possibilities for estimation with quadratic penalization, including automatic selection of the penalty parameters. Implemented methods include generalized cross-validation and REML, as mentioned in a comment. More details can be found in the package author's book: Wood, S.N. (2006), Generalized Additive Models: An Introduction with R, CRC.
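A minimal sketch of what automatic penalty-parameter selection looks like in mgcv (the data below are simulated placeholders):

```r
# Smoothing (penalty) parameters are chosen automatically according to 'method'.
library(mgcv)

set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- sin(2 * pi * dat$x1) + 0.5 * dat$x2 + rnorm(200, sd = 0.3)

fit_gcv  <- gam(y ~ s(x1) + s(x2), data = dat, method = "GCV.Cp")  # generalized cross-validation
fit_reml <- gam(y ~ s(x1) + s(x2), data = dat, method = "REML")    # REML

fit_reml$sp  # the selected smoothing parameters
```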
Let's work with the lasso. Recall how a lasso regression model is fitted, given $\lambda$:
$$\min_{\beta\in\mathbb{R}^p}\left\{\frac{1}{N}\|y-X\beta\|_2^2+\lambda\|\beta\|_1\right\}$$
The first term is the mean of the squared residuals (the squared 2-norm of the residual vector divided by $N$). The second term is $\lambda$ times the 1-norm of the parameter vector (which typically does not include the intercept $\beta_0$).
There is no reason whatsoever these two components should be comparable in magnitude. Your model could fit very well, yielding small residuals, but need large parameters. Or the other way around. Plus, you may or may not first standardize your predictors, which will change the parameter estimates.
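If you want to see this for your own fit, you can compute the two terms separately; something along these lines (beta, X, y and lambda are whatever you are working with):

```r
# Not glmnet's internals, just the two terms of the objective written out.
# The intercept is assumed to be excluded from beta here.
lasso_objective_terms <- function(beta, X, y, lambda) {
  fit_term     <- mean((y - X %*% beta)^2)  # (1/N) * squared 2-norm of residuals
  penalty_term <- lambda * sum(abs(beta))   # lambda times the 1-norm of beta
  c(fit = fit_term, penalty = penalty_term, total = fit_term + penalty_term)
}
```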
This applies to the estimate for $\beta$, given $\lambda$. Now, if you optimize $\lambda$, perhaps using cross-validation, this means that a priori you cannot say anything about the likely range of $\lambda$, other than $\lambda\geq 0$.
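A quick way to convince yourself of this: rescale the response and watch the cross-validated $\lambda$ move with it (a rough illustration with simulated data):

```r
# Rough illustration: multiplying y by 1000 makes the cross-validated lambda.min
# roughly 1000 times larger, because rescaling the response rescales the whole
# problem, lambda included.
library(glmnet)

set.seed(2)
x <- matrix(rnorm(200 * 10), 200, 10)
y <- drop(x %*% rnorm(10) + rnorm(200))
foldid <- sample(rep(1:10, length.out = 200))  # same folds for both fits

cv.glmnet(x, y, foldid = foldid)$lambda.min
cv.glmnet(x, 1000 * y, foldid = foldid)$lambda.min  # roughly 1000 times larger
```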
TL;DR: you appear to have misremembered. The optimum $\lambda$ in no way needs to be in some specific interval. Therefore, getting a "surprising" value does not tell you anything about the appropriateness (or not) of your lasso model.
The same of course applies to ridge regression or the elastic net.
Best Answer
To consider this, let's look at what the lasso estimate of the coefficients is trying to minimize. Suppose $y_i$ is the outcome for observation $i=1,\ldots,n$ and that $x_{ki}$ is the value of covariate $k=1,\ldots,p$ for individual $i$. We are interested in estimating the vector of coefficients $\beta = (\beta_1, \ldots, \beta_p)$, one for each of the $p$ covariates, as well as the intercept $\beta_0$. Then the lasso estimate of $\beta$ is
$\hat{\beta}^{lasso} = \underset{\beta}{\arg\min}\left\{\underset{i=1}{\overset{n}{\sum}}\left( y_i - \beta_0 - \underset{k=1}{\overset{p}{\sum}}\beta_k x_{ki}\right)^2 + \lambda \underset{k=1}{\overset{p}{\sum}} \vert\beta_k\vert \right\}$, for some $\lambda \geq 0$.
One reason the lasso is used is that highly correlated covariates lead to unstable estimates of their corresponding $\beta$-coefficients when estimated through ordinary least squares (OLS). For instance, if $X_1$ and $X_2$ are highly correlated, then the OLS estimates of $\beta_1$ and $\beta_2$ will vary a lot between samples. This leads to an inflated mean squared error of the estimates. Now, in lasso regression, since $\lambda \geq 0$, the coefficients are shrunk towards 0 because the penalty term "punishes" estimates that are very large. This is, in essence, why the lasso can combat some of the problems of multicollinearity.
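A rough simulation sketch of that instability (an illustration, not part of the argument above): with two nearly identical predictors, the OLS estimates of $\beta_1$ and $\beta_2$ swing wildly from sample to sample, while lasso estimates at a fixed $\lambda$ are shrunk and more stable.

```r
library(glmnet)

set.seed(3)
one_rep <- function() {
  x1 <- rnorm(50)
  x2 <- x1 + rnorm(50, sd = 0.05)  # x2 is almost a copy of x1
  y  <- 1 + x1 + x2 + rnorm(50)
  ols   <- coef(lm(y ~ x1 + x2))[2:3]
  lasso <- as.matrix(coef(glmnet(cbind(x1, x2), y, lambda = 0.1)))[2:3, 1]
  c(ols = ols, lasso = lasso)
}

res <- t(replicate(200, one_rep()))
apply(res, 2, sd)  # the OLS columns vary far more across samples than the lasso columns
```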
But what happens if we force $\lambda < 0$? Well, this is equivalent to continuing to let $\lambda \geq 0$ and instead minimizing:
$\hat{\beta}^{lasso} = \underset{\beta}{\arg\min}\left\{\underset{i=1}{\overset{n}{\sum}}\left( y_i - \beta_0 - \underset{k=1}{\overset{p}{\sum}}\beta_k x_{ki}\right)^2 - \lambda \underset{k=1}{\overset{p}{\sum}} \vert\beta_k\vert \right\}$, for some $\lambda \geq 0$.
(Note the minus sign before the penalty term, where previously there was a plus.) Now we are instead encouraging the estimated coefficients to be as large as possible. My intuition is that this would be especially true for covariates that are independent of $y_i$. So by forcing $\lambda < 0$, you would get estimates of coefficients that are too far away from 0.
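To see this numerically you cannot use glmnet itself (it expects $\lambda \geq 0$), but you can hand the sign-flipped objective to a generic optimizer; a toy sketch:

```r
# Flipping the sign of the penalty rewards large coefficients, and the estimate
# for the pure-noise covariate x2 gets pushed away from 0.
set.seed(4)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)                     # unrelated to y
y  <- 1 + 2 * x1 + rnorm(n)
X  <- cbind(1, x1, x2)             # first column is the intercept

obj <- function(beta, lambda) {
  sum((y - X %*% beta)^2) + lambda * sum(abs(beta[-1]))  # intercept unpenalized
}

optim(rep(0, 3), obj, lambda = 50)$par   # usual lasso-style shrinkage toward 0
optim(rep(0, 3), obj, lambda = -50)$par  # "negative lambda": estimates pushed away from 0
```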