Keep in mind that ridge regression can't zero out coefficients, so the fitted model always retains every covariate, merely shrunk. In contrast, the LASSO does both parameter shrinkage and variable selection automatically. If some of your covariates are highly correlated, you may want to look at the Elastic Net [3] instead of the LASSO.
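To see the difference concretely, here's a small scikit-learn sketch (the simulated data and penalty strengths are arbitrary choices of mine): ridge leaves every coefficient non-zero, while the LASSO sets many of the noise coefficients exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]              # only 3 truly active covariates
y = X @ beta + rng.standard_normal(n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))  # ridge: shrunk, never exactly 0
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))  # lasso: many exact zeros
```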
I'd personally recommend using the Non-Negative Garrote (NNG) [1], as it is consistent in terms of both estimation and variable selection [2]. Unlike LASSO and ridge regression, the NNG requires an initial estimate that is then shrunk towards the origin. In the original paper, Breiman recommends the least-squares solution for the initial estimate (you may, however, want to start the search from a ridge regression solution and use something like GCV to select the penalty parameter).
In terms of available software, I've implemented the original NNG in MATLAB (based on Breiman's original FORTRAN code). You can download it from:
http://www.emakalic.org/blog/wp-content/uploads/2010/04/nngarotte.zip
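If you'd rather stay in Python, the garrote is easy to sketch with NumPy/SciPy. The following is an illustrative implementation under my own assumptions (the function name, the L-BFGS-B solver, and the Lagrangian form of the penalty are my choices), not a port of Breiman's FORTRAN:

```python
import numpy as np
from scipy.optimize import minimize

def nn_garrote(X, y, lam, beta_init=None):
    """Non-negative garrote sketch: find shrinkage factors c_j >= 0
    minimizing 0.5*||y - sum_j c_j * beta_init_j * x_j||^2 + lam * sum_j c_j,
    then return beta_j = c_j * beta_init_j."""
    n, p = X.shape
    if beta_init is None:
        # Breiman's recommended starting point: the least-squares solution
        beta_init, *_ = np.linalg.lstsq(X, y, rcond=None)
    Z = X * beta_init                     # column j is beta_init_j * x_j

    def objective(c):
        r = y - Z @ c
        return 0.5 * r @ r + lam * c.sum()

    res = minimize(objective, np.ones(p), method="L-BFGS-B",
                   bounds=[(0.0, None)] * p)
    c = res.x
    return c * beta_init, c
```

Passing a ridge solution as `beta_init` gives the ridge-initialized variant mentioned above; `lam` would then be selected by GCV or cross-validation.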
BTW, if you prefer a Bayesian solution, check out [4,5].
References:
[1] Breiman, L. (1995). Better Subset Regression Using the Nonnegative Garrote. Technometrics, 37, 373-384.
[2] Yuan, M. & Lin, Y. (2007). On the non-negative garrotte estimator. Journal of the Royal Statistical Society (Series B), 69, 143-161.
[3] Zou, H. & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society (Series B), 67, 301-320.
[4] Park, T. & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103, 681-686.
[5] Kyung, M., Gill, J., Ghosh, M. & Casella, G. (2010). Penalized Regression, Standard Errors, and Bayesian Lassos. Bayesian Analysis, 5, 369-412.
If you rank 1 million ridge-shrunk, scaled, but non-zero features, you still have to make some kind of decision: you will look at the n best predictors, but what should n be? The LASSO solves this problem in a principled, objective way, because at every step on the regularization path (and in practice you'd settle on one point via, e.g., cross-validation) only some number m of coefficients are non-zero, so the cutoff is chosen for you.
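As a sketch (simulated data and scikit-learn's `lasso_path`; all sizes are arbitrary), you can watch the number of non-zero coefficients grow as the penalty decreases along the path:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0                           # 5 true predictors, 45 noise features
y = X @ beta + rng.standard_normal(n)

alphas, coefs, _ = lasso_path(X, y, n_alphas=20)  # alphas returned decreasing
# non-zero count at each point on the path: this is the "m" chosen for you
nnz = [int((coefs[:, k] != 0.0).sum()) for k in range(coefs.shape[1])]
```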
Very often, you will train a model on some data and then later apply it to data not yet collected. For example, you could fit your model on 50,000,000 emails and then use that model on every new email. True, you will fit it on the full feature set for the first 50,000,000 mails, but for every following email you will deal with a much sparser, faster, and more memory-efficient model. You also won't even need to collect the information for the dropped features, which may be hugely helpful if the features are expensive to extract, e.g. via genotyping.
Another perspective on the L1/L2 question, advanced by e.g. Andrew Gelman, is that you often have some intuition about what your problem may be like. In some circumstances, it is plausible that reality is truly sparse. Maybe you have measured millions of genes, but only 30,000 of them actually determine dopamine metabolism. In such a situation, L1 arguably fits the problem better.
In other cases, reality may be dense. For example, in psychology, "everything correlates (to some degree) with everything" (Paul Meehl). A preference for apples over oranges probably correlates somehow with political leanings, and even with IQ. Regularization might still make sense here, but true zero effects should be rare, so L2 might be more appropriate.
Best Answer
A simple way to do this is to subtract the "centering value" of the coefficient times its associated variable from the left-hand side. To go with your example,
$Y = \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + e$
Assume the coefficient values should be centered at $(5, 1, -1, -5)$ respectively. Then:
$Y - 5X_1 -X_2 +X_3 +5X_4 = (\beta_1-5)X_1 + (\beta_2-1)X_2 + (\beta_3+1)X_3 + (\beta_4+5)X_4 + e$
and, redefining terms, you have:
$Y^* = \beta_1^*X_1 + \beta_2^*X_2 + \beta_3^*X_3 + \beta_4^*X_4 + e$
A standard ridge regression would shrink the $\beta_i^*$ towards 0, which is equivalent to shrinking the original $\beta_i$ towards the specified centering values. To see this, consider a fully-shrunk $\beta_4^* = 0$: then $\beta_4+5 = 0$ and therefore $\beta_4 = -5$. Shrinkage accomplished!
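The trick is easy to verify numerically. A sketch with scikit-learn (the centering values and the deliberately huge penalty are my choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 4))
beta_true = np.array([5.2, 1.1, -0.9, -4.8])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

center = np.array([5.0, 1.0, -1.0, -5.0])  # the desired centering values
y_star = y - X @ center                    # subtract center_j * X_j from Y
ridge = Ridge(alpha=1e4, fit_intercept=False).fit(X, y_star)
beta_hat = ridge.coef_ + center            # undo the redefinition

# with a huge penalty, beta_hat is pulled to `center`, not to the origin
```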