This is a rather broad question.
First, I do not think ridge regression shrinks coefficients to 0. It does not create sparsity, so if you want to do feature selection it will be useless. You should consider the lasso instead, or the elastic net (which is a mix of ridge and lasso, since both an L1 and an L2 penalty are added to the minimisation problem).
If your goal is really to select variables, have a look at stability selection from Meinshausen and Bühlmann. The idea is to draw many subsamples of the data and run a lasso regression on each. It exploits the fact that the lasso has a homotopy solution (each coefficient has a piecewise-linear solution path in the penalty): starting with a very high penalty and decreasing it step by step, the coefficients become non-zero one by one. Repeating this over many resamples gives you, for each penalty value, an estimate of the probability that each coefficient is non-zero (i.e. that the variable is selected).
This is a good method when you have many variables, because the lasso can be seen as a convex relaxation of best-subset selection and is therefore usually much faster.
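A minimal sketch of that idea in R with the glmnet package (the toy data, the 100 resamples, the half-sized subsamples and the 0.8 threshold are illustrative assumptions, not prescriptions from the Meinshausen and Bühlmann paper):

```
library(glmnet)

# Toy data: 200 observations, 50 candidate predictors, only the first 5 relevant
set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:5] %*% rep(2, 5) + rnorm(n))

# Fit once on the full data to get a common penalty grid,
# so selection frequencies are comparable across resamples
lambda_grid <- glmnet(X, y, alpha = 1)$lambda

B <- 100                                     # number of subsamples
sel_freq <- matrix(0, p, length(lambda_grid))
for (b in 1:B) {
  idx <- sample(n, floor(n / 2))             # subsample half of the data
  fit <- glmnet(X[idx, ], y[idx], alpha = 1, lambda = lambda_grid)
  sel_freq <- sel_freq + (as.matrix(fit$beta) != 0)
}
sel_freq <- sel_freq / B                     # selection probability per variable and penalty

# Keep variables whose maximum selection probability along the path is high
stable_vars <- which(apply(sel_freq, 1, max) >= 0.8)
stable_vars
```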
Dimension reduction (PCA, for example) is not necessarily designed to improve predictive accuracy, because it is usually unsupervised: the components are chosen to explain the variance of the inputs, not their relationship with the target. See http://metaoptimize.com/qa/questions/9338/how-to-use-pca-for-classification for a more detailed discussion of that subject.
Feature selection sometimes improves the performance of regularized models, but in my experience it generally makes generalization performance worse. The reason for this is that the more choices we make regarding our model (including the values of the parameters, the choice of features, the setting of hyper-parameters, the choice of kernel...), the more data we need to make these choices reliably. Generally we make these choices by minimizing some criterion evaluated over a finite set of data, which means that the criterion inevitably has a non-zero variance. As a result, if we minimize the criterion too aggressively, we can over-fit it, i.e. we can make choices that minimize the criterion because of features that depend on the particular sample on which it is evaluated, rather than because they will produce a genuine improvement in performance. I call this "over-fitting in model selection" to differentiate it from the more familiar form of over-fitting resulting from tuning the model parameters.
Now the SVM is an approximate implementation of a bound on generalization performance that does not depend on the dimensionality, so in principle, we can expect good performance without feature selection, provided the regularization parameters are correctly chosen. Most feature selection methods have no such performance "guarantees".
For L1 methods, I certainly wouldn't bother with feature selection, as the L1 criterion is generally effective in trimming features. The reason that it is effective is that it induces an ordering in which features enter and leave the model, which reduces the number of available choices in selecting features, and hence is less prone to over-fitting.
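A quick way to see that ordering is to plot the whole coefficient path; a small sketch with glmnet in R (the simulated data are purely illustrative):

```
library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 20), 100, 20)
y <- drop(X[, 1:3] %*% c(3, 2, 1) + rnorm(100))

fit <- glmnet(X, y, alpha = 1)            # pure lasso (L1) penalty
plot(fit, xvar = "lambda", label = TRUE)  # coefficients enter one by one as lambda decreases
```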
The best reason for feature selection is to find out which features are relevant/important. The worst reason for feature selection is to improve performance: for regularised models it generally makes things worse. However, for some datasets it can make a big difference, so the best thing to do is to try it and use a robust, unbiased performance evaluation scheme (e.g. nested cross-validation) to find out whether yours is one of those datasets.
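A rough sketch of such a scheme in R, with cv.glmnet as the inner loop that tunes $\lambda$ and an outer loop of folds that estimates performance (the fold counts and simulated data are illustrative assumptions):

```
library(glmnet)

set.seed(1)
n <- 200; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:4] %*% rep(1.5, 4) + rnorm(n))

K <- 5                                    # outer folds
fold <- sample(rep(1:K, length.out = n))
mse <- numeric(K)

for (k in 1:K) {
  train <- fold != k
  # Inner loop: cv.glmnet tunes lambda using only the training fold
  cvfit <- cv.glmnet(X[train, ], y[train], alpha = 1)
  pred  <- predict(cvfit, newx = X[!train, ], s = "lambda.min")
  mse[k] <- mean((y[!train] - pred)^2)
}

mean(mse)   # outer-loop performance estimate, not biased by the inner lambda tuning
```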
Best Answer
Yes, there is an alternative that combines the ridge and LASSO penalties, called the elastic net. It minimizes the loss function:
$$ L = \sum_i (y_i - \hat y_i)^2 + \lambda \left( \alpha \sum_j | \beta_j | + (1 - \alpha) \frac{1}{2} \sum_j \beta_j^2 \right) $$
Here, $\lambda$ controls the overall regularization strength, and $\alpha$ is a number between zero and one (inclusive) that adjusts the relative strengths of the ridge vs. LASSO penalization.
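For example, with the glmnet package in R the $\alpha$ mixing parameter is passed directly and a whole grid of $\lambda$ values is fitted in one call (the data below are made up just to show the interface):

```
library(glmnet)

set.seed(1)
X <- matrix(rnorm(150 * 25), 150, 25)
y <- drop(X[, 1:5] %*% rep(1, 5) + rnorm(150))

# alpha = 1 is the LASSO, alpha = 0 is ridge, values in between give the elastic net
fit <- glmnet(X, y, alpha = 0.5)

# Pick lambda by cross-validation and inspect the coefficients it selects
cvfit <- cv.glmnet(X, y, alpha = 0.5)
coef(cvfit, s = "lambda.min")   # many coefficients are exactly zero
```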
I do not know of a situation where best-subset selection would be the appropriate choice, other than one arising from software or computing-environment constraints.
Yes. The elastic net, defined above for linear models, is generalized by glmnet to any generalized linear model structure. For example, in logistic regression the glmnet loss function would be:
$$ L = -\sum_i \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] + \lambda \left( \alpha \sum_j | \beta_j | + (1 - \alpha) \frac{1}{2} \sum_j \beta_j^2 \right) $$
These models are available in the glmnet package in R; the package documentation explains how to use it and how it works (a brief usage sketch of the logistic case is given at the end of this answer). There are other options as well: multilevel models can be seen as another way to apply regularization, an approach that is covered well in the book by Gelman and Hill.
There are also a multitude of Bayesian approaches, where the choice of prior can be thought of as a regularization strategy.
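To make the glmnet case above concrete, here is a minimal sketch of the penalized logistic fit (toy data; family = "binomial" selects the logistic loss, and the variable names are illustrative):

```
library(glmnet)

set.seed(1)
X <- matrix(rnorm(200 * 20), 200, 20)
prob <- plogis(X[, 1] - X[, 2])           # true model uses only two of the predictors
y <- rbinom(200, 1, prob)

# Elastic-net penalized logistic regression, lambda chosen by cross-validation
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)
coef(cvfit, s = "lambda.min")                                          # sparse coefficient vector
predict(cvfit, newx = X[1:5, ], s = "lambda.min", type = "response")   # fitted probabilities
```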