You may want to have a look at this Wikipedia article. It has a nice overview of the most popular algorithms, although some more recent developments, like tSNE, are missing.
In feature extraction the fundamental idea is to look for an alternative representation in which the underlying structure of the data is more apparent. This is done by minimizing some error or energy functional, whose minimizer yields the mapping.
Some approaches, like PCA, CCA, locally linear embedding (LLE), local linear projections (LLP), and others, are attractive because you end up solving a linear problem, for which there are efficient numerical methods. The nice thing about many of them (like LLE) is that they can capture non-linear structure while you still only solve a linear system.
The idea is that you introduce a matrix whose elements encode relationships (some distance) between samples. You then project the original data, according to that matrix, into a lower-dimensional space in such a way that the distortion is minimal. In this lower-dimensional space (usually two dimensions, so you can visualize it on your screen) the different patterns in your data become more evident. Usually, to describe those relationships between samples, only the k closest samples to each data point are considered (which is a non-linear relationship).
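As a concrete sketch of that neighbourhood-based idea (assuming scikit-learn; the swiss-roll data and the number of neighbours are illustrative choices, not part of the original answer):

```python
# Minimal sketch: k-nearest-neighbour based embedding with LLE (scikit-learn).
# The data set and the choice of n_neighbors are illustrative assumptions.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# Each sample is related only to its k closest neighbours; the embedding is
# then obtained by solving a (sparse) linear eigenproblem.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_2d = lle.fit_transform(X)  # shape (1000, 2), ready for a scatter plot
```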
Still, non-linear cases like tSNE and others can also be solved efficiently by means of gradient-based optimization algorithms.
Which method might be better depends on your data and the amount of data you have (computational cost). I am not aware of any objective criterion that would let you decide beforehand which one to use, so just try them out (unless you know your data follows some trivial linear pattern). For tSNE, LLE, and other methods there are a number of implementations, often in Matlab, but also in other languages and packages.
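For example, a minimal sketch using the scikit-learn implementation of tSNE (the digits data and the perplexity value are illustrative assumptions; in practice perplexity should be tuned to the size of your data set):

```python
# Sketch: t-SNE embedding via gradient-based optimisation (scikit-learn).
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)           # 1797 samples, 64 features
X_2d = TSNE(n_components=2, perplexity=30.0,
            random_state=0).fit_transform(X)  # 2-D coordinates for plotting
```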
(1) It's not about the criterion: backward elimination / forward selection are greedy algorithms which don't search the whole set of models. So e.g. forward selection will stop when no predictor can be added that improves the criterion, but won't check whether removing a predictor that entered earlier, before adding another, would improve it.
(2) All-possible-subsets selection will find the "best" model according to whatever criterion you set, but using that model to make predictions on new data often reveals a big drop in performance. The wider your search for a best-fitting model, the more you capitalize on chance fluctuations in whichever criterion you use & the more optimistic your assessment of that model's performance. (So stepwise methods can sometimes work better just because they restrict the search space.) See here for an excellent exposition of the problem.
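To make the greediness in (1) concrete, here is a small sketch (assuming scikit-learn; the data set, estimator, and number of features to select are placeholders, not part of the original answer):

```python
# Sketch of greedy forward selection: at each step the single best predictor
# is added, and earlier choices are never revisited, so the search is not
# exhaustive.  Data set and estimator are illustrative assumptions.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # mask of the 5 greedily chosen predictors
```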
This is a rather broad question.
First, ridge regression shrinks coefficients toward 0 but not exactly to 0. It does not create sparsity, so if you want to do feature selection it will be of little use. You should consider the lasso instead, or the elastic net (which is a mix of ridge and lasso, since both an L1 and an L2 penalty are added to the minimisation problem).
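A quick sketch of that difference (assuming scikit-learn; the simulated data and penalty strengths are illustrative assumptions):

```python
# Ridge rarely sets coefficients exactly to zero, whereas the lasso (L1) and
# the elastic net (L1 + L2) produce sparse fits.  Illustrative data/penalties.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=50,
                       n_informative=5, noise=1.0, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=1.0),
              ElasticNet(alpha=1.0, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, "coefficients exactly zero:",
          int(np.sum(model.coef_ == 0)))
```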
If your goal is really to select variables, have a look at stability selection by Meinshausen and Bühlmann. The idea is to combine resampling (bootstrap or subsampling) with lasso regression. It uses the fact that the lasso solution is a homotopy in the penalty (each coefficient has a piecewise-linear solution path): starting from a very high penalty and decreasing it step by step, the coefficients become non-zero one by one. If you repeat this over many resamples, you obtain, for each penalty value, the probability that each coefficient is non-zero (i.e. that the variable is selected).
This is a good option when you have a lot of variables, because the lasso can be seen as a convex relaxation of best-subset selection and is therefore usually much faster.
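A rough sketch of the procedure, to fix ideas (assuming scikit-learn; the penalty grid, number of resamples, and 0.8 selection threshold are illustrative assumptions, not Meinshausen and Bühlmann's exact recipe):

```python
# Rough stability-selection sketch: refit the lasso on many half-size
# subsamples over a grid of penalties and record how often each coefficient
# is non-zero; keep variables that are selected in a large fraction of fits.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30,
                       n_informative=5, noise=1.0, random_state=0)
alphas = np.logspace(1, -2, 20)        # from a very high penalty downwards
n_resamples, rng = 100, np.random.default_rng(0)

freq = np.zeros((len(alphas), X.shape[1]))   # selection frequency per alpha
for _ in range(n_resamples):
    idx = rng.choice(len(y), size=len(y) // 2, replace=False)
    for i, a in enumerate(alphas):
        coef = Lasso(alpha=a, max_iter=5000).fit(X[idx], y[idx]).coef_
        freq[i] += (coef != 0)
freq /= n_resamples

stable = freq.max(axis=0) >= 0.8       # selected in at least 80% of the fits
print("stable variables:", np.flatnonzero(stable))
```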
Dimension reduction (PCA, for example) is not necessarily designed to improve predictive accuracy, because it is often unsupervised. See http://metaoptimize.com/qa/questions/9338/how-to-use-pca-for-classification for a more detailed discussion of that point.
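If you want to check this on your own data, one way is to cross-validate a model with and without the PCA step (a sketch assuming scikit-learn; the digits data, the classifier, and the number of components are illustrative assumptions):

```python
# PCA keeps directions of high variance, which need not be the most
# discriminative ones, so accuracy can go either way after the reduction.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
raw = make_pipeline(LogisticRegression(max_iter=5000))
pca = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=5000))
print("raw features :", cross_val_score(raw, X, y, cv=5).mean())
print("10 PCA comps :", cross_val_score(pca, X, y, cv=5).mean())
```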