Unless identification of the most relevant variables is a key aim of the analysis, it is often better not to do any feature selection at all and to use regularisation to prevent over-fitting instead. Feature selection is a tricky procedure, and it is all too easy to over-fit the feature selection criterion because there are so many degrees of freedom. LASSO and elastic net are a good compromise: they achieve sparsity via regularisation rather than via direct feature selection, so they are less prone to that particular form of over-fitting.
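As a minimal sketch of what "sparsity via regularisation" looks like in practice, here is scikit-learn's LASSO and elastic net with cross-validated penalties; the synthetic dataset and the hyperparameter values are purely illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, ElasticNetCV

# Illustrative data: 50 features, only 5 of which actually carry signal.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# LassoCV picks the regularisation strength by cross-validation;
# the L1 penalty drives most coefficients exactly to zero.
lasso = LassoCV(cv=5).fit(X, y)
print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))

# Elastic net mixes L1 and L2 penalties (l1_ratio controls the mix),
# which tends to be more stable when features are correlated.
enet = ElasticNetCV(l1_ratio=[0.5, 0.9, 1.0], cv=5).fit(X, y)
print("Elastic net non-zero coefficients:", np.sum(enet.coef_ != 0))
```

No explicit feature selection step is performed here; the sparsity of the fitted coefficients falls out of the penalised objective itself.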
You may want to have a look at this Wikipedia article. It has a nice overview of the most popular algorithms, although some more recent developments, like tSNE, are missing.
In feature extraction the fundamental idea is to look for an alternative representation in which the underlying structure of the data is more apparent. This is done by minimizing some error or energy functional whose minimiser yields the mapping.
Some approaches, like PCA, CCA, locally linear embedding (LLE), local linear projections (LLP), and others, are attractive because you end up solving a linear problem, for which efficient numerical methods exist. The nice thing about many of them (like LLE) is that they can capture non-linear mappings while you still only solve a linear system.
The idea is that you introduce a matrix whose elements encode relationships between samples (some distance). You then find projections of the original data, according to that matrix, into a lower-dimensional space in such a way that the distortion is minimal. In this lower-dimensional space (usually two dimensions, so you can visualize it on your screen) the different patterns in your data are more evident. Usually, to describe those relationships between samples, only the k closest samples to each data point are considered (which is a non-linear relationship).
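To make this concrete, here is a hedged sketch of such a neighbourhood-based embedding using scikit-learn's LLE implementation; the swiss-roll dataset and the value of n_neighbors are illustrative assumptions.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# Illustrative 3-D manifold data.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Only the k closest samples to each point define the local relationships;
# the 2-D embedding itself comes from solving a sparse linear eigenproblem.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_2d = lle.fit_transform(X)
print(X_2d.shape)  # (1000, 2): a 2-D map suitable for visualisation
```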
Still, non-linear cases, like tSNE and others, can also be solved efficiently by means of gradient-based optimization algorithms.
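For comparison, a minimal sketch of tSNE, which is fitted by exactly that kind of gradient-based optimisation; the digits dataset and the perplexity value are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# perplexity is the main knob; the embedding is found by gradient descent
# on the KL divergence between high- and low-dimensional similarities.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)  # (1797, 2)
```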
Which method is better depends on your data and on how much of it you have (computational cost). I am not aware of any objective criterion that lets you decide beforehand which one to use, so just try them out (unless you already know your data follows some trivial linear pattern). For tSNE, LLE, and other methods there are a number of implementations, often in Matlab, but also in other languages and packages.
Best Answer
I think you are already following a "best practice" approach to feature selection. Using a regularised regression approach like LASSO, and complementing those insights with a distribution-free model like Random Forest to decide the most important features, is probably the best way to go.
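A hedged sketch of that combination is below: an L1-penalised model picks a sparse subset, and a random forest ranks features by importance; the dataset and all hyperparameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

# L1-regularised logistic regression: features with non-zero coefficients survive.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
lasso_selected = np.flatnonzero(lasso.coef_.ravel())

# Random forest: rank features by impurity-based importance.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
rf_top = np.argsort(rf.feature_importances_)[::-1][:10]

# Compare the two views of which features matter.
print("LASSO-selected features:", sorted(lasso_selected))
print("RF top-10 features:     ", sorted(rf_top))
```

Features flagged by both views are usually the safest candidates to keep.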
Some minor suggestions: I would propose using Elastic Net to include a small amount of $L_2$ regularisation. This should make our feature selection a bit more stable in the presence of correlated features. Similarly, taking the slightly more sophisticated approach of using Random Forests within a full Recursive Feature Elimination framework like Boruta (see Nilsson et al. for background, CRAN link) instead of relying on plain Random Forest variable importance will probably be beneficial; a rough sketch of the recursive-elimination idea follows.
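Boruta itself lives in R (the CRAN package referenced above). As a hedged Python stand-in for that kind of recursive, forest-based elimination, here is a sketch using scikit-learn's RFECV around a random forest; it is not the Boruta algorithm, and the data and settings are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

X, y = make_regression(n_samples=300, n_features=40, n_informative=6,
                       noise=5.0, random_state=0)

# Recursively drop the least important features (two per round), using
# cross-validated forest performance to decide how many to keep.
selector = RFECV(
    RandomForestRegressor(n_estimators=200, random_state=0),
    step=2,
    cv=5,
)
selector.fit(X, y)
print("Number of features kept:", selector.n_features_)
print("Selected feature mask:", selector.support_)
```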
Having said the above, we should use such feature selection approaches only if we cannot work with our original full dataset and/or we have problems collecting the features in question in the future (e.g. they are too costly). Using a modelling approach that can actively regularise the resulting model (e.g. gradient boosting machines, where we can regularise the fit by properly picking the learning rate, tree depth, minimum number of children per leaf node, etc.) is the best way to go. That way we know we are not reusing our data, and we are not losing valuable information that might be missed during a feature selection step.
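A minimal sketch of regularising the fit directly in a gradient boosting machine, rather than pre-selecting features; the dataset and the specific hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           random_state=0)

# learning_rate, max_depth and min_samples_leaf correspond to the
# "learning rate, tree depth, minimum children per leaf" knobs mentioned above;
# subsample adds stochastic regularisation on top.
gbm = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    min_samples_leaf=20,
    subsample=0.8,
    random_state=0,
)
print("CV accuracy:", cross_val_score(gbm, X, y, cv=5).mean())
```

Tuning these knobs with cross-validation plays the role that a separate feature selection step would otherwise have to play.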
An issue not touched upon is performing data reduction using a dimensionality reduction technique like PCA, ICA, NNMF, etc. These techniques do not "select features" per se but rather "combine features" to create meta-features of variable informational value. They can be very useful if we need a small set of "information-rich" features. Nevertheless, these "information-rich" features are not guaranteed to include more, less, or any information relevant to our modelling task, so they are not a silver bullet for feature selection. They usually provide a convenient and condensed representation of our original data when we cannot work with it in raw form.
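A hedged sketch of what such "combined features" look like with PCA and NMF; the digits dataset and the component counts are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, NMF

X, _ = load_digits(return_X_y=True)

# PCA: orthogonal linear combinations of the original features,
# ordered by how much variance each one explains.
pca = PCA(n_components=10).fit(X)
print("Variance captured by 10 components:", pca.explained_variance_ratio_.sum())

# NMF: non-negative, parts-based combinations (the pixel data is non-negative).
nmf = NMF(n_components=10, init="nndsvda", max_iter=500).fit(X)
X_meta = nmf.transform(X)
print("Meta-feature matrix shape:", X_meta.shape)
```

Each column of the transformed matrix is a meta-feature, i.e. a weighted combination of the original inputs rather than a selected subset of them.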