Feature Selection vs Feature Extraction – Understanding the Differences

dimensionality-reduction, feature-selection

As I understand it, in dimensionality reduction, feature selection chooses a subset of the available variables, while feature extraction transforms the available variables into a lower-dimensional representation. How exactly does the transformation work? Is it something like an interaction term of two or more variables?

Could anyone please explain whether one technique is preferred over the other, or does it depend on the data set?

Also, is one preferred over the other for linear vs. non-linear dimensionality reduction?

Any help is much appreciated.

Best Answer

You may want to have a look at this Wikipedia article. It has a nice overview of the most popular algorithms, although some more recent developments, like tSNE, are missing.

In feature extraction, the fundamental idea is to look for an alternative representation in which the underlying structure of the data is more apparent. This is done by minimizing some error or energy functional, which yields the mapping.
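To make "transforming the variables" concrete: in PCA, for instance, each extracted feature is a weighted linear combination of all the original variables, not an interaction term. A minimal sketch with NumPy on synthetic data, contrasting extraction with selection:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 samples, 5 original features

# Feature extraction (PCA): centre the data, find principal
# directions via SVD, and project onto the top two of them.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T               # shape (100, 2); each new feature
                                # mixes ALL 5 original variables

# Feature selection, by contrast, just keeps a subset of columns;
# the chosen variables themselves are left untouched.
Z_sel = X[:, [0, 3]]            # shape (100, 2)
```

The rows of `Vt[:2]` are the weights of those linear combinations; inspecting them tells you how much each original variable contributes to each extracted feature.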

Some approaches, like PCA, CCA, locally linear embedding (LLE), local linear projections (LLP), and others, are attractive because you end up solving a linear problem, for which efficient numerical methods exist. The nice thing about many of them (like LLE) is that they can capture non-linear mappings while you still only solve a linear system.

The idea is that you introduce a matrix whose elements encode relationships between samples (some notion of distance). You then project the original data, according to that matrix, into a lower-dimensional space in such a way that the distortion is minimal. In this lower-dimensional space (usually two dimensions, so you can visualize it on your screen), the different patterns in your data are more evident. Usually, to describe those relationships between samples, only the k closest samples to each data point are considered (which is a non-linear relationship).
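A hedged sketch of that neighbourhood step, on synthetic data with plain NumPy: for each point, keep only its k closest samples, which is exactly the non-linear part of methods like LLE (the full embedding step is omitted here):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))    # 20 samples in 3 dimensions
k = 4

# Pairwise squared Euclidean distances between all samples
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# Exclude each point from its own neighbourhood, then take
# the indices of its k nearest neighbours
np.fill_diagonal(D, np.inf)
neighbors = np.argsort(D, axis=1)[:, :k]   # shape (20, 4)
```

Methods in this family then build a sparse weight matrix supported only on these neighbour pairs and solve an eigenproblem with it.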

Still, non-linear cases like tSNE and others can also be solved efficiently by means of gradient-based optimization algorithms.
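As an illustration of the gradient-based idea (this is a simple distance-preserving "stress" objective, not an actual tSNE implementation): initialize low-dimensional coordinates at random and run plain gradient descent so that pairwise distances in 2-D approach those in the original space:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))        # original high-dimensional data

def pdist(Z):
    """Matrix of pairwise Euclidean distances."""
    return np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))

D = pdist(X)                        # distances we want to preserve
Y = rng.normal(size=(30, 2))        # random 2-D embedding to optimize

def stress(Y):
    # Sum of squared mismatches between low- and high-dim distances
    return ((pdist(Y) - D) ** 2).sum() / 2.0

s_initial = stress(Y)
lr = 0.002
for _ in range(300):
    d = pdist(Y)
    np.fill_diagonal(d, 1.0)        # avoid division by zero; the
                                    # diagonal contributes nothing anyway
    diff = Y[:, None, :] - Y[None, :, :]
    grad = (2.0 * (d - D) / d)[:, :, None] * diff
    Y = Y - lr * grad.sum(axis=1)   # gradient-descent step
s_final = stress(Y)                 # lower stress: embedding fits better
```

Real methods like tSNE optimize a different (probabilistic, heavy-tailed) objective, but the mechanics are the same: a differentiable cost over the low-dimensional coordinates, driven down by its gradient.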

Which method is better depends on your data and on how much of it you have (computational cost). I am not aware of any objective criterion that lets you decide beforehand which one to use, so just try them out (unless you already know your data follows some trivial linear pattern). For tSNE, LLE, and other methods there are a number of implementations, often in Matlab, but also in other languages and packages.