Solved – How to know when to use linear dimensionality reduction vs non-linear dimensionality reduction

dimensionality-reduction, manifold-learning

I am trying to decide whether to use linear dimensionality reduction methods (e.g. PCA) or non-linear dimensionality reduction methods (e.g. t-SNE) for my high-dimensional data set. However, I know nothing about the underlying structure of the data. Is there a test to tell whether it is better to use one or the other?

Best Answer

One approach is to learn more about the structure of the data. Dimensionality reduction supposes that the data are distributed near a low dimensional manifold. If this is the case, one might choose PCA if the manifold is (approximately) linear, and nonlinear dimensionality reduction (NLDR) if the manifold is nonlinear. So, some questions to address: are the data low dimensional and, if so, are they distributed near a nonlinear manifold?

Estimating dimensionality and checking for nonlinearity

A good first step is to perform PCA and examine the variance along each component, or the fraction of variance explained ($R^2$) as a function of the number of components. Typically, one looks for an elbow in the plot, or the number of components needed to explain some fixed fraction of the variance (often 95%). Alternatively, there are more principled procedures for choosing the number of components. This gives an estimate of the dimensionality of the linear subspace that the data approximately occupy.
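For concreteness, here is a minimal sketch of that first step (assuming Python with NumPy and scikit-learn; the random `X` below is just a placeholder for your own data matrix):

```python
# Inspect PCA explained variance to estimate the dimensionality of the
# linear subspace the data approximately occupy.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))             # stand-in for your (n_samples, n_features) data

X_std = StandardScaler().fit_transform(X)  # scale features before PCA
pca = PCA().fit(X_std)

cum_var = np.cumsum(pca.explained_variance_ratio_)
n_95 = int(np.searchsorted(cum_var, 0.95)) + 1   # components needed for 95% of the variance
print("variance explained per component:", np.round(pca.explained_variance_ratio_, 3))
print("components needed for 95% of variance:", n_95)
```

Plotting `cum_var` against the number of components is the usual way to look for an elbow.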

If the data are distributed near a low dimensional nonlinear manifold, then the intrinsic dimensionality will be less than the dimensionality of this linear subspace. For example, imagine a curved 2d sheet embedded in 3 dimensions (e.g. the classic swiss roll manifold). Now map it linearly into 5 dimensions. The extrinsic dimensionality would be 5, and the data would occupy a 3 dimensional linear subspace. However, the intrinsic dimensionality would be 2. The dimensionality of the linear subspace is higher than the intrinsic dimensionality because the manifold is curved.
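A minimal sketch of this example (again assuming Python with NumPy and scikit-learn; the random linear map is just one way to embed the roll in 5 dimensions):

```python
# A 2d manifold (swiss roll) embedded in 3 dimensions, then mapped linearly
# into 5 dimensions. PCA recovers a 3-dimensional linear subspace, not the
# intrinsic dimensionality of 2.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X3, _ = make_swiss_roll(n_samples=2000, random_state=0)  # shape (2000, 3)

A = rng.normal(size=(3, 5))      # random linear map into 5 dimensions
X5 = X3 @ A                      # extrinsic dimensionality is now 5

pca = PCA().fit(X5)
print(np.round(pca.explained_variance_ratio_, 4))
# Essentially all variance lies in the first 3 components; the last 2 are ~0.
```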

Many intrinsic dimensionality estimators have been described in the literature. Using one of these methods, one can estimate the intrinsic dimensionality of the data and compare this to the dimensionality of the linear subspace (estimated using PCA). If the intrinsic dimensionality is less, this suggests the manifold could be nonlinear. Keep in mind that we're working with estimates that may be subject to error, so this is somewhat of a heuristic procedure.
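As one illustration (the choice of estimator is mine, not prescribed; many alternatives exist, such as maximum likelihood or correlation dimension estimators), here is a sketch of the two-nearest-neighbour (TwoNN) estimator of Facco et al. applied to the same linearly embedded swiss roll:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import NearestNeighbors

# same linearly embedded swiss roll as in the previous sketch
rng = np.random.default_rng(0)
X3, _ = make_swiss_roll(n_samples=2000, random_state=0)
X5 = X3 @ rng.normal(size=(3, 5))

def twonn_dimension(X):
    """TwoNN intrinsic dimension estimate (Facco et al.)."""
    # distances to the two nearest neighbours of each point (column 0 is the point itself)
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dists[:, 2] / dists[:, 1]        # ratio of 2nd to 1st neighbour distance
    return len(mu) / np.sum(np.log(mu))   # maximum likelihood estimate of the dimension

print("intrinsic dimension estimate:", round(twonn_dimension(X5), 2))
# Expect a value near 2, below the 3-dimensional linear subspace found by PCA,
# which suggests the manifold is nonlinear.
```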

Fortunately, intrinsic dimensionality estimators are often independent of any particular NLDR algorithm. This is nice because there are dozens of NLDR algorithms to choose from, and they each operate under different assumptions and preserve different forms of structure in the data. Keep in mind that an intrinsic dimensionality of $k$ doesn't imply that any particular NLDR algorithm will be able to find a good $k$ dimensional representation. For example, the surface of a sphere is intrinsically two dimensional, but many NLDR algorithms would require three dimensions to represent it because it can't be flattened without distortion.
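A rough sketch of the sphere example, using Isomap as one representative NLDR method (scikit-learn assumed; the neighbourhood size and sample sizes are arbitrary):

```python
# Compare Isomap reconstruction error at 1/2/3 components for a flattenable
# manifold (swiss roll) and a non-flattenable one (sphere surface).
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)

sphere = rng.normal(size=(1000, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)    # points on the unit sphere
roll, _ = make_swiss_roll(n_samples=1000, random_state=0)  # developable 2d manifold

for name, X in [("swiss roll", roll), ("sphere", sphere)]:
    errs = [Isomap(n_neighbors=10, n_components=k).fit(X).reconstruction_error()
            for k in (1, 2, 3)]
    print(name, "reconstruction error for 1/2/3 components:", np.round(errs, 3))
# Expect the swiss roll's error to level off after 2 components, while the
# sphere keeps improving at 3 because it cannot be flattened without distortion.
```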

Other considerations

Sometimes a choice can be made on the basis of practical considerations. These are things like runtime/memory costs, ease of use and/or interpretation, etc. I described some of these issues for PCA vs. NLDR in this post (the question was framed as 'why use PCA instead of NLDR?' so the answer leans toward PCA, which is not always the most appropriate method).

Sometimes it makes sense to simply try multiple methods and see what works best for your application. For example, if dimensionality reduction is used as a preprocessing step for a downstream supervised learning algorithm, then the choice of dimensionality reduction algorithm is a model selection problem (along with the dimensionality and any hyperparameters). This can be addressed using cross validation. Sometimes no dimensionality reduction at all works best in this context. If dimensionality reduction is performed for visualization, then you might choose the method that helps give better visual intuition about the data (highly application specific).
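To make the cross validation idea concrete, here is a minimal sketch (scikit-learn assumed; the dataset and classifier are placeholders) that treats the choice of reduction method and its dimensionality as tunable parts of a supervised pipeline, including the "no reduction at all" option:

```python
# Cross-validate PCA (with varying dimensionality) against no reduction
# ("passthrough") inside a supervised pipeline.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", "passthrough"),                     # placeholder step
    ("clf", LogisticRegression(max_iter=2000)),
])

param_grid = [
    {"reduce": ["passthrough"]},                   # no dimensionality reduction
    {"reduce": [PCA()], "reduce__n_components": [5, 10, 20, 40]},
]

search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Note that methods without an out-of-sample transform (e.g. standard t-SNE) don't fit neatly into this kind of pipeline, which is itself a practical consideration.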
