Dimensionality Reduction: Two Broad Categories, With and Without an Explicit Mapping Function

dimensionality-reduction, manifold-learning, nonparametric, terminology

I think there are two very broad categories of dimensionality reduction (DR) techniques:

  1. We can compute an analytic form of the mapping from the training data, say $x\mapsto y: y=f(x)$, where $f(\cdot)$ can be linear or nonlinear. In this case, given a new data point, we just plug it in as $x$ and compute its low-dimensional position. Examples include PCA (unsupervised) and LDA (supervised);
  2. We don't have such a mapping function; the mapping is defined only implicitly by the data. In this case, when a new data point arrives, we have to pool it with the training data and re-do the DR. Examples include Isomap and t-SNE. (See the sketch below.)
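
A minimal sketch of the contrast, using scikit-learn method names (this is only an illustration of the two behaviours, not of the algorithms themselves):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
x_new = rng.normal(size=(1, 10))

# Category 1: PCA learns an explicit mapping f, so a new point is simply plugged in.
pca = PCA(n_components=2).fit(X_train)
y_new = pca.transform(x_new)

# Category 2: t-SNE only embeds the points it was fitted on; there is no transform().
Y_train = TSNE(n_components=2, perplexity=30).fit_transform(X_train)
# To place x_new, we would have to pool it with the training data and re-run the embedding:
Y_all = TSNE(n_components=2, perplexity=30).fit_transform(np.vstack([X_train, x_new]))
```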

Am I right about this? If so, what are the names of the two categories?


Bonus Question: Category 1 techniques can easily be incorporated into classification algorithms, specifically $k$-NN. But what if I want to use the distances computed by a Category 2 technique in my $k$-NN classifier? Must I re-do the mix-and-reproject step every time a new data point comes in? That would be computationally infeasible…

Best Answer

These two categories are sometimes referred to as parametric and non-parametric dimensionality reduction.

Parametric dimensionality reduction yields an explicit mapping $f(x)$ and is called "parametric" because it considers only a specific, restricted class of mappings. For example, PCA can only yield a linear function $f(x)$.
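
For instance, the PCA mapping is just a projection of the centred point onto the leading principal directions, $f(x) = W(x-\mu)$. A minimal sketch, using scikit-learn's `components_` and `mean_` attributes:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
pca = PCA(n_components=2).fit(X)

def f(x):
    """Explicit PCA mapping: centre the point and project onto the principal directions."""
    return (x - pca.mean_) @ pca.components_.T

x_new = rng.normal(size=(1, 5))
# Coincides with the library's own out-of-sample projection.
assert np.allclose(f(x_new), pca.transform(x_new))
```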

Note that kernel PCA is a parametric method as well (the choice of kernel defines the class of mappings), even though, thanks to the kernel trick, the function $f(x)$ is "less explicit" than for PCA and can only be written as a sum over all training data points, $f(x)=\sum_{i\in\text{training set}} f_i(x)$.
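
A new point can therefore still be mapped directly. A minimal scikit-learn sketch (the kernel and its parameters are chosen arbitrarily here):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 10))
x_new = rng.normal(size=(1, 10))

# Kernel PCA: f(x) is a weighted sum of kernel evaluations against the training points
# (up to kernel centring), so it is still an explicit function we can plug a new point into.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1).fit(X_train)
y_new = kpca.transform(x_new)
```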

In contrast, non-parametric dimensionality reduction is entirely "data-driven", meaning that the mapping $f$ depends on all the data. Consequently, as you say, test data cannot be directly mapped with the mapping learnt on the training data.

In recent years, there have been some developments on extending non-parametric dimensionality reduction methods so that they can handle test (also called "out-of-sample") data. I am not at all familiar with this literature, but I will give a couple of links that seem relevant. The first paper explicitly discusses $k$-nearest-neighbour classifiers used with [supervised analogues of] Isomap/t-SNE, as requested in your bonus question.
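
To make the idea concrete, here is one very simple way to obtain such an extension (a rough sketch of the general idea only, not the specific method of any of the papers below): embed the training set non-parametrically, then learn an explicit regression mapping from the high-dimensional space to that embedding and use it to project test points for the $k$-NN classifier.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 10))
y_train = (X_train[:, 0] > 0).astype(int)    # toy labels
X_test = rng.normal(size=(20, 10))

# Non-parametric embedding of the training set only.
Z_train = TSNE(n_components=2, perplexity=30).fit_transform(X_train)

# Learn an explicit surrogate mapping R^10 -> R^2 that approximates the embedding.
oos_map = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1).fit(X_train, Z_train)
Z_test = oos_map.predict(X_test)             # no re-embedding of the pooled data

# k-NN in the low-dimensional space, as in the bonus question.
knn = KNeighborsClassifier(n_neighbors=5).fit(Z_train, y_train)
pred = knn.predict(Z_test)
```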


Bunte et al. 2011, A general framework for dimensionality reducing data visualization mapping

In recent years a wealth of dimension reduction techniques for data visualization and preprocessing has been established. Non-parametric methods require additional effort for out-of-sample extensions, because they just provide a mapping of a given finite set of points. In this contribution we propose a general view on non-parametric dimension reduction based on the concept of cost functions and properties of the data. Based on this general principle we transfer non-parametric dimension reduction to explicit mappings of the data manifold such that direct out-of-sample extensions become possible.

Gisbrecht et al. 2012, Out-of-Sample Kernel Extensions for Nonparametric Dimensionality Reduction

Nonparametric dimensionality reduction (DR) techniques such as locally linear embedding or t-distributed stochastic neighbor (t-SNE) embedding constitute standard tools to visualize high dimensional and complex data in the Euclidean plane. With increasing data volumes and streaming applications, it is often no longer possible to project all data points at once. Rather, out-of-sample extensions (OOS) derived from a small subset of all data points are used. In this contribution, we propose a kernel mapping for OOS in contrast to direct techniques based on the DR method. This can be trained based on a given example set, or it can be trained indirectly based on the cost function of the DR technique. Considering t-SNE as an example and several benchmarks, we show that a kernel mapping outperforms direct OOS as provided by t-SNE.

There is also an older paper, Bengio et al. 2004, Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering -- which is apparently less of a general framework, but I cannot comment on how exactly it differs from the 2011-2012 papers linked above. Any comments on that are very welcome.

Here is one figure from Bunte et al. to attract attention:

[Figure from Bunte et al. 2011]
