Solved – Nonlinear Dimensionality Reduction: geometric/topological algorithms vs. autoencoders

autoencoders, deep-learning, dimensionality-reduction, machine-learning, manifold-learning

As I understand it, there are three main approaches to nonlinear dimensionality reduction:

  • Manifold learning (geometric/topological algorithms like ISOMAP, LLE, LTSA)
  • Autoencoders
  • Methods that do not fit into the first two categories (probability-inspired t-SNE, kernel PCA, etc.)

What are the benefits and drawbacks of the first two approaches?

Can one expect that autoencoders will completely outperform manifold learning, the way deep learning has overshadowed most classical machine learning algorithms in terms of performance?

Best Answer

Before I attempt to answer your question, I want to draw a sharper distinction between the methods you are referring to.

The first set of methods I believe you are referring to are neighborhood-based dimensionality reduction methods: a neighborhood graph is constructed whose edges encode a distance metric between nearby points. Now, to play devil's advocate against myself, MDS and ISOMAP can both be interpreted as forms of kernel PCA, so although this distinction seems relatively sharp, various interpretations shift these methods from one class to another.
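To make that concrete, here is a minimal sketch (assuming scikit-learn is available) of two neighborhood-based methods run on a toy S-curve manifold. The sample size and neighborhood size are arbitrary choices for illustration, not recommendations.

```python
# Neighborhood-based reduction on a toy "S-curve" manifold.
# Isomap builds a k-nearest-neighbor graph and preserves geodesic
# distances along it; LLE preserves local reconstruction weights.
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=1000, noise=0.05, random_state=0)

# Isomap: shortest paths on the neighborhood graph, then classical MDS.
iso = Isomap(n_neighbors=10, n_components=2)
X_iso = iso.fit_transform(X)

# LLE: reconstruct each point from its neighbors, keep those weights in 2D.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, method="standard")
X_lle = lle.fit_transform(X)

print(X_iso.shape, X_lle.shape)  # (1000, 2) (1000, 2)
```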

The second set of methods you are referring to I would place in the field of unsupervised neural network learning. An autoencoder is a special architecture that attempts to map the input space into a lower-dimensional space in such a way that it can be decoded back to the input space with minimal loss of information.
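Here is a minimal sketch of that idea (assuming PyTorch): a fully connected encoder maps the input to a low-dimensional code, a decoder maps it back, and the reconstruction error is the training signal. The input size, layer widths, and code dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        code = self.encoder(x)            # low-dimensional representation
        return self.decoder(code), code   # reconstruction + code

model = Autoencoder()
x = torch.randn(32, 784)                          # a dummy batch
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)           # reconstruction loss
```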

First, let's talk about the benefits and drawbacks of autoencoders. Autoencoders are generally trained using some variant of stochastic gradient descent, which yields some advantages: the dataset does not have to fit into memory, since batches can be loaded dynamically and used for gradient updates, unlike many neighborhood-based methods, which force the entire dataset to reside in memory. The architecture of an autoencoder also allows prior knowledge about the data to be incorporated into the model. For example, if our dataset contains images, we can build an architecture that uses 2D convolutions; if it contains time series with long-range dependencies, we can use gated recurrent networks (check out Seq2Seq learning). This is the power of neural networks in general: they let us encode prior knowledge about the problem into our models, which other models, and dimensionality reduction algorithms more specifically, cannot do.
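A rough sketch of both points (assuming PyTorch): a convolutional autoencoder for 28x28 grayscale images, trained in minibatches so the full dataset never has to sit in memory; in practice a Dataset/DataLoader would stream images lazily from disk, and here random tensors stand in for real data. All sizes and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28 -> 14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14 -> 7
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2),    # 7 -> 14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=2, stride=2),     # 14 -> 28
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Placeholder data; a real Dataset would load each image from disk on demand.
loader = DataLoader(TensorDataset(torch.randn(1024, 1, 28, 28)),
                    batch_size=64, shuffle=True)

model = ConvAutoencoder()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for (x,) in loader:                           # one minibatch at a time
    loss = nn.functional.mse_loss(model(x), x)
    opt.zero_grad()
    loss.backward()
    opt.step()
```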

From a theoretical perspective, there are a couple of nice theorems: as the network gets deeper, the complexity of the functions it can learn grows exponentially with depth. In general, at least until something new is discovered, you are not going to find a more expressive/powerful model than a correctly chosen neural network.

Now, although all this sounds great, there are drawbacks. Convergence of neural networks is non-deterministic and depends heavily on the architecture used, the complexity of the problem, the choice of hyperparameters, and so on. The expressiveness of neural networks causes problems too: they tend to overfit very quickly if the right regularization is not used.
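As a small sketch (assuming PyTorch), two common ways to rein in that overfitting are an L2 penalty via the optimizer's weight_decay and early stopping on a held-out validation loss. The model, random data, and hyperparameter values below are placeholders, not recommendations.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784))
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)  # L2 penalty

x_train, x_val = torch.randn(512, 784), torch.randn(128, 784)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    recon = model(x_train)
    loss = nn.functional.mse_loss(recon, x_train)
    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():                     # monitor held-out reconstruction error
        val_loss = nn.functional.mse_loss(model(x_val), x_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # stop before overfitting sets in
            break
```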

On the other hand, neighborhood-based methods are less expressive, but they tend to converge in a predictable amount of time and depend on far fewer hyperparameters than neural networks.

The choice of method depends directly on the problem. If you have a small dataset that fits in memory and does not involve structured data (images, video, audio), classical dimensionality reduction is probably the way to go. But as structure is introduced, the complexity of your problem increases, and the amount of data you have grows, neural networks become the better choice.

Hope this helps.