The original Isomap paper defined "residual variance" as follows (reference 42):
$$\text{residual variance} = 1 - R^2(\hat D_M, D_Y)$$
where $R$ is the Pearson correlation coefficient taken over all entries of $\hat D_M$ and $D_Y$. Here $\hat D_M$ is the Euclidean distance matrix for PCA and the geodesic distance matrix for Isomap, and $D_Y$ is the Euclidean distance matrix of the low-dimensional embedding; this matrix changes with the number of dimensions used for the embedding.
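As a concrete sketch, residual variance can be computed directly from the two distance matrices with NumPy/SciPy; the random data and the trivial "embedding" below are only placeholders, and geodesic distances would replace `D_M` in the Isomap case:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))   # high-dimensional data
Y = X[:, :2]                   # a (trivial) 2-D "embedding"

# Condensed Euclidean distance vectors (upper triangle of each distance matrix)
D_M = pdist(X)   # stands in for \hat D_M (geodesic distances for Isomap)
D_Y = pdist(Y)   # pairwise distances in the low-dimensional embedding

r = np.corrcoef(D_M, D_Y)[0, 1]   # Pearson correlation over all entries
residual_variance = 1 - r**2
```

Dropping dimensions distorts some pairwise distances, so the correlation is below 1 and the residual variance is strictly positive here.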
EDIT:
This is numerically different from the explained variance of PCA derived from the eigenvalues, and I don't know whether there is a direct connection between the two.
EDIT:
I asked here for the connection between residual and explained variance.
Before I attempt to answer your question I want to create a stronger separation between the methods you are referring to.
The first set of methods I believe you are referring to are neighborhood-based dimensionality reduction methods, which construct a neighborhood graph whose edges represent a distance metric. Now, to play devil's advocate against myself, MDS/Isomap can both be interpreted as a form of kernel PCA, so although this distinction seems relatively sharp, various interpretations shift these methods from one class to the other.
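To make the first class concrete, here is a minimal scikit-learn sketch of a neighborhood-based method: Isomap builds a k-nearest-neighbor graph, estimates geodesic distances over it, and then applies classical MDS. The swiss-roll dataset is just a standard illustrative choice:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Swiss roll: a 2-D manifold curled up in 3-D, the classic Isomap test case
X, _ = make_swiss_roll(n_samples=300, random_state=0)

# k-NN graph -> geodesic distances -> classical MDS on those distances
iso = Isomap(n_neighbors=10, n_components=2)
Y = iso.fit_transform(X)   # Y has shape (300, 2)
```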
The second set of methods you are referring to I would place in the field of unsupervised neural network learning. Autoencoders are a special architecture that attempts to map an input space into a lower-dimensional space that allows decoding back to the input space with minimal loss in information.
First, let's talk about the benefits and drawbacks of autoencoders. Autoencoders are generally trained with some variant of stochastic gradient descent, which yields some advantages: the dataset does not have to fit into memory and can be loaded dynamically and trained on in mini-batches, unlike many neighborhood-based methods, which force the entire dataset to exist in memory. The architecture of an autoencoder also allows prior knowledge about the data to be incorporated into the model. For example, if our dataset contains images, we can create an architecture that uses 2D convolutions; if it contains time series with long-term dependencies, we can use gated recurrent networks (check out Seq2Seq learning). This is the power of neural networks in general: they let us encode prior knowledge about the problem into our models. This is something that other models, and dimensionality reduction algorithms in particular, cannot do.
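The mini-batch training loop described above can be sketched in plain NumPy with the simplest possible autoencoder, a single linear encoder/decoder pair; all sizes, learning rate, and initialization scale here are arbitrary illustrative choices, not a recommended setup:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))               # toy dataset: 500 samples, 20-D

d, k = X.shape[1], 2                         # bottleneck of 2 units
W_enc = rng.normal(scale=0.1, size=(d, k))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(k, d))   # decoder weights
lr, batch = 0.01, 32

def loss(X):
    """Mean squared reconstruction error per sample."""
    return np.mean(np.sum((X @ W_enc @ W_dec - X) ** 2, axis=1))

loss_before = loss(X)
for epoch in range(100):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        B = X[idx[start:start + batch]]  # mini-batch: the full dataset never has to be resident
        Z = B @ W_enc                    # encode
        R = Z @ W_dec                    # decode
        G = 2 * (R - B) / len(B)         # gradient of the batch loss w.r.t. the reconstruction R
        gW_dec = Z.T @ G                 # backprop through the decoder
        gW_enc = B.T @ (G @ W_dec.T)     # backprop through the encoder
        W_dec -= lr * gW_dec
        W_enc -= lr * gW_enc
loss_after = loss(X)                     # reconstruction error drops as training proceeds
```

With linear layers and squared error this converges toward the PCA subspace; the point of the sketch is only that each update touches one mini-batch at a time.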
From a theoretical perspective, there are a couple of nice theorems: as a network gets deeper, the complexity of the functions it can learn grows exponentially. In general, at least until something new is discovered, you are not going to find a more expressive/powerful model than a correctly chosen neural network.
Now, although all this sounds great, there are drawbacks. Convergence of neural networks is non-deterministic and depends heavily on the architecture used, the complexity of the problem, the choice of hyper-parameters, etc. Their expressiveness causes problems too: they tend to overfit very quickly if the right regularization is not used.
On the other hand, neighborhood methods are less expressive, tend to run for a predictable amount of time until convergence, and depend on far fewer parameters than neural networks.
The choice of method depends directly on the problem. If you have a small dataset that fits in memory and does not involve any type of structured data (images, video, audio), classical dimensionality reduction is probably the way to go. But as structure is introduced, the complexity of your problem increases, and the amount of data you have grows, neural networks become the correct choice.
Hope this helps.
Best Answer
Dimensionality reduction means that you map each high-dimensional vector to a low-dimensional vector. In other words, you represent (replace) each high-dimensional vector by a low-dimensional vector.
Linear dimensionality reduction means that the components of the low-dimensional vector are given by linear functions of the components of the corresponding high-dimensional vector. For example, in the case of reduction to two dimensions we have:
$$y_1 = f_1(x_1, \dots, x_n), \qquad y_2 = f_2(x_1, \dots, x_n).$$
If $f_1$ and $f_2$ are (non)linear functions, we have a (non)linear dimensionality reduction.
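A tiny numerical sketch of this distinction, where the projection matrix and the particular nonlinear functions are just illustrative choices:

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])   # a 5-dimensional vector

# Linear reduction to 2-D: each output component is a linear function of the inputs
A = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0, 0.0]])
y_linear = A @ x                          # f1(x) = x1, f2(x) = x2 + x3 -> [3.0, 5.0]

# Nonlinear reduction to 2-D: f1, f2 are nonlinear functions of the inputs
y_nonlinear = np.array([np.tanh(x).sum(), np.linalg.norm(x)])
```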