Solved – MDS on large dataset (R or Python)

Tags: multidimensional-scaling, python, r

I have a large $400000 \times 400000$ dataset (a dissimilarity matrix) and I want to do multidimensional scaling on it. However, looking at the generic cmdscale() function in R, it only accepts a matrix of at most $46340 \times 46340$ as input. Is there any way I can work around this? Also, does Python (the sklearn package) have the same kind of limitation?

I should add that memory is not an issue here.

Best Answer

It would take more than a terabyte just to store a dense distance matrix of that size in double precision ($400000^2$ entries $\times$ 8 bytes $\approx 1.3$ TB). It's probably not feasible to attack such a large-scale problem head-on using a standard MDS algorithm: the runtime scales as $O(n^3)$, and you may not even be able to fit the distance matrix in memory on a single machine. Depending on your situation, you may be able to work around this. How best to do so depends on your specific data and problem, but here are a few options:

1) PCA is equivalent to classical MDS using the Euclidean distance metric. So, if this is the type of MDS you want to perform, you can use PCA instead. An additional requirement is that you have access to the original data points as vectors, rather than just the distance matrix. There are many algorithms for efficiently running PCA on enormous datasets.
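For instance, scikit-learn's IncrementalPCA processes the data in mini-batches, so the full matrix never has to be decomposed at once. A minimal sketch, where the random X is just a stand-in for your real $400000 \times d$ feature matrix:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Stand-in for the real data: in practice X would be your 400000 x d matrix
# of original feature vectors (not the distance matrix), possibly streamed
# from disk in chunks.
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 50))

# PCA fitted in mini-batches; with Euclidean distances, the resulting
# coordinates match classical MDS up to centring and sign.
ipca = IncrementalPCA(n_components=2, batch_size=10_000)
embedding = ipca.fit_transform(X)   # shape (100000, 2)
```

If the data truly won't fit in memory, IncrementalPCA's partial_fit can be called on one chunk at a time instead.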

2) Use an approximate form of MDS. This approach could work if you want to perform classical MDS. It doesn't require a Euclidean distance metric or a vector space representation of the data (access to the distances is all that's required). Landmark MDS is a good example. The first step is to choose a smaller set of 'landmark points' that represent the full data set. These can be obtained simply by sampling randomly from the data, or using a quick, greedy algorithm to optimize them a little more. Clustering could also be used in principle, but it's computationally expensive and not necessary. Classical MDS is performed on the landmark points to embed them in a vector space. The classical MDS results for the landmark points are then used to map the full dataset into the embedding space. Although not originally formulated as such, it turns out that Landmark MDS works by using the Nyström approximation, which is a way to approximate the eigenvalues/eigenvectors of a large matrix using a smaller submatrix.
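A minimal sketch of the landmark idea in NumPy (the function name is mine, not from a particular library, and it assumes the top $k$ eigenvalues of the centred landmark matrix are positive):

```python
import numpy as np

def landmark_mds(D_lm, D_rest_to_lm, k=2):
    """Approximate classical MDS via landmark points (Nystrom-style).

    D_lm         : (m, m) distances among the m landmark points
    D_rest_to_lm : (n - m, m) distances from all other points to the landmarks
    """
    m = D_lm.shape[0]
    D2 = D_lm ** 2

    # Classical MDS on the landmarks: double-centre the squared distances.
    J = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * J @ D2 @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:k]          # assumed positive
    lam, V = eigvals[top], eigvecs[:, top]
    X_lm = V * np.sqrt(lam)                      # (m, k) landmark coordinates

    # Nystrom step: triangulate every other point from its squared
    # distances to the landmarks.
    pseudo = V / np.sqrt(lam)                    # (m, k)
    mean_D2 = D2.mean(axis=0)                    # mean sq. distance to each landmark
    X_rest = -0.5 * ((D_rest_to_lm ** 2) - mean_D2) @ pseudo
    return X_lm, X_rest
```

With, say, 1000 landmarks out of the 400000 points, only a $400000 \times 1000$ slice of the distance matrix ever needs to be computed or stored.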

3) If you want to perform nonclassical MDS, methods like Landmark MDS won't work, because nonclassical variants of MDS are solved iteratively using optimization algorithms rather than by solving an eigenvalue problem. It might work to take the following approach instead (sketched in code below):

a) Select a representative subsample of the data.

b) Use your chosen flavor of MDS to embed these points into a vector space.

c) Using the subsampled data points as exemplars, learn a mapping into the embedding space with a nonlinear regression method. Proper care would be needed to learn a good mapping, avoid overfitting, etc.

d) Use the learned mapping to obtain an embedding for the entire data set.

I recall seeing a couple of papers describe this kind of approach as a way to perform out-of-sample generalization for nonlinear dimensionality reduction algorithms in general, but it seems to me that it could also be used to efficiently approximate nonclassical MDS. It's possible that more specialized solutions exist; if so, I'd be glad to hear about them.
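A rough sketch of steps (b)–(d), under the assumption that each point's distances to the exemplars are used as the regression inputs; the helper name and the choice of a k-NN regressor are illustrative, not prescriptive:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.neighbors import KNeighborsRegressor

def subsample_then_map(D_sub, D_rest_to_sub, k=2, seed=0):
    """
    D_sub         : (m, m) dissimilarities among the m subsampled exemplars
    D_rest_to_sub : (n - m, m) dissimilarities from the other points to the exemplars
    """
    # b) Nonclassical (metric, SMACOF-based) MDS on the exemplars only.
    mds = MDS(n_components=k, dissimilarity="precomputed", random_state=seed)
    Y_sub = mds.fit_transform(D_sub)                 # (m, k)

    # c) Learn a mapping from "distances to the exemplars" to embedding
    #    coordinates; any nonlinear regressor could stand in for k-NN here.
    reg = KNeighborsRegressor(n_neighbors=10)
    reg.fit(D_sub, Y_sub)

    # d) Push the remaining points through the learned mapping.
    Y_rest = reg.predict(D_rest_to_sub)              # (n - m, k)
    return Y_sub, Y_rest
```

In practice you'd want to hold out some exemplars to check that the regression generalizes before trusting the full-data embedding.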

4) Parallelization could be used to solve the full problem if absolutely necessary. For classical MDS, you could distribute blocks of the distance matrix across machines, then use parallel/distributed matrix operations and an eigensolver to perform the decomposition. Some similar scheme could probably be devised for nonclassical MDS. But this could be a hefty undertaking, and it's quite possible that approximate MDS would work well enough, so that's a better place to start.
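For what it's worth, here is one way the classical-MDS variant of that could be sketched with Dask's chunked arrays and its randomized SVD. Treat it as an outline under the assumption that the squared-distance matrix lives in a chunked, disk-backed store; the small random placeholder is only there so the snippet runs:

```python
import dask.array as da

# Placeholder: in practice D2 (the *squared* distances) would be read lazily
# from a chunked on-disk store (e.g. zarr/HDF5) at the full 400000 x 400000
# size; a small random matrix stands in here.
D2 = da.random.random((10_000, 10_000), chunks=(2_000, 2_000))
D2 = (D2 + D2.T) / 2                         # symmetrise the placeholder

# Double-centre without ever forming the centring matrix J explicitly:
# B = -0.5 * (D2 - row_mean - col_mean + grand_mean)
row_mean = D2.mean(axis=1, keepdims=True)
col_mean = D2.mean(axis=0, keepdims=True)
grand_mean = D2.mean()
B = -0.5 * (D2 - row_mean - col_mean + grand_mean)

# Randomised truncated SVD computed block-wise across the chunks; for a
# positive semi-definite B the singular vectors/values coincide with the
# eigenvectors/eigenvalues, giving the classical MDS coordinates.
u, s, vt = da.linalg.svd_compressed(B, k=2)
embedding = (u * da.sqrt(s)).compute()       # (n, 2) coordinates
```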

References:

De Silva and Tenenbaum (2004). Sparse multidimensional scaling using landmark points.

Platt (2005). FastMap, MetricMap, and Landmark MDS are all Nyström Algorithms.