It would take more than a terabyte just to store a dense distance matrix of that size, using double precision floating point numbers. It's probably not feasible to attack such a large-scale problem head-on using a standard MDS algorithm. The runtime scales as $O(n^3)$, and you may not even be able to fit the distance matrix in memory on a single machine. Depending on your situation, you may be able to work around it. How to do this depends on your specific data and problem, but here are a few examples:
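To put a number on that: for $n$ points, a dense double-precision distance matrix needs $8n^2$ bytes, so for example $n = 5 \times 10^5$ already gives $8 \times (5 \times 10^5)^2 = 2 \times 10^{12}$ bytes, i.e. 2 TB (roughly half that if you exploit symmetry and store only one triangle).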
1) PCA is equivalent to classical MDS with the Euclidean distance metric, so if this is the type of MDS you want to perform, you can use PCA instead (see the first sketch after this list). The one additional requirement is that you have access to the original data points as vectors, rather than just the distance matrix. There are many algorithms for efficiently running PCA on enormous datasets.
2) Use an approximate form of MDS. This approach works for classical MDS, and it doesn't require a Euclidean distance metric or a vector space representation of the data; access to the distances is all that's required. Landmark MDS is a good example (see the second sketch after this list). The first step is to choose a smaller set of 'landmark points' that represent the full data set. These can be obtained simply by sampling randomly from the data, or by using a quick, greedy algorithm to optimize them a little more. Clustering could also be used in principle, but it's computationally expensive and not necessary. Classical MDS is then performed on the landmark points to embed them in a vector space, and the resulting landmark coordinates are used to map the full dataset into the embedding space. Although not originally formulated as such, it turns out that Landmark MDS works by using the Nyström approximation, a way to approximate the eigenvalues/eigenvectors of a large matrix using a smaller submatrix.
3) If you want to perform nonclassical MDS, methods like Landmark MDS won't work, because nonclassical variants of MDS are solved iteratively using optimization algorithms rather than by solving an eigenvalue problem. It might work to take the following approach instead (see the third sketch after this list): a) Select a representative subsample of the data. b) Use your chosen flavor of MDS to embed these points into a vector space. c) Using the subsampled data points as exemplars, learn a mapping into the embedding space with a nonlinear regression method. Proper care would be needed to learn a good mapping without overfitting. d) Use the learned mapping to obtain an embedding for the entire data set. I recall seeing a couple of papers describe this kind of approach as a way to perform out-of-sample generalization for nonlinear dimensionality reduction algorithms in general, but it seems to me that it could also be used to efficiently approximate nonclassical MDS. It's possible that more specialized solutions exist; if so, I'd be glad to hear about them.
4) Parallelization could be used to solve the full problem if absolutely necessary. For classical MDS, for example, you could distribute blocks of the distance matrix across machines, then use parallel/distributed matrix operations and an eigensolver to perform MDS; some analogous scheme could probably be devised for nonclassical MDS. But this could be a hefty undertaking, and it's quite possible that approximate MDS would work well enough, so that's a better place to start.
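To make (1) concrete, here's a minimal R sketch on made-up data showing that classical MDS on Euclidean distances recovers the PCA scores up to a reflection of each axis:

```r
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)  # toy data in place of your real points

pca  <- prcomp(X)                        # PCA on the raw vectors (centers by default)
cmds <- cmdscale(dist(X), k = 2)         # classical MDS on Euclidean distances

# The two embeddings agree up to the sign of each axis
max(abs(abs(pca$x[, 1:2]) - abs(cmds)))  # ~ 1e-12
```

For data too large for prcomp() you'd swap in an incremental or randomized PCA/SVD routine; the point here is just the equivalence.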
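For (2), a bare-bones sketch of Landmark MDS, again on toy data so it runs standalone. The key property is that only the $n \times m$ distances to the $m$ landmarks are ever formed, never the full $n \times n$ matrix; the vectors below are used only as a convenient way to generate those distances:

```r
set.seed(1)
X <- matrix(rnorm(2000 * 10), nrow = 2000)  # stand-in for the full data set
k <- 2                                      # target embedding dimension
m <- 50                                     # number of landmark points
land <- sample(nrow(X), m)                  # landmarks by simple random sampling

# Step 1: classical MDS on the m x m landmark distance matrix
D2  <- as.matrix(dist(X[land, ]))^2         # squared landmark distances
J   <- diag(m) - matrix(1 / m, m, m)        # double-centering matrix
B   <- -0.5 * J %*% D2 %*% J
eig <- eigen(B, symmetric = TRUE)
lam <- eig$values[1:k]                      # top-k eigenvalues
V   <- eig$vectors[, 1:k]                   # and eigenvectors

# Step 2: embed ALL points from their squared distances to the landmarks.
# This is the Nystrom step; only an n x m distance matrix is needed.
D2a <- outer(rowSums(X^2), rowSums(X[land, ]^2), "+") - 2 * X %*% t(X[land, ])
mu  <- colMeans(D2)                         # mean squared distance per landmark
Y   <- -0.5 * sweep(D2a, 2, mu) %*% sweep(V, 2, sqrt(lam), "/")  # n x k embedding
```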
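And for (3), a rough sketch of the subsample-then-regress idea, using MASS::isoMDS for the nonmetric MDS step and one GAM per embedding coordinate as the (entirely interchangeable) nonlinear regressor. Everything here is made up for illustration; in practice you'd validate the mapping, e.g. on a held-out part of the subsample:

```r
library(MASS)   # isoMDS: Kruskal's nonmetric MDS
library(mgcv)   # gam: the nonlinear regressor used here

set.seed(1)
X   <- matrix(rnorm(5000 * 3), nrow = 5000)  # stand-in for the full data
sub <- sample(nrow(X), 300)                  # (a) representative subsample

emb <- isoMDS(dist(X[sub, ]), k = 2, trace = FALSE)$points  # (b) nonmetric MDS

# (c) learn a smooth map from features to each embedding coordinate
# (an additive model is a simplification; interactions may be needed)
df   <- data.frame(X[sub, ], y1 = emb[, 1], y2 = emb[, 2])
fit1 <- gam(y1 ~ s(X1) + s(X2) + s(X3), data = df)
fit2 <- gam(y2 ~ s(X1) + s(X2) + s(X3), data = df)

# (d) apply the learned map to every point in the full data set
Yfull <- cbind(predict(fit1, data.frame(X)), predict(fit2, data.frame(X)))
```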
References:
De Silva, V. and Tenenbaum, J. B. (2004). Sparse multidimensional scaling using landmark points.
Platt, J. (2005). FastMap, MetricMap, and Landmark MDS are all Nyström Algorithms.
Regarding stress, I don't think there is a single good value to recommend, as it depends on things like the number of observations. I'd say both of the solutions you found were good fits, the second somewhat more so. However, with the second you have the added complexity of displaying and interpreting a 3-d configuration of points. I'd plot the 2-d configuration and all pairs of the three axes of the 3-d configuration, and see if there is any qualitative difference that would suggest retaining the 3-d solution.
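For example, with vegan (assuming m2 and m3 hold your 2-d and 3-d metaMDS() fits; the object names are hypothetical):

```r
library(vegan)

# the 2-d configuration
plot(scores(m2, display = "sites"), main = "2-d NMDS")

# all pairs of the three axes of the 3-d configuration
pairs(scores(m3, display = "sites"), main = "3-d NMDS, pairwise axes")
```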
As with p-values for any model beyond the linear model (GLMs, GAMs, etc.), these p-values are approximate, relying on asymptotic behaviour (i.e. as sample sizes get large); more so for GAMs, because they have to be computed whilst accounting for the selection of the smoothness parameter(s) that control the wiggliness of the fitted spline/surface. The reported p-value does this correction, and is based on the surprising frequentist coverage properties of the Bayesian credible interval for the estimated spline. As a result, you can think of the p-value as coming from a test of the fitted smooth against a zero function (so, in the case of ordisurf(), a flat surface at zero), and the p-value as an approximate summary of that test.
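For instance, using vegan's built-in example data to show where that approximate p-value is reported:

```r
library(vegan)
data(varespec, varechem)

ord  <- metaMDS(varespec, trace = FALSE)
surf <- ordisurf(ord ~ Ca, data = varechem, plot = FALSE)

# summary.gam reports the approximate test of the fitted surface against
# a (flat) zero function under "Approximate significance of smooth terms"
summary(surf)
```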
If you want to deep-dive into this, see the references in ?summary.gam regarding p-values, notably one of Simon Wood's two 2013 papers on p-values (not the one on random effects). Otherwise, just take the comments in ?summary.gam at face value.
Best Answer
IIUC, you have two types of forests, and each of the 2000 rows in your dataset belongs to one of those types.
In principle, it is appropriate to use adonis() for testing whether the two types of forest are different, and with a p-value of less than 0.05 you can be quite confident that they indeed differ. However, the problem is that you have quite a lot of data (2000 rows), which makes it easy for adonis() to find some differences, even if they are not very relevant in your eyes. The two types of forest are probably not one hundred percent identical, and given lots of data, the test will almost always find that they differ. That is why, in general, one should be cautious about applying significance tests to large datasets.

As far as NMDS is concerned, it is more of a visualization tool, so you can get some feeling for the relative positioning of the points. This is often quite helpful for building intuition, e.g. to see how well the data from the two types of forest are separated, but it doesn't give you "concrete evidence". Note that it is a map from your original space of 250 dimensions to the two-dimensional space; this inevitably loses a lot of information, and it is difficult or impossible to figure out what exactly is lost.
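As a concrete sketch of both tools, using adonis2() (the current incarnation of adonis() in vegan); comm and forest_type are hypothetical stand-ins for your community matrix and the factor with the two forest types:

```r
library(vegan)

# permutational MANOVA: does community composition differ between forest types?
fit <- adonis2(comm ~ forest_type, permutations = 999)
fit   # Pr(>F) is the permutation p-value; also look at R2 for effect size

# NMDS as a visualization: how well do the two types separate in 2-d?
ord <- metaMDS(comm, k = 2, trace = FALSE)
plot(ord, display = "sites", type = "n")
points(ord, display = "sites", col = as.integer(factor(forest_type)), pch = 19)
legend("topright", legend = levels(factor(forest_type)), col = 1:2, pch = 19)
```

With 2000 rows, the R2 column (the proportion of variation explained by forest type) is arguably more informative than the p-value itself.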
Your three questions: using adonis() is, in principle, correct, but, as explained above, it might be "too good" to be of use. And it is totally fine to use NMDS as a visualization aid.