Solved – MDS: Is Kruskal’s Stress-1 affected by scale of the data, or the number of points

distance, multidimensional scaling

In Multidimensional Scaling, Kruskal's Stress-1 is a commonly used measure of fit.

It is defined as:

$\sqrt{\frac{\sum (d_{ij}-\delta_{ij})^{2}}{\sum d_{ij}^{2}}}$

where $d_{ij}$ represents the distances, and $\delta_{ij}$ represents the disparities.

I'm looking to use it to compare across studies in which there are differing numbers of data points, and in which the scales are different. Is this measure unaffected by the scale, and by the number of points? Why/why not?

As for what scale means, imagine that in one study the MDS related to distances between cities measured in miles, while in the other the distances between cities were measured in kilometres.

I would have thought that part of the point of normalization was to ensure that comparisons across studies with different numbers of data points could be made. However, I sometimes see diagrams showing that, when tested on random data, Stress-1 increases with more points.

Best Answer

Scale should make no difference. But, all else being equal, the greater the number of points, the higher the stress.
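A short derivation of the scale claim, under the assumption that a change of units multiplies every fitted distance $d_{ij}$ and every disparity $\delta_{ij}$ by the same constant $c > 0$ (for example $c \approx 1.609$ when going from miles to kilometres):

$\sqrt{\frac{\sum (c\,d_{ij}-c\,\delta_{ij})^{2}}{\sum (c\,d_{ij})^{2}}} = \sqrt{\frac{c^{2}\sum (d_{ij}-\delta_{ij})^{2}}{c^{2}\sum d_{ij}^{2}}} = \sqrt{\frac{\sum (d_{ij}-\delta_{ij})^{2}}{\sum d_{ij}^{2}}}$

The factor $c^{2}$ cancels between numerator and denominator, so under that assumption the value of Stress-1 does not depend on the unit of measurement.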

As ttnphns comments, the cause of this is that with fewer observations the model over-fits, so the stress is biased downward. As the number of observations grows, the extent of the bias shrinks.
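Both claims can be checked with a small simulation. The following is only a sketch under stated assumptions: it uses classical scaling (`cmdscale`) as a convenient stand-in for a stress-minimising MDS routine, treats the input dissimilarities as the disparities, and draws random Gaussian data in 10 dimensions. The helper names `stress1`, `mds_stress` and `mean_random_stress` are made up for this illustration.

```r
# Stress-1 as defined in the question: d = fitted distances, delta = disparities.
stress1 <- function(d, delta) sqrt(sum((d - delta)^2) / sum(d^2))

# Assumption: classical scaling stands in for a stress-minimising MDS,
# and the input dissimilarities are used directly as the disparities.
mds_stress <- function(delta, k = 2) {
  conf <- cmdscale(delta, k = k)                 # n x k configuration
  stress1(as.vector(dist(conf)), as.vector(delta))
}

# Mean Stress-1 over `reps` random data sets with n points in p dimensions.
mean_random_stress <- function(n, p = 10, reps = 20) {
  mean(replicate(reps, mds_stress(dist(matrix(rnorm(n * p), n, p)))))
}

set.seed(1)
sizes <- c(5, 10, 20, 40)
by_n <- sapply(sizes, mean_random_stress)
print(round(setNames(by_n, sizes), 3))           # stress for each number of points

# Scale check: the same dissimilarities expressed in two units.
delta_miles <- dist(matrix(rnorm(15 * 10), 15, 10))
stress_miles <- mds_stress(delta_miles)
stress_km <- mds_stress(delta_miles * 1.609344)  # miles -> kilometres
print(c(miles = stress_miles, km = stress_km))
```

In this setup the mean stress for 5 points should come out well below that for 40 points, while the miles and kilometres values should agree up to floating-point error.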

Pretty much every measure of goodness-of-fit with a fixed minimum and maximum, as in this case, suffers from the same problem. For example, R-squared goes down as the number of observations goes up, all else being equal, and the Adjusted R-squared was developed to address this. It would be great to have measures that are not influenced by the number of observations, but because the degree of "noise" differs from problem to problem, this is probably not solvable (e.g., Adjusted R-squared is not used by people with a good knowledge of regression).

You can compare different data sets by randomly subsampling the larger data set down to the size of the smaller one. For example, if data set 1 has 20 observations and data set 2 has 30, randomly sample 20 observations from data set 2 and compare the resulting stress with that of data set 1. If you repeat this multiple times, you will be able to do a significance test comparing the stress levels.
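The subsampling procedure might look roughly like this in R. This is a sketch, not a definitive implementation: the two data sets are simulated placeholders, `cmdscale` again stands in for whichever MDS routine is actually used, and the final proportion is just one simple way to summarise the repeats, since the exact form of the test is left open above. The names `mds_stress`, `data1`, `data2` and `prop_ge` are hypothetical.

```r
# Stress-1 with fitted distances d and disparities delta.
stress1 <- function(d, delta) sqrt(sum((d - delta)^2) / sum(d^2))

# Assumption: classical scaling as a stand-in MDS; disparities = input distances.
mds_stress <- function(X, k = 2) {
  delta <- dist(X)
  conf <- cmdscale(delta, k = k)
  stress1(as.vector(dist(conf)), as.vector(delta))
}

set.seed(42)
data1 <- matrix(rnorm(20 * 6), nrow = 20)   # placeholder data set 1: 20 observations
data2 <- matrix(rnorm(30 * 6), nrow = 30)   # placeholder data set 2: 30 observations

stress_1 <- mds_stress(data1)

# Repeatedly draw 20 of the 30 rows of data set 2 (without replacement)
# and recompute the stress on each subsample.
n_rep <- 500
stress_2_sub <- replicate(n_rep, {
  rows <- sample(nrow(data2), nrow(data1))
  mds_stress(data2[rows, , drop = FALSE])
})

print(summary(stress_2_sub))

# Share of subsamples whose stress is at least as large as that of data set 1.
prop_ge <- mean(stress_2_sub >= stress_1)
print(prop_ge)
```

Because every subsample has the same number of observations as data set 1, the two stress values are compared at equal sample size, which removes the dependence on the number of points discussed above.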

Related Solutions

Solved – Help me understand nMDS algorithm

A few opening remarks. In nMDS you have a matrix of dissimilarities $D_{ij}$ (not distances; for instance, this can be the percentage of people who said in some poll that i and j are not similar). What you want to obtain is a set of points ($E=[X_i]$) representing the objects in an M-dimensional space; having it, you have the matrix of distances between the objects in this space, $d_{ij}$.
nMDS tries to find an $E$ such that $d_{ij}$ has the same rank order as $D_{ij}$; it is like connecting each pair of objects with a spring that is stronger the less dissimilar the pair is, and then releasing the whole configuration. After relaxation, the objects that have been connected by stronger springs will be nearer to each other.
Point 4 is something like an overfitted regression. You have some approximation of the objects' positions $E^a$, and so also approximated distances $d^a_{ij}$. Now you can fit the regression $d^a_{ij} \sim f(D_{ij})$ and use it to compute the distances that would hold if $D$ were represented perfectly, $d^r_{ij}=f(D_{ij})$.
Still, because you cannot directly compute $E$ from $d^r$ (this is where the nonlinear optimization problem lies), you must somehow perturb $E$ so that the distances approach $d^r$. The standard method is to mimic the physical analogy with springs and move the objects connected by the most extended springs (those with the largest $|d^a_{ij}-d^r_{ij}|$) towards each other, so that the potential energy of the system (this is the STRESS) is reduced as much as possible.
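The regression step described above can be sketched in a few lines of base R. This is an illustration under assumptions: $f$ is taken to be a monotone (isotonic) regression, as in Kruskal's approach, fitted with `isoreg`; the dissimilarities and the starting configuration are random placeholders; and only one evaluation of the stress is shown, not the full iterative update of $E$.

```r
set.seed(7)
n <- 8

# Placeholder dissimilarities D_ij and a current guess E^a in M = 2 dimensions.
D   <- as.vector(dist(matrix(runif(n * 5), n, 5)))
E_a <- matrix(rnorm(n * 2), n, 2)
d_a <- as.vector(dist(E_a))            # approximated distances d^a_ij

# Monotone regression of d^a on the rank order of D:
# sort the pairs by dissimilarity, fit a non-decreasing sequence,
# then put the fitted values back in the original pair order.
ord <- order(D)
d_r <- numeric(length(d_a))
d_r[ord] <- isoreg(d_a[ord])$yf        # target distances d^r_ij = f(D_ij)

# Stress-1 of the current configuration against the targets.
stress <- sqrt(sum((d_a - d_r)^2) / sum(d_a^2))
print(stress)

# Pairs with the largest |d^a - d^r| are the "most extended springs".
worst <- head(order(abs(d_a - d_r), decreasing = TRUE), 3)
print(worst)
```

A full nMDS routine would now move the points in $E^a$ to shrink these discrepancies, recompute $d^a$ and $d^r$, and repeat until the stress stops decreasing.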

Solved – Adding labels to points using mds and scatter3d package with R

Basically, what you need is to store your scatterplot3d in a variable and reuse it like this:

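library(scatterplot3d)                       # provides scatterplot3d()
x <- replicate(10, rnorm(100))               # 100 observations, 10 variables
x.mds <- cmdscale(dist(x), eig=TRUE, k=3)    # classical MDS in 3 dimensions
s3d <- scatterplot3d(x.mds$points[,1:3])     # keep the returned plot object
# xyz.convert() maps 3D coordinates to the 2D coordinates of the plot
text(s3d$xyz.convert(0,0,0), labels="Origin")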

Replace the coordinates and text by whatever you want to draw. You can also use a color vector to highlight the groups of interest.

The R.basic package, from Henrik Bengtsson, seems to provide additional facilities to customize 3D plots, but I never tried it.
