Solved – In multidimensional scaling, how can one determine dimensionality of a solution given a stress value


In multidimensional scaling, how can one determine the dimensionality of a solution given a stress value? From what I understand, the stress value is inversely related to the number of dimensions of an MDS solution, and a higher stress value indicates that there is a lot of error (i.e. badness-of-fit) in the current model, indicating a solution with more dimensions. Are the randomly generated coordinates, number of variables, and number of categories in a variable related?

Best Answer

In multidimensional scaling, how can one determine dimensionality of a solution given a stress value?

A stress value alone does not let you determine the dimensionality of the data set. At best, you can judge whether the value is low or high (and even that judgment seems somewhat problematic to me).

From what I understand, the stress value is inversely related to the number of dimensions of an MDS solution,

Correct.

and that a higher stress value indicates that there is a lot of error (i.e. badness-of-fit) in the current model,

Correct.

indicating a solution with more dimensions.

That conclusion is not quite accurate. Think of stress as a function in which the number of dimensions is only one of the inputs. Other significant factors are the MDS model you are using, the initial configuration of points in the MDS configuration (map), and even the order of rows/columns in the dissimilarity matrix. You will therefore get different stress values in, say, two dimensions just by changing the initial configuration of the points (although this change is small compared with the one produced by changing the number of dimensions).
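The effect of the starting configuration can be checked with a short experiment. The following is a minimal sketch, not part of the original answer: it assumes nonmetric MDS via isoMDS() from the MASS package, uses the built-in eurodist data purely as an illustration, and the seed, the number of random starts and the scale of the random starts are arbitrary choices.

```r
library(MASS)  # isoMDS(); MASS is a recommended package bundled with R

set.seed(1)  # arbitrary seed, only for reproducibility
n_points <- attr(eurodist, "Size")
typical_distance <- mean(as.vector(eurodist))

# Nonmetric MDS in 2 dimensions, started from a random configuration.
# The random start is scaled to the size of the dissimilarities (an
# illustrative choice, not a requirement of isoMDS).
stress_from_random_start <- function() {
  init <- matrix(rnorm(n_points * 2, sd = typical_distance), ncol = 2)
  isoMDS(eurodist, y = init, k = 2, trace = FALSE)$stress
}

# Ten random starts versus the default start (the classical MDS solution).
random_start_stress <- replicate(10, stress_from_random_start())
default_start_stress <- isoMDS(eurodist, k = 2, trace = FALSE)$stress

# isoMDS() reports stress in percent.
print(round(random_start_stress, 2))
print(round(default_start_stress, 2))
```

If the printed values differ across starts, that spread is the dependence on the initial configuration described above; repeating the experiment with a different k shows how it compares with the effect of the number of dimensions.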

If you want to find the most appropriate number of dimensions based on the stress value, there is a straightforward approach: compute the stress for 2, 3, 4, ..., n-1 dimensions, where n is the original number of dimensions of the data. This is the pragmatic way of depicting the inverse relation between the number of dimensions and stress.

The result of these computations is easiest to read as a scree plot of stress against the number of dimensions. The example below is from Cox and Cox (2001):

[Figure: scree plot of stress against number of dimensions]

Now we can choose the number of dimensions based on this relation. It is a trade-off: more dimensions give lower stress (a more accurate map) but less dimension reduction (a map that is harder to visualize and interpret).
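The procedure of computing stress over a range of dimensions can be sketched in base R. This is an illustration under stated assumptions rather than the answer's own code: stress1() is a hypothetical helper that computes Kruskal's Stress-1 directly from the raw dissimilarities (no monotone transformation), the configurations come from cmdscale(), which performs classical MDS and does not itself minimize stress, and the eurodist data and the range 1 to 6 are arbitrary choices.

```r
# Hypothetical helper: Kruskal's Stress-1 between the observed
# dissimilarities and the distances of a fitted configuration.
stress1 <- function(diss, config) {
  d_obs <- as.vector(as.dist(diss))
  d_fit <- as.vector(dist(config))
  sqrt(sum((d_obs - d_fit)^2) / sum(d_fit^2))
}

# Fit a classical MDS solution for each candidate dimensionality
# and record its stress.
dims <- 1:6
stress_by_dim <- sapply(dims, function(k) {
  stress1(eurodist, cmdscale(eurodist, k = k))
})
names(stress_by_dim) <- dims
print(round(stress_by_dim, 3))

# Scree plot of stress against the number of dimensions
# (drawn only in an interactive session).
if (interactive()) {
  plot(dims, stress_by_dim, type = "b", las = 1,
       xlab = "Number of dimensions", ylab = "Stress-1")
}
```

The usual reading is to look for an "elbow": the point after which adding a dimension no longer lowers the stress by much. With a stress-minimizing routine the same loop applies; only the call that produces the configuration changes.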

Besides, the appropriate number of dimensions is not decided solely by the stress value. Your goal also matters. If you want a 2D map, you choose two dimensions and then try to minimize the stress as much as possible.

If, however, what you are really asking is "how much stress is too much", that is another story. One way to evaluate the magnitude of your stress is to compare it with the average stress of different possible configurations of your data set (have a look at "Multidimensional Scaling in R: SMACOF" by Patrick Mair).
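The idea of judging a stress value against a reference distribution can be sketched as follows. This is a simplified base-R stand-in, not the procedure from the cited paper: it assumes classical MDS via cmdscale(), a hypothetical stress1() helper, and a benchmark built by randomly reassigning the observed dissimilarities to pairs of objects. The data set, seed and number of replications are arbitrary.

```r
# Hypothetical helper: Kruskal's Stress-1 from raw dissimilarities.
stress1 <- function(diss, config) {
  d_obs <- as.vector(as.dist(diss))
  d_fit <- as.vector(dist(config))
  sqrt(sum((d_obs - d_fit)^2) / sum(d_fit^2))
}

# Stress-1 of a k-dimensional classical MDS fit.
fit_stress <- function(diss, k) {
  stress1(diss, cmdscale(diss, k = k))
}

# Randomly reassign the observed dissimilarities to pairs of objects,
# which destroys any low-dimensional structure in the data.
permute_dissimilarities <- function(diss) {
  m <- as.matrix(diss)
  m[lower.tri(m)] <- sample(m[lower.tri(m)])
  as.dist(m)  # as.dist() reads the lower triangle
}

set.seed(1)  # arbitrary seed, only for reproducibility
observed_stress <- fit_stress(eurodist, k = 2)
random_stress <- replicate(200, fit_stress(permute_dissimilarities(eurodist), k = 2))

print(observed_stress)
print(mean(random_stress))
# Share of random benchmarks with stress at or below the observed value.
print(mean(random_stress <= observed_stress))
```

An observed stress well below the benchmark average suggests that the configuration captures real structure; a value close to it suggests the stress is "too much" for that number of dimensions.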

Are the randomly generated coordinates, number of variables, and number of categories in a variable related?

Sorry, but I don't understand this part of your question.

Related Solutions

Solved – How to calculate the R-squared value and assess the model fit in multidimensional scaling

You can look at the "GOF" ("goodness of fit") component of the cmdscale result, provided you specify the number of dimensions. It contains two numbers, which are equal unless some of the eigenvalues are negative, that is, unless the distances are not Euclidean.

You can also directly look at the eigenvalues: when they become small, you have enough dimensions.

In the following example, two dimensions seem sufficient.

> cmdscale(eurodist, 1, eig=TRUE)$GOF
[1] 0.4690928 0.5401388
> cmdscale(eurodist, 2, eig=TRUE)$GOF
[1] 0.7537543 0.8679134
> cmdscale(eurodist, 3, eig=TRUE)$GOF
[1] 0.7904600 0.9101784
> r <- cmdscale(eurodist, eig=TRUE)
> plot(cumsum(r$eig) / sum(r$eig), 
       type="h", lwd=5, las=1, 
       xlab="Number of dimensions", 
       ylab=expression(R^2))
> plot(r$eig, 
       type="h", lwd=5, las=1, 
       xlab="Number of dimensions", 
       ylab="Eigenvalues")

Solved – MDS: Is Kruskal’s Stress-1 affected by scale of the data, or the number of points

Scale should make no difference. But, all else being equal, the greater the number of points, the higher the stress.

As ttnphns comments, the cause is that with fewer observations the model over-fits, so the stress is biased downward. As the number of observations grows, the bias shrinks.

Pretty much every measure of goodness-of-fit with a fixed minimum and maximum, as in this case, suffers from the same problem. For example, R-squared goes down as the number of observations goes up, all else being equal, and the Adjusted R-squared was developed to address this. While I appreciate that it would be great to have measures that are not influenced by the number of observations, the degree of "noise" differs from problem to problem, so this is probably not solvable (e.g., Adjusted R-squared is not used by people with a good knowledge of regression).

You can compare different data sets by randomly subsampling the larger data set down to the size of the smaller one. For example, if data set 1 has 20 observations and data set 2 has 30, randomly sample 20 observations from data set 2 and compare the resulting stress with that of data set 1. If you repeat this many times, you can carry out a significance test comparing the stress levels.
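The subsampling comparison can be sketched in base R. The example is illustrative only: it assumes the built-in UScitiesD (10 cities) and eurodist (21 cities) data as the smaller and larger data sets, classical MDS via cmdscale(), and hypothetical helpers stress1() and subsample_stress(). The seed and the 200 repetitions are arbitrary.

```r
# Hypothetical helper: Kruskal's Stress-1 from raw dissimilarities.
stress1 <- function(diss, config) {
  d_obs <- as.vector(as.dist(diss))
  d_fit <- as.vector(dist(config))
  sqrt(sum((d_obs - d_fit)^2) / sum(d_fit^2))
}

# Stress of a k-dimensional fit to a random subset of n_sub objects.
subsample_stress <- function(diss, n_sub, k = 2) {
  m <- as.matrix(diss)
  keep <- sample(nrow(m), n_sub)
  sub <- as.dist(m[keep, keep])
  stress1(sub, cmdscale(sub, k = k))
}

set.seed(1)  # arbitrary seed, only for reproducibility
n_small <- attr(UScitiesD, "Size")  # size of the smaller data set

# Stress of the smaller data set, fitted in 2 dimensions.
stress_small <- stress1(UScitiesD, cmdscale(UScitiesD, k = 2))

# Distribution of stress for equally sized random subsets of the larger one.
stress_large_sub <- replicate(200, subsample_stress(eurodist, n_sub = n_small))

print(stress_small)
print(summary(stress_large_sub))
# Share of subsamples whose stress is at or below that of the smaller data set.
print(mean(stress_large_sub <= stress_small))
```

The final proportion acts as a rough empirical p-value for the claim that the smaller data set fits no better than an equally sized piece of the larger one.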
