Vector fitting is a regression. Explicitly, the model fitted is
$$y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$
where $y$ is the environmental variable for which we want a vector, $X_i$ is the score on the $i$th ordination "axis" (here the first two ordination "axes"), and $\varepsilon$ is the unexplained variance. Both $y$ and the $X_i$ are centred prior to fitting the model, hence there is no intercept. The $\hat{\beta}_i$ are the coordinates of the vector for $y$ in the space spanned by the ordination axes; these may be normalised to unit length.
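The effect of centring can be checked directly: once $y$ and the $X_i$ are centred, the fitted intercept is zero, so the slopes alone locate the vector. A small sketch with made-up data (all variable names here are hypothetical):

```r
## centring both response and predictors removes the intercept
set.seed(10)
X1 <- rnorm(30); X2 <- rnorm(30)
y  <- 1.5 * X1 - 0.5 * X2 + rnorm(30)

cy  <- y  - mean(y)
cX1 <- X1 - mean(X1); cX2 <- X2 - mean(X2)

coef(lm(cy ~ cX1 + cX2))["(Intercept)"]  # essentially zero (up to rounding)
```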
As this is a regression, $R^2$ is easily computed, as is the significance of the coefficients or of $R^2$ itself. However, we presume that the model assumptions are not fully met, and hence we use a permutation test to assess the significance of the model's $R^2$.
In the permutation test we permute the values of the response $y$ into a random order, fit the regression model (equation above) to the permuted response, and compute the $R^2$ between the permuted response and the fitted values from the model. This $R^2$ value is recorded, and the procedure is repeated with a different random permutation, a modest number of times (say 999). Under the null hypothesis of no relationship between the ordination "axis" scores and the environmental variable, the observed $R^2$ should be a typical value among the permuted $R^2$ values. If, however, the observed $R^2$ is extreme relative to the permutation distribution of $R^2$, we have substantial evidence against the null hypothesis. The proportion of permuted $R^2$ values equal to or greater than the observed $R^2$ is the permutation $p$-value.
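The loop just described can be sketched in base R. This is a minimal illustration with made-up data standing in for the environmental variable and the two axis scores; envfit does the equivalent work more efficiently internally:

```r
## minimal sketch of the permutation test for R^2 (hypothetical data)
set.seed(1)
X1 <- rnorm(24); X2 <- rnorm(24)
y  <- 2 * X1 + rnorm(24)

obsR2 <- summary(lm(y ~ X1 + X2))$r.squared   # observed R^2

nperm  <- 999
permR2 <- numeric(nperm)
for (i in seq_len(nperm)) {
    yperm     <- sample(y)                                 # shuffle the response
    permR2[i] <- cor(yperm, fitted(lm(yperm ~ X1 + X2)))^2 # R^2 under the null
}

## permutation p-value: proportion of null R^2 >= observed, counting the
## observed statistic as one realisation of the null distribution
pval <- (sum(permR2 >= obsR2) + 1) / (nperm + 1)
```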
A fully worked example may help with this. Using the vegan package for R and some of its in-built data:
require(vegan)
data(varespec)
data(varechem)
## fit PCA
ord <- rda(varespec)
## fit vector for Al - gather data
dat <- cbind.data.frame(Al = varechem$Al,
                        scores(ord, display = "sites", scaling = 1))
## fit the model
mod <- lm(Al ~ PC1 + PC2, data = dat)
summary(mod)
This gives
> summary(mod)
Call:
lm(formula = Al ~ PC1 + PC2, data = dat)
Residuals:
    Min      1Q  Median      3Q     Max 
-172.30  -58.00  -12.54   58.44  239.46 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  142.475     19.807   7.193 4.34e-07 ***
PC1           31.143      9.238   3.371  0.00289 ** 
PC2           27.492     13.442   2.045  0.05356 .  
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 97.04 on 21 degrees of freedom
Multiple R-squared: 0.4254, Adjusted R-squared: 0.3707
F-statistic: 7.774 on 2 and 21 DF, p-value: 0.002974
Note the value of the Multiple R-squared (0.4254).
vegan has a canned function, envfit(), that does all of this for multiple environmental variables at once. Compare the $R^2$ from above with the vector-fitted value (to keep things simple I use only Al here, but you could pass all of varechem and envfit would fit vectors [centroids for factors] for all its variables):
set.seed(42) ## make this reproducible - pseudo-random permutations!
envfit(ord, varechem[, "Al", drop = FALSE])
> envfit(ord, varechem[, "Al", drop = FALSE])
***VECTORS
          PC1     PC2     r2 Pr(>r)   
Al    0.85495 0.51871 0.4254  0.004 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
P values based on 999 permutations.
The two $R^2$ values shown are exactly the same.
[Do note that envfit doesn't actually fit models via lm internally; it uses a QR decomposition. This is the same method employed deeper down inside lm, but envfit calls it directly to fit the model manually, without the extra overhead that something like lm.fit would bring.]
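That QR route can be sketched in a few lines. This is a simplified, unweighted version of the idea, reusing dat from the lm example above; vegan's own code also handles row weights and many variables at once:

```r
## unweighted sketch of fitting via a QR decomposition instead of lm()
X <- scale(as.matrix(dat[, c("PC1", "PC2")]), scale = FALSE)  # centre the axis scores
y <- dat$Al - mean(dat$Al)                                    # centre the response

QR  <- qr(X)              # QR decomposition of the centred scores
fit <- qr.fitted(QR, y)   # fitted values, as lm.fit() would return
r2  <- cor(fit, y)^2      # the R^2 reported above (0.4254 here)

b     <- qr.coef(QR, y)   # regression coefficients
arrow <- b / sqrt(sum(b^2))  # normalised to unit length: the arrow coordinates
## note: unlike R^2, the arrow coordinates depend on the axis scaling
## used when the site scores were extracted
```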
Best Answer
I wouldn't place much stock in "rules of thumb" such as this. It depends on so many things: the number of variables, the number of sites, which dissimilarity you use, and so on. Note also that the vector-fitting approach is inherently linear, and we have no reason to presume that the relationship between the variable and the NMDS configuration is linear.
The key point is that the correlation is small-to-modest but significant. You probably do want to look at the linearity assumption, though.
In the vegan package for R we have the function ordisurf() for this. It fits a 2-d surface to an NMDS solution using a GAM, via the function gam() in package mgcv. It essentially fits the model
$$g(\mu_i) = \eta_i = \beta_0 + f(\text{nmds}_{1i}, \text{nmds}_{2i})$$
where $f$ is a 2-d smooth function of the first and second NMDS axes, $g$ is the link function, $\mu$ the expectation of the response, and $\eta$ the linear predictor. The error distribution is a member of the exponential family of distributions. The function $f$ can be isotropic, in which case we use smooths formed by s(), employing thin plate splines by default. Anisotropic surfaces can be fitted too, using tensor product smooths formed by te(). The complexity of the smooth is chosen during fitting using, by default in the development versions of vegan, a REML criterion; GCV smoothness selection is also available.
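The two kinds of smooth in the model above can be written directly in mgcv. A schematic with made-up data standing in for the two NMDS axis scores (all names here are hypothetical):

```r
library(mgcv)

## made-up stand-ins for the two NMDS axis scores and a response
set.seed(2)
x1 <- rnorm(50); x2 <- rnorm(50)
z  <- x1 + x2^2 + rnorm(50, sd = 0.2)

## isotropic: a thin plate regression spline over both axes jointly
m_iso <- gam(z ~ s(x1, x2), method = "REML")

## anisotropic: a tensor product smooth, allowing different
## wiggliness in each axis direction
m_ani <- gam(z ~ te(x1, x2), method = "REML")
```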
Here is an R example using one of the in-built data sets provided with vegan
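A minimal sketch of such an example, continuing with the varespec/varechem data and the Al variable from earlier (the metaMDS settings and select = TRUE are illustrative choices, not the only sensible ones):

```r
## NMDS of the varespec data, then a fitted GAM surface for Al
library(vegan)
data(varespec)
data(varechem)

sol  <- metaMDS(varespec, trace = FALSE)          # NMDS configuration
surf <- ordisurf(sol ~ Al, data = varechem,
                 method = "REML", select = TRUE)  # 2-d smooth via mgcv's gam()
summary(surf)                                     # summary of the underlying GAM

## linear vector fit for Al, added to the surface plot for comparison
fit <- envfit(sol ~ Al, data = varechem)
plot(fit, add = TRUE)
```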
[This produces model output and a plot of the fitted surface over the NMDS ordination, with the fitted vector overlaid; figures not reproduced here.]
In this case a linear vector fit seems reasonable for this variable. Read ?ordisurf for details on the arguments used, especially what select = TRUE does.