Solved – NMDS and variance explained by vector fitting

multidimensional scalingr-squaredvariance

I just ran a non metric multidimensional scaling model (nmds) which compared multiple locations based on benthic invertebrate species composition. After running the analysis, I used the vector fitting technique to see how the resulting ordination would relate to some environmental variables.

My first question is: I got an R squared value of .18 for a variable representing "depth" at a site. It has the highest R square value of any environmental variable by far (the rest are less than 0.07) and is the only variable which is significant at the alpha=0.05 level.
I'm wondering if there are any thresholds for how much variance explained, is a good amount. I know it might depend on the scientific discipline, I've heard things like 30% for ecology, but I can't track down a reference for this.

My second question is, am I right in saying that for this case the % of variance explained, is the % of variance in the depth data explained by the nmds ordination axes? Thanks!

Best Answer

I wouldn't place much stock in "rules of thumb" such as this. It is dependent upon so many things such as the number of variables, the number of sites, what dissimilarity you use etc. Also note that the vector fitting approach is inherently linear and we have no reason to presume that the relationship between the variable and the NMDS configuration is linear.

The key thing is that it is a small/modest correlation but that it is significant. But you probably want to look at the linearity assumption.

In the vegan package for R we have function ordisurf() for this. It fits a 2-d surface to an NMDS solution using a GAM via function gam() in package mgcv. It essentially fits the model

$$g(\mu_i) = \eta_i = \beta_0 + f(nmds_{1i}, nmds_{2i})$$

where $f$ is a 2-d smooth function of the 1st and 2nd axes of the NMDS, $g$ is the link function, $\mu$ the expectation of the response, and $\eta$ is the linear predictor. The error distribution is a member of the exponential family of distributions. The function $f$ can be isotropic in which case we use smooths formed by s() employing by default thin plate splines. Anisotropic surfaces can be fitted too, where we use smooths formed by te(); tensor product smooths.

The complexity of the smooth is chosen during fitting using a, by default in the development versions, REML criterion. GCV smoothness selection is also available.

Here is an R example using one of the in-built data sets provided with vegan

require("vegan")
data(dune)
data(dune.env)

## fit NMDS using Bray-Curtis dissimilarity (default)
set.seed(12)
sol <- metaMDS(dune)

## NMDS plot
plot(sol)
## Fit and add the 2d surface
sol.s <- ordisurf(sol ~ A1, data = dune.env, method = "REML", 
                  select = TRUE)
## look at the fitted model
summary(sol.s)

This produces

> summary(sol.s)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x1, x2, k = knots[1], bs = bs[1])
<environment: 0x2fb78a0>

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.8500     0.4105   11.81 9.65e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
           edf Ref.df     F p-value  
s(x1,x2) 1.591      9 0.863  0.0203 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =   0.29   Deviance explained =   35%
REML score = 41.587  Scale est. = 3.3706    n = 20

and

enter image description here

In this case a linear vector fit seems reasonable for this variable. Read ?ordisurf for details on the arguments used, especially what select = TRUE does.