A colleague and I are using principal component analysis (PCA) and non-metric multidimensional scaling (NMDS) to examine how environmental variables influence patterns in benthic community composition. A common method is to fit environmental vectors onto an ordination. The length and direction of the vectors seem fairly straightforward, but I don't understand how an $R^2$ value or a $p$-value is calculated for these vectors. I have looked at a dozen papers, and the most I can gather is that these numbers are calculated using permutations of the data. This does not seem very intuitive. What data is being permuted? How does this permutation create an $R^2$ value, and what variance is being explained? My limited understanding of $R^2$ comes from linear regression. I need to explain this to people who have little to no background in statistics, so any help understanding these concepts, or a link to an accessible text, would be greatly appreciated. Thanks so much!
Solved – vector fit interpretation NMDS
Tags: correlation, pca, r, r-squared
Related Solutions
I wouldn't place much stock in "rules of thumb" such as this: it depends on so many things, such as the number of variables, the number of sites, which dissimilarity you use, and so on. Also note that the vector-fitting approach is inherently linear, and we have no reason to presume that the relationship between the variable and the NMDS configuration is linear.
The key thing is that the correlation is small/modest but significant. You probably also want to check the linearity assumption.
In the vegan package for R we have the function `ordisurf()` for this. It fits a 2-d surface to an NMDS solution using a GAM, via the function `gam()` in package mgcv. It essentially fits the model

$$g(\mu_i) = \eta_i = \beta_0 + f(nmds_{1i}, nmds_{2i})$$

where $f$ is a 2-d smooth function of the 1st and 2nd NMDS axis scores, $g$ is the link function, $\mu_i$ the expectation of the response, and $\eta_i$ the linear predictor. The error distribution is a member of the exponential family of distributions. The function $f$ can be isotropic, in which case we use smooths formed by `s()`, employing thin plate splines by default. Anisotropic surfaces can be fitted too, using tensor product smooths formed by `te()`.
The complexity of the smooth is chosen during fitting using, by default in the development versions of vegan, a REML criterion; GCV smoothness selection is also available.
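The isotropic versus anisotropic choice can be sketched via `ordisurf()`'s `isotropic` argument; this is a minimal sketch using the dune data shipped with vegan, and the exact fitted surfaces will depend on your data and settings:

```r
## sketch: isotropic (s(), thin plate) vs anisotropic (te(), tensor
## product) surfaces with ordisurf(); dune data are built into vegan
library(vegan)
data(dune)
data(dune.env)

set.seed(12)
sol <- metaMDS(dune, trace = FALSE)

## isotropic thin-plate-spline surface (the default)
iso <- ordisurf(sol ~ A1, data = dune.env, isotropic = TRUE, plot = FALSE)

## anisotropic tensor-product surface
aniso <- ordisurf(sol ~ A1, data = dune.env, isotropic = FALSE, plot = FALSE)
```

An isotropic smooth assumes the same degree of smoothness in every direction of the ordination space, which is usually reasonable because NMDS axes are on a common scale.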
Here is an R example using one of the in-built data sets provided with vegan:

```r
require("vegan")
data(dune)
data(dune.env)

## fit NMDS using Bray-Curtis dissimilarity (the default)
set.seed(12)
sol <- metaMDS(dune)

## NMDS plot
plot(sol)

## fit and add the 2-d surface
sol.s <- ordisurf(sol ~ A1, data = dune.env, method = "REML",
                  select = TRUE)

## look at the fitted model
summary(sol.s)
```
This produces

```
> summary(sol.s)

Family: gaussian
Link function: identity

Formula:
y ~ s(x1, x2, k = knots[1], bs = bs[1])
<environment: 0x2fb78a0>

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.8500     0.4105   11.81 9.65e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
           edf Ref.df     F p-value
s(x1,x2) 1.591      9 0.863  0.0203 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.29   Deviance explained = 35%
REML score = 41.587  Scale est. = 3.3706    n = 20
```
and a plot of the NMDS configuration with the fitted surface drawn as contours.
In this case a linear vector fit seems reasonable for this variable. Read `?ordisurf` for details on the arguments used, especially what `select = TRUE` does.
If you want to produce an NMDS plot with one point for each site, you will first need to pool your sampling points to produce a single community for each site. You could produce separate plots like this for each year, or have them all on the same plot e.g. plot1_year1, plot1_year2 etc...
Alternatively, you could keep your data having one row for each sampling point. You could then plot all of the sampling points, and give each point a colour corresponding to which site it is from. This will allow you to visualise whether sampling points from the same site cluster together. Check out the vignettes from the R package vegan for examples of how to do this.
I'm not clear on what the point of the replication was... Maybe just pool your replicates to give a single row per sampling point.
It sounds like sampling intensity was identical between sampling points and sites, but you might want to think about this to make sure.
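Pooling replicates down to one row per site is a one-liner in base R. A minimal sketch, where `comm` and `site` are made-up stand-ins for your abundance data and site labels:

```r
## hypothetical community matrix: 4 sampling points, 2 species
comm <- data.frame(
  sp1 = c(2, 1, 0, 4),
  sp2 = c(0, 3, 5, 1)
)
## which site each sampling point belongs to
site <- c("A", "A", "B", "B")

## rowsum() sums abundances within each site, giving one row per site
pooled <- rowsum(comm, group = site)
pooled
##   sp1 sp2
## A   3   3
## B   4   6
```

Replacing `rowsum()` with an `aggregate(..., FUN = mean)` call would average instead, but as noted above, pooling (summing) is usually preferable for count data.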
Once you have some NMDS plots, you can fit your environmental variables onto them using the `envfit` function. `envfit` tests whether the correlations are significant using permutations, so the data do not need to be normally distributed.
If you want to test for effect of specific environmental variables, you will need to take into account spatial autocorrelation - sites that are far apart are likely to differ more in community composition and environmental variables than sites that are close together. To take this into account you can use partial mantel tests. In a similar way to how your community data is transformed into a distance matrix for NMDS, you need to construct a distance matrix for your sites based on geographic distance. The partial mantel test can then partial out the effect of geographic distance to show whether your environmental variables are still important.
You could also carry out exploratory partial mantel analysis, assessing the independent importance of matrices of related environmental variables with effects of other matrices removed. This involves sequentially testing the importance of each variable on community composition once the effects of remaining matrices are partialled out from the analysis.
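A partial Mantel test in vegan follows the recipe above: build three distance matrices and partial out the geographic one. A hedged sketch on made-up data; `comm`, `env`, and `xy` are illustrative stand-ins for community abundances, environmental variables, and site coordinates:

```r
library(vegan)

set.seed(1)
comm <- matrix(rpois(40, lambda = 5), nrow = 8)  # 8 sites x 5 species
env  <- matrix(rnorm(16), nrow = 8)              # 8 sites x 2 env. variables
xy   <- matrix(runif(16), nrow = 8)              # 8 sites, x/y coordinates

comm.d <- vegdist(comm, method = "bray")  # community dissimilarity
env.d  <- dist(scale(env))                # environmental distance
geo.d  <- dist(xy)                        # geographic distance

## correlation of community and environmental distances, with the
## effect of geographic distance partialled out
res <- mantel.partial(comm.d, env.d, geo.d, permutations = 999)
res
```

The printed result includes the partial Mantel statistic and a permutation $p$-value; with the random data above there is, of course, no real structure to detect.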
P.S. Short answers to your questions:

- Do I need to test for normality first? No.
- Do I need to average my abundance data for the sampling points and replicates? Yes, but I would pool (sum) them rather than average.
- How can I incorporate the environmental data into my nMDS? Use `envfit` and partial Mantel tests.
Best Answer
Vector fitting is a regression. Explicitly, the model fitted is
$$y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$
where $y$ is the environmental variable requiring a vector, $X_i$ is the score on the $i$th ordination "axis" (here the first two ordination "axes"), and $\varepsilon$ is the unexplained variance. Both $y$ and the $X_i$ are centred prior to fitting the model, hence there is no intercept. The $\hat{\beta}_i$ are the coordinates of the vector for $y$ in the space spanned by the ordination axes; these may be normalised to unit length.
As this is a regression, $R^2$ is easily computed, as could be the significance of the coefficients or of $R^2$ itself. However, we presume that the model assumptions are not fully met, and hence we use a permutation test for the significance of the model's $R^2$.
The permutation test doesn't create the overall $R^2$. Instead, we permute the values of the response $y$ into a random order, use the fitted regression model (equation above) to predict the randomised response data, and compute the $R^2$ between the randomised response and the fitted values from the model. This $R^2$ value is recorded, and the procedure is repeated with a different random permutation, a modest number of times (say 999).

Under the null hypothesis of no relationship between the ordination "axis" scores and the environmental variable, the observed $R^2$ should be a common value among the permuted $R^2$ values. If, however, the observed $R^2$ is extreme relative to the permutation distribution of $R^2$, then it is unlikely that the null hypothesis is correct, as we have substantial evidence against it. The proportion of times a randomised $R^2$ is equal to or greater than the observed $R^2$ is known as the permutation $p$-value.
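The permutation scheme just described can be sketched in a few lines of base R. All names here are illustrative; `X` stands in for centred NMDS axis scores and `y` for a centred environmental variable:

```r
## made-up data: 20 "sites", 2 ordination axes, y related to the axes
set.seed(42)
X <- matrix(rnorm(40), ncol = 2)
y <- X %*% c(1, -0.5) + rnorm(20)
X <- scale(X, scale = FALSE)                 # centre the "axis" scores
y <- as.numeric(scale(y, scale = FALSE))     # centre the response

fit <- lm(y ~ X - 1)                         # vector fit: no intercept
r2.obs <- cor(y, fitted(fit))^2              # observed R^2

nperm <- 999
r2.perm <- numeric(nperm)
for (i in seq_len(nperm)) {
  y.rand <- sample(y)                        # permute the response
  r2.perm[i] <- cor(y.rand, fitted(fit))^2   # R^2 under the null
}

## permutation p-value: how often a randomised R^2 equals or exceeds
## the observed one (counting the observed value itself, as usual)
p.val <- (sum(r2.perm >= r2.obs) + 1) / (nperm + 1)
p.val
```

Because `y` was simulated with a real relationship to `X`, the observed $R^2$ sits far out in the tail of the permutation distribution and `p.val` comes out small.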
A fully worked example may help with this, using the vegan package for R and its in-built varespec and varechem data sets.
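The code for this example was lost to formatting; a reconstruction along the lines described (an NMDS of the varespec data, then a regression of `varechem$Al` on the site scores) would be as follows. The seed here is arbitrary, as the original was not preserved, so the printed $R^2$ will not necessarily match the 0.4254 quoted in the text:

```r
## reconstruction of the worked example: regress an environmental
## variable on the NMDS site scores by hand
library(vegan)
data(varespec)
data(varechem)

set.seed(3)                      # arbitrary; the original seed is unknown
sol <- metaMDS(varespec, trace = FALSE)

## site scores on the first two NMDS axes
scrs <- scores(sol, display = "sites", choices = 1:2)
dat <- data.frame(scrs, Al = varechem$Al)

## with an intercept this gives the same R^2 as the centred,
## no-intercept formulation in the equation above
mod <- lm(Al ~ NMDS1 + NMDS2, data = dat)
summary(mod)
```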
Running that regression gives the usual `lm` summary output; note the value of the Multiple R-squared (0.4254).

vegan has a canned function for doing all of this, on multiple environmental variables at once: `envfit()`. Compare the $R^2$ from above with the vector-fitted value (to keep things simple I just do `Al` here, but you could pass all of `varechem` and `envfit` would fit vectors [centroids for factors] for all variables). The two $R^2$ values shown are exactly the same.
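The `envfit()` call itself was also dropped in formatting; a reconstruction under the same assumptions (varespec NMDS, variable `Al` from varechem) would be:

```r
## fit the same vector with vegan's envfit() and compare its r2 with
## the Multiple R-squared from the manual lm() fit
library(vegan)
data(varespec)
data(varechem)

set.seed(3)                      # arbitrary; match whatever NMDS you fitted
sol <- metaMDS(varespec, trace = FALSE)

fit <- envfit(sol ~ Al, data = varechem, permutations = 999)
fit   # prints vector coordinates, r2, and the permutation p-value
```

Passing `sol ~ .` (or the whole `varechem` data frame) instead of `sol ~ Al` would fit vectors for every variable at once.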
[Do note that `envfit` doesn't actually fit models via `lm` internally; it uses a QR decomposition. This is the same method employed deeper down in `lm`, but we call it directly to fit the model manually, as we want it without the extra things that something like `lm.fit` would give us.]