Solved – vector fit interpretation NMDS

correlationpcarr-squared

So a colleague and myself are using principal component analysis (PCA) or non metric multidimensional scaling (NMDS) to examine how environmental variables influence patterns in benthic community composition. A common method is to fit environmental vectors on to an ordination. The length and direction of the vectors seems somewhat straighforward but I don't understand how an R squared value or a p-value is calculated for these vectors. I have looked at a dozen papers and the most I can gather is that these numbers are calculated using permutations of the data. This does not seem very intuitive. What data is being permuted? How does this permutation create an R squared value and what variance is being explained? My limited understanding of an R squared value comes from linear regressions. I need to explain this to people who have little to no background in statistics so any help understanding these concepts or a link to an available text would be greatly appreciated. Thanks so much!

Best Answer

Vector fitting is a regression. Explicitly, the model fitted is

$$y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$

where $y$ is the environmental variable requiring a vector, $X_i$ is the $i$th ordination "axis" score (here for the first two ordination "axes") and $\varepsilon$ the unexplained variance. Both $y$ and $X_i$ are centred prior to fitting the model, hence no intercept. The $\hat{\beta}_j$ are the coordinates of the vector for $y$ in the ordination space spanned by the $i$ ordination axes; these may be normalised to unit length.

As this is a regression, $R^2$ is easily computed and so could the significance of the coefficients or $R^2$. However, we presume that the model assumptions are not fully met and hence we use a permutation test to test significance of the $R^2$ of the model.

The permutation test doesn't create the overall $R^2$, what is done is that we permute the values of the response $y$ into random order. Next we use the fitted regression model (equation above) to predict the randomised response data and compute the $R^2$ between the randomised response and the fitted values from the model. This $R^2$ value is recorded and then the procedure is done again with a different random permutation. We keep doing this a modest number of times (say 999). Under the null hypothesis of no relationship between the ordination "axis" scores and the environmental variable, the observed $R^2$ value should be a common value among the permuted $R^2$ values. If however the observed $R^2$ is extreme relative to the permutation distribution of $R^2$ then it is unlikely that the Null hypothesis is correct as we have substantial evidence against it. The proportion of times a randomised $R^2$ from the distribution is equal to or greater than the observed $R^2$ is a value known as the permutation $p$ value.

An example, fully worked may help with this. Using the vegan package for R and some in-built data

require(vegan)
data(varespec)
data(varechem)

## fit PCA
ord <- rda(varespec)

## fit vector for Al - gather data
dat <- cbind.data.frame(Al = varechem$Al,
                        scores(ord, display = "sites", scaling = 1))

## fit the model
mod <- lm(Al ~ PC1 + PC2, data = dat)
summary(mod)

This gives

> summary(mod)

Call:
lm(formula = Al ~ PC1 + PC2, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-172.30  -58.00  -12.54   58.44  239.46 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  142.475     19.807   7.193 4.34e-07 ***
PC1           31.143      9.238   3.371  0.00289 ** 
PC2           27.492     13.442   2.045  0.05356 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 97.04 on 21 degrees of freedom
Multiple R-squared:  0.4254,    Adjusted R-squared:  0.3707 
F-statistic: 7.774 on 2 and 21 DF,  p-value: 0.002974

Note the value for the Multiple R-squared (0.4254).

vegan has a canned function for doing all of this, on multiple environmental variables at once; envfit(). Compare the $R^2$ from above with the vector-fitted value (to keep things simple I just do Al here, but you could pass all of varechem and envfit would fit vectors [centroids for factors] for all variables.)

set.seed(42) ## make this reproducible - pseudo-random permutations!
envfit(ord, varechem[, "Al", drop = FALSE])

> envfit(ord, varechem[, "Al", drop = FALSE])

***VECTORS

       PC1     PC2     r2 Pr(>r)   
Al 0.85495 0.51871 0.4254  0.004 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
P values based on 999 permutations.

The two $R^2$ values shown are exactly the same.

[Do note that envfit doesn't actually fit models via lm internally - it uses a QR decomposition. This is the same methods employed deeper down in lm but we call it directly to fit the model manually as we want it without the extra things that something like lm.fit would give us.]

Related Question