[GIS] Significance test for Moran’s I using Monte Carlo Simulation

spatial statistics

I fitted a model, calculated its residuals, and measured Moran's I on those residuals.

How can I programmatically determine the significance of Moran's I using Monte Carlo simulation?

Best Answer

Asking for a "Monte Carlo simulation" is akin to asking to determine the significance using multiplication: like multiplication, Monte Carlo is merely a computational technique. Thus, we first have to figure out what it is we should be computing.

I would like to suggest considering a permutation test. The null hypothesis is that the data were determined and then assigned to their spatial locations at random. The alternative is that the assignment to each location depended (somehow) on the assignment at that location's neighbors. The distribution of any measure of spatial autocorrelation can be obtained by constructing all possible assignments. The permutation test does not randomize over the possible sets of data values--it considers them given--but conditional on the data observed, considers all possible ways of reassigning them to the locations.

Such a reassignment is a permutation. For n data points, there are n! = n * (n-1) * (n-2) * ... * (2) * (1) permutations. For n much larger than 10 or so, that's too many to generate. There usually is no simple analytical expression for the full permutation distribution, either. Accordingly, we typically resort to sampling from the set of all permutations at random, giving them all equal weight. The distribution of the autocorrelation statistic in a sufficiently large sample (usually involving at least 500 permutations) approximates the true distribution. This is the "Monte Carlo" part of the calculation.
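In R, that sampling step amounts to repeated calls to `sample`. Here is a minimal sketch with a toy statistic; the data, the statistic (correlation of the values with their positions), and the choice of 999 permutations are illustrative assumptions, not part of the grid examples below:

```r
# Monte Carlo approximation to a permutation distribution: keep the data
# values fixed, repeatedly shuffle their assignment to positions, and
# recompute the statistic each time.
set.seed(1)
z <- c(0.2, 1.5, 0.7, 2.1, 3.0, 2.4, 4.2, 3.9)   # hypothetical data
stat <- function(v) cor(v, seq_along(v))         # toy statistic: trend with position
obs <- stat(z)
sim <- replicate(999, stat(sample(z)))           # 999 random reassignments
p <- mean(c(obs, sim) >= obs)                    # upper-tail p, observed included
```

Any statistic can be plugged in for `stat`; Moran's I is simply the statistic of interest here.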

In the case of Moran's I, a large positive value is evidence of positive correlation and a large negative value is evidence of negative correlation. You can test for either or both by locating the actual value of I (for the data as observed) within the permutation distribution. The p-value is the proportion of the permutation distribution (including the observed value of I itself) that is at least as extreme as the observed statistic.
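Concretely, once an observed value and a sampled permutation distribution are in hand, the one- and two-sided p-values are simple proportions. The numbers below are made-up stand-ins, not values from the figures:

```r
# Locating the observed statistic within a permutation distribution.
# I.obs and I.sim are hypothetical placeholders for the observed Moran's I
# and a Monte Carlo sample of its permutation distribution.
set.seed(2)
I.obs <- 0.15
I.sim <- rnorm(9999, mean = -1/39, sd = 0.08)   # stand-in permutation sample
all <- c(I.obs, I.sim)                          # include the observed value
p.upper <- mean(all >= I.obs)                   # test for positive correlation
p.lower <- mean(all <= I.obs)                   # test for negative correlation
p.two <- min(1, 2 * min(p.upper, p.lower))      # test for correlation of either sign
```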

The following examples will illustrate. In both, data on a 5 by 8 grid were generated. The weights are proportional to 1 for cells that share a side, to sqrt(1/2) = 0.7071 for cells that only share a corner, and to 0 otherwise. The left column maps the data (yellow is higher than red); the middle column shows histograms of the data; and the right column displays the estimated permutation distribution of Moran's I (using a sample of 10,000). The vertical red dotted lines show the actual values of Moran's I in the data.

[Figure: for each of the two examples, a map of the data, a histogram of the data, and the estimated permutation distribution of Moran's I with the observed value marked by a dotted red line.]

In the case of data produced by a mechanism that induces some autocorrelation (the top row), this value is near the higher extremes of the distribution: only 4.53% equal or exceed it. This often would be considered significant evidence of positive autocorrelation. In the case of data generated with a purely random mechanism (the bottom row), Moran's I is near the middle of the distribution (almost exactly at its mean value of -1/39 = -0.026). Nobody would confuse this with a significant result. In these examples, then, the permutation test is performing as intended.

Notice the highly skewed distribution of the autocorrelated data. That was done intentionally. Such skewed data do not conform to the (implicit) assumptions used to justify standard interpretations of Moran's I. (The skewed null distribution testifies to this: the standard interpretations assume it has no skew at all.) It is precisely in such cases that the permutation test produces more reliable results than a "canned" test based on referring a "Z score" to a Normal distribution.
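One way to see the point is to compare the permutation p-value with the p-value a Normal approximation would report against the same skewed null. The distribution and the observed value below are simulated placeholders, not the data from the figures:

```r
# Skewness makes the Normal approximation disagree with the permutation test.
set.seed(3)
sim <- rexp(1e4) - 1                 # a right-skewed stand-in null distribution
obs <- quantile(sim, 0.97)           # a hypothetical observed statistic
p.perm <- mean(c(obs, sim) >= obs)   # permutation-style p-value
z.score <- (obs - mean(sim)) / sd(sim)
p.norm <- pnorm(z.score, lower.tail = FALSE)   # "canned" Normal-theory p-value
# In the right tail of a right-skewed null, p.norm understates p.perm.
```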

For testing model residuals, in each iteration you would usually permute the residuals, refit the model, and compute Moran's I from those new residuals.
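A sketch of that loop for an ordinary least-squares model; the data and the weights matrix are simulated placeholders (in practice, use your own fitted model and spatial weights), and the permuted residuals are added back onto the fitted values before refitting:

```r
# Permutation test for spatial autocorrelation in regression residuals:
# permute the residuals, rebuild a response, refit, and recompute I.
set.seed(17)
n <- 40
pred <- runif(n)                                 # placeholder predictor
resp <- 1 + 2 * pred + rnorm(n)                  # placeholder response
fit <- lm(resp ~ pred)
w <- matrix(runif(n^2), n, n); diag(w) <- 0      # placeholder weights matrix
w <- w / sum(w)                                  # normalize to unit sum
moran <- function(x, weights) {                  # Moran's I for unit-sum weights
  n <- length(x)
  z <- as.vector((x - mean(x)) / sd(x))
  as.vector(z %*% weights %*% z) * n / (n - 1)
}
e <- residuals(fit)
I.obs <- moran(e, w)
I.sim <- replicate(999, {
  y.star <- fitted(fit) + sample(e)              # permuted residuals onto fits
  moran(residuals(lm(y.star ~ pred)), w)         # I of the refit's residuals
})
p <- mean(c(I.obs, I.sim) >= I.obs)
```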


References

Phillip Good, Permutation, Parametric, and Bootstrap Tests of Hypotheses. Third Edition (2005), Springer.


The following R code generated the examples. It is not intended for production work: it is inefficient in its generation of the weights matrix and is highly space-inefficient in storing it. (Its advantage is that it computes Moran's I quickly once the weights have been determined, which is ideal for a permutation test because the weights never change.)

#
# Moran's I
# https://en.wikipedia.org/wiki/Moran%27s_I
#
# Data locations.
#
x <- 1:8 # x-coordinates of points
y <- 1:5 # y-coordinates of points
#
# Weights matrix.
#
ind <- expand.grid(i=1:length(x), j=1:length(y))
f <- function(i, j) {
  d <- abs(unlist(ind[i, ] - ind[j, ]))              # grid offsets
  if (max(d) == 1 && sum(d) == 1) return(1)          # cells share a side
  if (max(d) == 1 && sum(d) == 2) return(sqrt(1/2))  # cells share only a corner
  0                                                  # same cell, or not adjacent
}
w <- matrix(0.0, nrow(ind), nrow(ind))
for (i in 1:nrow(ind)) for (j in 1:nrow(ind)) w[i,j] <- f(i,j)
w <- w / sum(w)
#
# Moran's I.
#
moran <- function(x, weights) {
  n <- length(x)
  z <- as.vector((x - mean(x)) / sd(x))  # standardize (sd divides by n-1)
  # With weights summing to 1, Moran's I is n/(n-1) times z'Wz.
  as.vector(z %*% weights %*% z) * n / (n - 1)
}
#
# Permutation test (including a plot).
#
ppt <- function(z, w, N=1e4, ...) {
  stat <- moran(z, w)
  sim <- replicate(N, moran(sample(z, length(z)), w))
  p.value <- mean((all <- c(stat, sim)) >= stat)
  hist(sim, sub=paste("p =", round(p.value, 4)), xlim=range(all), ...)
  abline(v = stat, col="#903030", lty=3, lwd=2)
  return(p.value)
}
#
# Simulated observations.
#
set.seed(17)
par(mfrow=c(2,3))

# Induce autocorrelation via an underlying trend.
z <- matrix(rexp(length(x)*length(y), outer(x,y^2)), length(x))
image(log(z), main="Autocorrelated Data")
hist(z)
ppt(z, w, main="Null Distribution of Moran's I", xlab="I")

# Generate data independently of location.
z <- matrix(rnorm(length(x)*length(y), 0, 1/2), length(x))
image(z, main="Uncorrelated Data")
hist(z)
ppt(z, w, main="Null Distribution of Moran's I", xlab="I")