[GIS] Robust alternatives to Moran’s I

Tags: algorithm, r, spatial statistics

Moran’s I, a measure of spatial autocorrelation, is not a particularly robust statistic (it can be sensitive to skewed distributions of the spatial data attributes).

What are some more robust techniques for measuring spatial autocorrelation? I’m particularly interested in solutions that are readily available/implementable in a scripting language like R. If solutions apply to unique circumstances/data distributions, please specify those in your answer.


EDIT: I’m expanding the question with a few examples, in response to comments/answers to the original question.

It’s been suggested that permutation techniques (where a Moran’s I sampling distribution is generated using a Monte Carlo procedure) offer a robust solution. My understanding is that such a test eliminates the need to make any assumptions about the distribution of Moran’s I (given that the test statistic can be influenced by the spatial structure of the dataset), but I fail to see how the permutation technique corrects for non-normally distributed attribute data. I offer two examples: one that demonstrates the influence of skewed data on the local Moran’s I statistic, the other on the global Moran’s I, even under permutation tests.
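To make the permutation idea concrete, here is a minimal sketch of what such a Monte Carlo test does (essentially what spdep's moran.mc(), used below, wraps): the attribute values are repeatedly shuffled across locations, Moran's I is recomputed for each shuffle, and the observed statistic is compared against that simulated distribution. The columbus data and the PLUMB variable are just convenient placeholders that ship with spdep.

library(spdep)
data(columbus)                          # example dataset shipped with spdep
W  <- nb2listw(col.gal.nb, style="W")   # row-standardised spatial weights
x  <- columbus$PLUMB
n  <- length(x)
S0 <- Szero(W)                          # global sum of weights, required by moran()
I.obs <- moran(x, W, n, S0)$I           # observed Moran's I
# Reference distribution: recompute I after randomly permuting values over locations
I.sim <- replicate(999, moran(sample(x), W, n, S0)$I)
# Pseudo p-value: proportion of simulated values at least as large as the observed one
(sum(I.sim >= I.obs) + 1) / (999 + 1)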

I'll use Zhang et al.'s (2008) analysis as the first example. In their paper, they show the influence of the attribute data distribution on the local Moran’s I using permutation tests (9999 simulations). I’ve reproduced the authors’ hotspot results for lead (Pb) concentrations (at the 5% significance level) using the original data (left panel) and a log transformation of that same data (right panel) in GeoDa. Boxplots of the original and log-transformed Pb concentrations are also presented. Here, the number of significant hot spots nearly doubles when the data are transformed; this example shows that the local statistic is sensitive to the attribute data distribution, even when using Monte Carlo techniques!

[Figure: local Moran’s I hotspot maps for the original (left) and log-transformed (right) Pb concentrations, with boxplots of both distributions]
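The Galway soil data aren’t reproduced here, but the same kind of before/after-transformation comparison can be sketched in R. The snippet below is only an illustration on the columbus PLUMB variable (used again in the answer), and it assumes a recent spdep that provides localmoran_perm() for conditional permutation of the local Moran’s I:

library(spdep)
data(columbus)
W <- nb2listw(col.gal.nb, style="W")
# Conditional permutation tests for the local Moran's I (9999 simulations each)
li.raw <- localmoran_perm(columbus$PLUMB,      W, nsim=9999)
li.log <- localmoran_perm(log(columbus$PLUMB), W, nsim=9999)
# Column 5 holds the p-values in current spdep versions; count locations flagged at 5%
sum(li.raw[, 5] < 0.05)
sum(li.log[, 5] < 0.05)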

The second example (simulated data) demonstrates the influence skewed data can have on the global Moran’s I, even when using permutation tests. An example, in R, follows:

library(spdep)
library(maptools)
# Read the North Carolina SIDS polygons and the corresponding neighbour list
NC <- readShapePoly(system.file("etc/shapes/sids.shp", package="spdep")[1],
                    ID="FIPSNO", proj4string=CRS("+proj=longlat +ellps=clrk66"))
rn <- sapply(slot(NC, "polygons"), function(x) slot(x, "ID"))
NB <- read.gal(system.file("etc/weights/ncCR85.gal", package="spdep")[1], region.id=rn)
n  <- length(NB)
set.seed(4956)
x.norm <- rnorm(n)
rho    <- 0.3          # autoregressive parameter
W      <- nb2listw(NB) # Generate spatial weights
# Generate autocorrelated datasets (one normally distributed, the other skewed)
x.norm.auto <- invIrW(W, rho) %*% x.norm # Generate autocorrelated values
x.skew.auto <- exp(x.norm.auto)          # Transform original data to create a 'skewed' version
# Run permutation tests (9999 simulations each)
MCI.norm <- moran.mc(x.norm.auto, listw=W, nsim=9999)
MCI.skew <- moran.mc(x.skew.auto, listw=W, nsim=9999)
# Display p-values
MCI.norm$p.value; MCI.skew$p.value

Note the difference in p-values. The skewed data indicates that there is no clustering at the 5% significance level (p = 0.167), whereas the normally distributed data indicates that there is (p = 0.013).


Zhang, C., Luo, L., Xu, W., & Ledwith, V. (2008). Use of local Moran's I and GIS to identify pollution hotspots of Pb in urban soils of Galway, Ireland. Science of The Total Environment, 398(1–3), 212–221.

Best Answer

(This is just too unwieldy at this point to turn into a comment)

This is in regard to the local and global tests (not a specific, sample-independent measure of autocorrelation). I can appreciate that the Moran's I measure itself is a biased estimate of the correlation (interpreting it in the same terms as a Pearson correlation coefficient), but I still don't see how the permutation hypothesis test is sensitive to the original distribution of the variable (either in terms of type 1 or type 2 errors).

Slightly adapting the code you provided in the comment (the spatial weights object colqueen was missing):

library(spdep)
data(columbus)
attach(columbus)

colqueen <- nb2listw(col.gal.nb, style="W") # weights object was missing in original comment
# Permutation tests (999 simulations) for the raw and log-transformed variable
MC1 <- moran.mc(PLUMB, colqueen, nsim=999)
MC2 <- moran.mc(log(PLUMB), colqueen, nsim=999)
par(mfrow = c(2,2))
hist(PLUMB, main = "Histogram PLUMB")
hist(log(PLUMB), main = "Histogram log(PLUMB)")
plot(MC1, main = "999 perm. PLUMB")
plot(MC2, main = "999 perm. log(PLUMB)")

When one conducts permutation tests (in this instance, I like to think of it as jumbling up space), the hypothesis test of global spatial autocorrelation should not be affected by the distribution of the variable, as the simulated test distribution will in essence change with the distribution of the original variable. Likely one could come up with more interesting simulations to demonstrate this, but as you can see in this example, the observed test statistic is well outside the generated distribution for both the original PLUMB and the logged PLUMB (which is much closer to a normal distribution). You can, however, see that the test distribution under the null for the logged PLUMB shifts closer to symmetry about 0.

[Figure: histograms of PLUMB and log(PLUMB) alongside the 999-permutation null distributions of Moran’s I for each variable]
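To see the same shift numerically rather than only in the plots, one can summarise the simulated values that moran.mc() stores; MC1 and MC2 come from the code block above, and $res holds the simulated statistics with the observed value appended as the last element:

# Simulated null distributions (drop the last element, which is the observed statistic)
summary(MC1$res[1:999])
summary(MC2$res[1:999])
# Observed Moran's I for PLUMB and log(PLUMB)
MC1$statistic; MC2$statistic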

I was going to suggest this as an alternative anyway: transforming the distribution to be approximately normal. I was also going to suggest looking up resources on spatial filtering (and, similarly, the Getis-Ord local and global statistics), although I'm not sure this will help with a scale-free measure either (but it may be fruitful for hypothesis tests). I will post back later with potentially more literature of interest.
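For the Getis-Ord direction mentioned above, a minimal sketch using spdep's globalG.test() and localG(); this is only an illustration on the columbus data used earlier (the G statistics expect binary weights and a variable with a natural origin, such as PLUMB):

library(spdep)
data(columbus)
Wb <- nb2listw(col.gal.nb, style="B")   # binary weights for the Getis-Ord statistics
globalG.test(columbus$PLUMB, Wb)        # global Getis-Ord G test for clustering of high values
localG(columbus$PLUMB, Wb)              # local Getis-Ord G_i statistics (one z-value per location)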