[GIS] Simulation envelopes and significance levels

Tags: r, simulation, spatial-statistics

Many spatial analysis packages provide Monte Carlo techniques to simulate upper and lower “envelopes” for a summary statistic as a function of distance (e.g. the K-function, the nearest neighbor distance function, etc.). I’ve sometimes seen the significance of a deviation from the envelope expressed as:

p = m / (n + 1)

where n is the number of simulations and m is the rank, counted in from the extremes, of the simulated values used to draw the envelopes at each distance r. For example, if m = 1, the largest and smallest values from the simulation sample are used to plot the upper and lower simulation envelopes; if m = 2, the second largest and second smallest values are used.
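
For concreteness, here is a minimal numeric sketch (base R) of what that formula implies for a few choices of m and n, reading p = m / (n + 1) as a per-tail (one-sided) level; envelope_p is just a throwaway name for illustration, not a function from any package.

    # Nominal per-tail significance level implied by the formula above.
    envelope_p <- function(m, n) m / (n + 1)

    envelope_p(m = 1, n = 39)    # 0.025 per tail, i.e. about 0.05 two-sided
    envelope_p(m = 2, n = 39)    # 0.05  per tail
    envelope_p(m = 1, n = 9999)  # 1e-04 per tail
    envelope_p(m = 2, n = 9999)  # 2e-04 per tail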

My questions are:

1/ Where does the +1 in the denominator originate? For instance, if we run 39 simulations, why divide by 40?

2/ Regardless of the number of simulations performed (i.e. 39 or 9999), m seems to have a disproportionate influence on the computed significance level. It would seem that taking the 2nd highest and 2nd lowest values from a sample of 9999 simulations would have less of an impact on the width of the simulation envelope than taking them from 39 simulations. I’m sure there is a sound theoretical basis for this, but its logic escapes me. Is there an analogy in inferential (non-spatial) statistics that can help make this a bit more intuitive?

3/ How should one present the results of a simulation envelope? I sometimes see a p value reported for a simulation, but how is one to know whether the reported p corresponds to m = 1 or to a larger m such as m = 3?

Edit: Per whuber's comment, here is some clarification:

1/ The type of simulation pertains to testing a null hypothesis such as complete spatial randomness (CSR).

2/ An example of a package that computes simulation envelopes is spatstat (R). The function is called envelope, and nrank is the parameter m in the equation above.
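
To make that concrete, here is a minimal sketch of what such a call might look like, using spatstat's built-in cells point pattern as a stand-in for real data; check the arguments against the spatstat documentation for your version.

    library(spatstat)

    # Pointwise simulation envelopes for the K-function under CSR.
    # With nsim = 39 and nrank = 1 (i.e. m = 1), the nominal level is
    # 1 / (39 + 1) = 0.025 per tail, or roughly 0.05 two-sided.
    E <- envelope(cells, fun = Kest, nsim = 39, nrank = 1)
    plot(E, main = "K-function with pointwise CSR simulation envelopes")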

Best Answer

  1. The +1 is a convention. It's all about converting the ranks to percentiles. Consider 99 iterations. The rank will run from 1 through 99 (in whole steps). You can convert the rank to a percentile by dividing by 99 and multiplying by 100. That would produce percentiles from 100/99 = 1.01% to 99*100/99 = 100%. That lacks a desirable symmetry: you're saying the lowest value is 1.01% from the bottom but the highest value is right smack at the top of the range. To restore that symmetry, you should shift the ranks down by 0.5: the lowest value gets assigned a percentile of (1-0.5)*100/99 = 0.50% and the highest value gets a percentile of (99-0.5)*100/99 = 99.50% = 100 - 0.50%. Everything becomes nice and symmetric. You can avoid adjusting the ranks simply by adding 1 to the denominator. Now the percentiles in the example range from 1*100/100 = 1% up to 99% in even, symmetrical, 1% steps. (I illustrate this, and discuss it more generally, at http://www.quantdec.com/envstats/notes/class_02/characterizing_distributions.htm : look towards the bottom of that page.)

  2. The purpose of the simulation is to compute what is likely to happen by chance alone (the "null distribution"). For instance, we can ask whether Joe Paterno's football bowl record at Penn State of 24 wins in 37 attempts is just the result of blind luck, no better (or worse) than if each game had been decided by a fair coin flip. To do so, we might actually flip a coin 37 times, count the heads (to represent wins), and repeat this for a total of 99 iterations (a short sketch of this simulation appears after this list). We would find that very few of those 99 attempts achieved 24 or more heads--most likely 5 or fewer. That's pretty good evidence that over time his teams were better than the bowl competition, indicating they tended to be under-rated. (99 is too small a number to use here, really: I actually used 100,000, in which I observed 4972 occasions of 24 or more heads.)

    The one aspect of this example that you control is the weight of evidence: is a 5/100 chance enough to convince you the result was not due to randomness (or luck)? Depending on the circumstance, some people need stronger evidence, others weaker. That's the role of m. When you draw the envelopes at the most extreme values out of (say) 99 iterations, you are estimating the lowest 1% and the highest 1%. You would guess that random fluctuations would place the curve between these envelopes 100 - 1 - 1 = 98% of the time. This corresponds (very roughly, because 99 iterations is still too few) to a "two-sided p-value of 0.02." If you don't need such strong evidence, you might choose (say) m = 3. Now the lower envelope represents the bottom 3/99 = 3.03% of the chance distribution and the upper envelope represents the top 3.03% of that distribution. Your two-sided p-value is around 6%. (Because 99 is so small, the true p-value could be as large as 15% or so, so beware! Do many more Monte Carlo iterations if you possibly can.)

  3. In some sense you're trying to depict a distribution of envelopes (curves). That's a complicated thing. One way is to choose some percentiles in advance, such as 1% (and therefore, by symmetry, 99%), 5% (and 95%), 10% (and 90%), 25% (and 75%). Out of 99 simulated curves these would correspond roughly to the lowest (and highest), the fifth lowest (and fifth highest), etc. (Comparing curves that can intersect each other in this way is problematic, though: when they cross, which is more extreme than the other?) Plotting those selected curves at least gives one a visual idea of the spread that is produced by chance mechanisms alone. (A rough sketch of a pointwise version of this idea follows the list.)
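
To make the coin-flip illustration in point 2 concrete, here is a minimal sketch in base R; the exact counts will vary from run to run, and set.seed is used only to make the sketch reproducible.

    # Null distribution for the bowl record: how often does a fair coin
    # produce 24 or more heads ("wins") in 37 flips ("games")?
    set.seed(1)
    n_iter <- 100000
    wins <- rbinom(n_iter, size = 37, prob = 0.5)  # simulated win totals
    sum(wins >= 24)    # count of chance "seasons" with 24+ wins
    mean(wins >= 24)   # proportion, roughly 0.05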
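
And to make point 3 concrete, here is a rough sketch of a pointwise variant of that idea: rather than ranking whole curves (which, as noted, is ambiguous when they cross), it takes selected quantiles across the simulations separately at each distance. The toy sim_curves matrix is a made-up stand-in; in practice you would substitute the curves from your own Monte Carlo runs.

    # Toy stand-in: 99 simulated summary curves evaluated at 50 distances
    # (one row per simulation, one column per distance).
    set.seed(2)
    n_sim <- 99
    r <- seq(0, 1, length.out = 50)
    sim_curves <- t(replicate(n_sim, cumsum(abs(rnorm(length(r), sd = 0.1)))))

    # Pointwise quantiles across the simulations at each distance.
    probs <- c(0.01, 0.05, 0.10, 0.25, 0.75, 0.90, 0.95, 0.99)
    q <- apply(sim_curves, 2, quantile, probs = probs)

    # One line per selected quantile, plotted against distance.
    matplot(r, t(q), type = "l", lty = 1,
            xlab = "distance r", ylab = "summary statistic",
            main = "Pointwise quantiles of 99 simulated curves")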

I hope you get the sense that this approach of generating a small bunch of curves (under the null hypothesis) and drawing a selected few is exploratory, approximate, and somewhat crude, because it is all of those. But it's far better than just guessing whether or not your data could have arisen by chance.
