[GIS] Finding correlation between point location and raster value

arcgis-desktopcorrelationrspatial-analyst

I have what I thought was a very simple problem but I think I may have over thought it…

I have a population density raster and a point feature layer showing the locations of X (about 70 points).

All I simply want to do is a spatial correlation; do the instances of X tend to all occur in raster cells of higher pop density? no correlation? etc. purely locational. (The points have many attributes that I could begin to do multivariate regressions on but I don't want to at this stage).

I have tried autocorrelation (Morans etc. in ArcGIS), turning X into a Boolean raster and running band statistics in ArcGIS, Geographically Weighted Regression and I'm now onto Bivariate Spatial Association.

Someone tell me I've had a massive oversight?! Perhaps plotting the raster cell values against a Boolean 1 or 0 of the variable X?

This question is similar but fizzled out while this person looked like they want to test an interpolation. Don't mind answers in Arc and/or R.

Best Answer

If I understand you correctly, you want to test sample variation. That is to say, how well your sample distribution matches your population (raster) distribution. This is not a correlative relationship and is commonly done by comparing the sample mean and variance against the population.

Here is an example where I calculate mean, variance and quantiles for a random sample and the raster. I also plot the sample and population distributions.

First, for an example, we create a raster and take a random sample. I am simulating a very skewed distribution for the raster to add potential bias to the sample variation.

require(raster)
  sigma = 0.6; mu = 2
r <- raster(ncol=500, nrow=500)
  r[] <- dlnorm(seq(-10, 60, length.out = ncell(r)), mu, sigma) * 100

n <- 20 
  s <- sampleRandom(r, n, na.rm=TRUE, sp=TRUE) 
    s@data <- data.frame(s@data, r=extract(r, s))
      head(s@data)

plot(r)
  plot(s, pch=20, add=TRUE, col="black")

We can now calculate distributional moments for the sample and population.

# raster (population) mean, varaince and quantiles 
( rmean <- cellStats(r, stat="mean", na.rm=TRUE, asSample=FALSE) )  
( rvar <- cellStats(r, stat="sd", na.rm=TRUE, asSample=FALSE)^2 )     
( rquant <- quantile(r, probs = c(0.25, 0.50, 0.75)) )

# random points (sample) mean, varaince and quantiles 
( smean <- mean(s@data[,"r"]) )
( svar <- var(s@data[,"r"]) )
( squant <- quantile(s@data[,"r"], probs = c(0.25, 0.50, 0.75)) )

We can also plot the population and sample distributions. To test the null hypothesis that x and y were drawn from the same continuous distribution, by comparing distributions, we could perform a nonparametric two-sample Kolmogorov-Smirnov test using the ks.test function.

# Plot sample and population distributions
par(mfrow=c(2,1))
  density(r, xlim=c(cellStats(r, "min"),cellStats(r, "max")), 
          main="Population distribution")
  plot(density(s@data[,"r"]), xlim=c(cellStats(r, "min"),cellStats(r, "max")),
       main="sample distribution")

A two-sample student t-test can easily be calculated by coercing the raster into a vector however, this is not memory safe and can be dangerous. I would recommend taking a very large random sample to represent the population.

r <- as.vector(as.matrix(r))
  r <- na.omit(r)      
    t.test(s@data[,"r"], y = r)  

Because it sounds like you have a binominal response, you really should test the sample variation in an ANOVA or MANOVA framework so you can make sure that you have not only captured the population variation but also that the problem is balanced (variation equally split between [0,1]).

You may want to seriously consider a Bootstrap or Monte Carlo approach to identify bias. Since you are using a density raster, bandwidth and Kernel distribution can radically effect the estimates. You can quantify this uncertainty with these two methods.

Related Question