If I understand you correctly, you want to test sample variation; that is to say, how well your sample distribution matches your population (raster) distribution. This is not a correlative relationship, and it is commonly assessed by comparing the sample mean and variance against those of the population.
Here is an example where I calculate mean, variance and quantiles for a random sample and the raster. I also plot the sample and population distributions.
First, for an example, we create a raster and take a random sample. I am simulating a very skewed distribution for the raster to add potential bias to the sample variation.
# Simulate a skewed raster to act as the population
require(raster)
sigma = 0.6; mu = 2
r <- raster(ncol=500, nrow=500)
r[] <- dlnorm(seq(-10, 60, length.out = ncell(r)), mu, sigma) * 100
# Take a random sample of n points and attach the raster values
n <- 20
s <- sampleRandom(r, n, na.rm=TRUE, sp=TRUE)
s@data <- data.frame(s@data, r=extract(r, s))
head(s@data)
# Plot the raster with the sample locations
plot(r)
plot(s, pch=20, add=TRUE, col="black")
We can now calculate distributional moments for the sample and population.
# raster (population) mean, variance and quantiles
( rmean <- cellStats(r, stat="mean", na.rm=TRUE, asSample=FALSE) )
( rvar <- cellStats(r, stat="sd", na.rm=TRUE, asSample=FALSE)^2 )
( rquant <- quantile(r, probs = c(0.25, 0.50, 0.75)) )
# random points (sample) mean, variance and quantiles
( smean <- mean(s@data[,"r"]) )
( svar <- var(s@data[,"r"]) )
( squant <- quantile(s@data[,"r"], probs = c(0.25, 0.50, 0.75)) )
We can also plot the population and sample distributions. To test the null hypothesis that the sample and the population were drawn from the same continuous distribution, we could perform a nonparametric two-sample Kolmogorov-Smirnov test using the ks.test function.
# Plot sample and population distributions
par(mfrow=c(2,1))
density(r, xlim=c(cellStats(r, "min"),cellStats(r, "max")),
main="Population distribution")
plot(density(s@data[,"r"]), xlim=c(cellStats(r, "min"),cellStats(r, "max")),
main="sample distribution")
A two-sample Student's t-test can easily be calculated by coercing the raster into a vector; however, this is not memory safe and can be dangerous with large rasters. I would recommend taking a very large random sample to represent the population instead.
# Coerce the raster to a vector (only advisable for small rasters)
r <- as.vector(as.matrix(r))
r <- na.omit(r)
t.test(s@data[,"r"], y = r)
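The same in-memory population vector can be passed to the Kolmogorov-Smirnov test mentioned above; a minimal sketch:
# Two-sample KS test of the sample values against the population values
ks.test(s@data[,"r"], y = r)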
Because it sounds like you have a binomial response, you really should test the sample variation in an ANOVA or MANOVA framework so you can make sure that you have not only captured the population variation but also that the problem is balanced (variation equally split between [0,1]).
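As a rough sketch, assuming a hypothetical 0/1 column presence attached to the sample points (simulated here purely for illustration), a one-way ANOVA tests whether the sampled raster values differ between the two classes:
# Hypothetical 0/1 response on the sample points, simulated for illustration only
s@data$presence <- rbinom(nrow(s@data), 1, 0.5)
summary(aov(r ~ factor(presence), data = s@data))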
You may want to seriously consider a bootstrap or Monte Carlo approach to identify bias. Since you are using a density raster, the bandwidth and kernel distribution can radically affect the estimates, and you can quantify this uncertainty with these two methods.
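A minimal Monte Carlo sketch along those lines (run while r is still a RasterLayer, i.e., before the coercion to a vector above): draw many samples of size n and look at how the sample means scatter around the population mean.
# Repeatedly sample the raster and compare sample means to the population mean
reps <- 1000
mc.means <- replicate(reps, mean(sampleRandom(r, n, na.rm=TRUE)))
hist(mc.means, main="Monte Carlo distribution of sample means")
abline(v=cellStats(r, "mean"), col="red")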
For reference, here is an approach, in memory, that uses mapply. This is one of the lesser-known apply functions in R that lets you apply a function across objects. Here, when I create objects from the raster stacks, I wrap the stack function call in values and list. This results in list objects containing matrices that hold the raster values, which makes the data ready for mapply, since it expects list objects. The use of sapply allows me to use a numeric row index, 1:nrow(s1[[1]]), to aggregate the row-by-row correlations. This yields a vector of correlations matching the rows of each list's matrix; the vector is ordered to the cells in the source rasters and can be assigned directly back into a source raster.
library(raster)
r <- raster(system.file("external/test.grd", package="raster"))
# Wrap the stack values in list() so mapply receives list objects
s1 <- list(values( stack(r, r*0.15, r/1805.78) ))
s2 <- list(values( stack(r^2, r/0.87*10, r/sum(values(r),na.rm=T)) ))
# Row-by-row (cell-by-cell) correlation between the two stacks,
# written straight back into the source raster
r[] <- as.vector(mapply(function(X,Y) { sapply(1:nrow(s1[[1]]), function(row)
cor(X[row,], Y[row,]))}, X=s1, Y=s2))
plot(r)
Best Answer
Firstly, what you have is a regression problem - how (if at all) does X depend on Y.
Secondly, you have a Poisson regression problem - X is a count, and a fairly small count at that (0, 1, 2, 3).
Thirdly, you have two covariates or explanatory variables - population density and distance to commercial area.
Fourthly, your data are spatial, so you cannot assume that the observations are independent. This affects the inference on the parameter estimates (generally the standard errors will be smaller than they should be).
So, initially you should plug all those into a generalised linear model using the glm function. Then map the residuals at the data points and see if they look spatially correlated.
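A minimal sketch of that first step, assuming a hypothetical data.frame dat holding the counts, the two covariates and the point coordinates (all column names here are placeholders):
# Hypothetical columns: count (the 0-3 response), popdens, dist.comm, x, y
fit <- glm(count ~ popdens + dist.comm, data = dat, family = poisson)
summary(fit)
# Map the deviance residuals at the data points to eyeball spatial structure;
# point size scaled by the absolute residual, colour by its sign
res <- residuals(fit, type = "deviance")
plot(dat$x, dat$y, cex = abs(res) + 0.5, pch = 20, col = ifelse(res > 0, "red", "blue"),
main = "Deviance residuals at data points")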