Solved – Test for Complete Spatial Randomness taking into account background distribution

rspatial

I have a set of points on a grid which is a subset of a larger background set of points on a grid. I would like to test if the subset points are randomly distributed (or not). If it helps, here is the real-world example I am working on:

http://funcgen.vectorbase.org/expression-map/Anopheles_gambiae/VB-2013-12/?search1=GO%3A0005549%0A%0A

The length of the yellow sausages is proportional to the number of genes in the subset, while the area of the grey circles is proportional to the number of genes at each grid point belonging to the total/background set of genes. In this case the yellow genes are clearly non-randomly distributed. The spatial distribution is important, I believe… imagine just three yellow genes "landing" in three neighbouring grid squares – I'm pretty sure that considering each grid point in isolation there is nothing special about the sampling – however the fact that the three genes' grid nodes are so close on the grid of 500 nodes is special/interesting, right?

I think the R spatstat package can help me here but I am not sure how to take into account the "background distribution".

Here's what I have managed so far – with dummy random data:

library(spatstat)

# here is the "background" distribution of 100 points on a discrete 10x5 grid
# (yes there are duplicate points at the same coordinates)
bg = ppp(sample(0:9, 100, replace=T), sample(0:4, 100, replace=T), window=owin(c(0,9),c(0,4)))

# if you want to visualise the density of points:
# make a bitmap image version
bgim = as.im(bg, dimyx=c(5,10))
# plot(bgim)

# now make a random subset of the points
subset = bg[sample(1:100, 10, replace=F)]

# plot bg with plus signs
plot(bg, pch=3)
# add the fg points 
plot(subset, add=T)

So I can do test for CSR as follows

mad.test(subset)

but I don't know how to include the background distribution. I've tried fitting models with ppm but am out of my depth. Can anyone help please?

If possible I'd like the test to be fast (1-2s max on subsets up to a few thousand genes out of the total of 12,500 genes).

Writing this maybe I have a rough idea how to tackle this… Should I load the entire set of genes into a ppp (point pattern) and a "mark" the subset with a label (e.g. selected/unselected)? Then maybe some of the simple tests will then work straight out of the box? Think of it like a distribution of trees (the forest kind), and some are marked as infected with a disease. You would want to know if the disease is localised with respect to the background distribution of trees.

Update: my solution thanks to Andy W.

Pass an expression which generates random subsets of the background point set (the same size as the subset of interest of course)

result = mad.test(subset,
           simulate=expression(bg[sample(1:bg$n,subset$n,replace=F)]),
           nsim=1000)

Best Answer

One approach I have done in the past in the spatstat package is to create a set of simulations from the universe based on sampling with replacement (my work a point can happen repeatedly at the same location, e.g. crimes at an address) from that universe point pattern. Then you can use these samples as a reference distribution for whatever test you are interested in.

Here is a function to create those sub-samples (simply change the sub_universe line to sample without replacement if that is how you want the simulations to be drawn). (I wrote this 3 years ago it appears, and I'm sure it can be improved for computation time.)

#My awful function to generate simulation envelopes of a spatstat object given the universe
ppp_lists <- function(universe_x, universe_y, sub_ppp, nlist) {
require(spatstat)
myppp_list <- c() #make empty list
universe_xy <- data.frame(x = universe_x, y = universe_y) #make dataframe of X & Y objects to sample from
sampsize <- sub_ppp$n
    for (i in 1:nlist) {
             sub_universe <- universe_xy[sample(nrow(universe_xy),size=sampsize,replace = TRUE),] #sampling with replacement from that dataframe.
             current_ppp <-  ppp(sub_universe$x, sub_universe$y, window =  sub_ppp$window)   #making that into a ppp object 
                                                                                         #(taking window from subsample ppp object)
         myppp_list[[i]] <- current_ppp                                                  #appending that object to a list
}
         return(myppp_list)
}

Now, with that function generates a list that can be supplied to the envelope function as the simulation bands. Here is an example of passing the list using the simulate argument to mad.test:

#Now making simulation evelopes based on universe (warnings are for duplicate points)
SimEvel <- ppp_lists(universe_x = bg$x, universe_y = bg$y, sub_ppp = subset, nlist = 99)

#Now can use the user supplied envelopes for the mad.test
mytest <- mad.test(subset, simulate=SimEvel)
mytest

Any test that uses calculations for the density will be off by some constant here (from a set of finite points it is not 100% clear to me how you should define area). But the simulation envelopes should be fine for hypothesis testing.

There are other functions in the spatstat package for binary marked events like disease infection, but those aren't directly applicable here.

Another approach might be to turn the window into a raster image where the valid locations are only the very small defined pixels where points can occur. Then all the usual functions that take a window will work (I am not very familiar with mad.test, so I can't say if it will be applicable for this test). The various tests in the package will become more tedious for the more points, but generating the simulations shouldn't be too expensive.

Related Question