Solved – Test for Complete Spatial Randomness taking into account background distribution

r, spatial

I have a set of points on a grid which is a subset of a larger background set of points on a grid. I would like to test if the subset points are randomly distributed (or not). If it helps, here is the real-world example I am working on:

http://funcgen.vectorbase.org/expression-map/Anopheles_gambiae/VB-2013-12/?search1=GO%3A0005549%0A%0A

The length of the yellow sausages is proportional to the number of genes in the subset, while the area of the grey circles is proportional to the number of genes at each grid point belonging to the total/background set of genes. In this case the yellow genes are clearly non-randomly distributed. The spatial distribution is important, I believe… imagine just three yellow genes "landing" in three neighbouring grid squares – I'm pretty sure that considering each grid point in isolation there is nothing special about the sampling – however the fact that the three genes' grid nodes are so close on the grid of 500 nodes is special/interesting, right?

I think the R spatstat package can help me here but I am not sure how to take into account the "background distribution".

Here's what I have managed so far – with dummy random data:

library(spatstat)

# here is the "background" distribution of 100 points on a discrete 10x5 grid
# (yes there are duplicate points at the same coordinates)
bg = ppp(sample(0:9, 100, replace=T), sample(0:4, 100, replace=T), window=owin(c(0,9),c(0,4)))

# if you want to visualise the density of points:
# make a bitmap image version
bgim = as.im(bg, dimyx=c(5,10))
# plot(bgim)

# now make a random subset of the points
subset = bg[sample(1:100, 10, replace=F)]

# plot bg with plus signs
plot(bg, pch=3)
# add the fg points 
plot(subset, add=T)

So I can test for CSR as follows:

mad.test(subset)

but I don't know how to include the background distribution. I've tried fitting models with ppm but am out of my depth. Can anyone help please?

If possible I'd like the test to be fast (1-2s max on subsets up to a few thousand genes out of the total of 12,500 genes).

Writing this, maybe I have a rough idea how to tackle this… Should I load the entire set of genes into a ppp (point pattern) and "mark" the subset with a label (e.g. selected/unselected)? Then maybe some of the simple tests will work straight out of the box? Think of it like a distribution of trees (the forest kind), and some are marked as infected with a disease. You would want to know if the disease is localised with respect to the background distribution of trees.

Update: my solution thanks to Andy W.

Pass an expression which generates random subsets of the background point set (the same size as the subset of interest of course)

result = mad.test(subset,
           simulate=expression(bg[sample(1:bg$n,subset$n,replace=F)]),
           nsim=1000)

Best Answer

One approach I have used in the past with the spatstat package is to create a set of simulations from the universe by sampling with replacement from that universe point pattern (in my work a point can occur repeatedly at the same location, e.g. crimes at an address). Then you can use these samples as a reference distribution for whatever test you are interested in.

Here is a function to create those sub-samples (simply change the sub_universe line to sample without replacement if that is how you want the simulations to be drawn). (I wrote this 3 years ago it appears, and I'm sure it can be improved for computation time.)

#My awful function to generate simulated point patterns for a spatstat object given the universe
ppp_lists <- function(universe_x, universe_y, sub_ppp, nlist) {
  require(spatstat)
  myppp_list <- vector("list", nlist) #pre-allocate the list of simulated patterns
  universe_xy <- data.frame(x = universe_x, y = universe_y) #data frame of X & Y coordinates to sample from
  sampsize <- sub_ppp$n #each simulated pattern has as many points as the observed subset
  for (i in seq_len(nlist)) {
    #sampling with replacement from that data frame
    sub_universe <- universe_xy[sample(nrow(universe_xy), size = sampsize, replace = TRUE), ]
    #making that into a ppp object (taking the window from the subsample ppp object)
    #and storing it in the list
    myppp_list[[i]] <- ppp(sub_universe$x, sub_universe$y, window = sub_ppp$window)
  }
  return(myppp_list)
}

That function generates a list of simulated point patterns that can be supplied to the envelope function to build the simulation envelopes. Here is an example of passing the list via the simulate argument to mad.test:

#Now making simulation envelopes based on universe (warnings are for duplicate points)
SimEvel <- ppp_lists(universe_x = bg$x, universe_y = bg$y, sub_ppp = subset, nlist = 99)

#Now can use the user supplied envelopes for the mad.test
mytest <- mad.test(subset, simulate=SimEvel)
mytest

Any test that uses calculations for the density will be off by some constant here (from a set of finite points it is not 100% clear to me how you should define area). But the simulation envelopes should be fine for hypothesis testing.
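To make the logic of the reference distribution explicit, here is a minimal sketch in base R that does not use spatstat at all. It is not the mad.test procedure: the statistic (mean pairwise distance) is an arbitrary stand-in chosen for illustration, and the names bg_xy, mean_pair_dist and mc_test are hypothetical. The idea is the same as above, namely to compare the observed subset with random subsets of the same size drawn from the background points.

```r
set.seed(1)

# Hypothetical dummy data: 100 background points on a 10 x 5 grid (duplicates allowed)
bg_xy <- data.frame(x = sample(0:9, 100, replace = TRUE),
                    y = sample(0:4, 100, replace = TRUE))

# Illustrative statistic (an assumption, not what mad.test uses):
# the mean distance between all pairs of points in a pattern
mean_pair_dist <- function(xy) mean(as.vector(dist(xy)))

# Monte Carlo test: rank the observed statistic among the statistics of
# random subsets of the same size drawn from the background
mc_test <- function(bg_xy, idx, stat, nsim = 999, replace = FALSE) {
  observed <- stat(bg_xy[idx, ])
  simulated <- replicate(nsim, {
    stat(bg_xy[sample(nrow(bg_xy), length(idx), replace = replace), ])
  })
  # One-sided p-value: small mean distances point towards clustering
  p_value <- (1 + sum(simulated <= observed)) / (nsim + 1)
  list(observed = observed, simulated = simulated, p_value = p_value)
}

# A subset drawn at random from the background
random_idx <- sample(nrow(bg_xy), 10)
res_random <- mc_test(bg_xy, random_idx, mean_pair_dist)

# A deliberately clustered subset: the 10 background points nearest the origin
clustered_idx <- order(bg_xy$x^2 + bg_xy$y^2)[1:10]
res_clustered <- mc_test(bg_xy, clustered_idx, mean_pair_dist)

res_random$p_value
res_clustered$p_value
```

Because both the observed statistic and the simulated ones are computed on the same background, any constant factor from the definition of area cancels out of the comparison. Set replace = TRUE if points should be resampled with replacement.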

There are other functions in the spatstat package for binary marked events like disease infection, but those aren't directly applicable here.

Another approach might be to turn the window into a raster image where the valid locations are only the very small defined pixels where points can occur. Then all the usual functions that take a window will work (I am not very familiar with mad.test, so I can't say if it will be applicable for this test). The various tests in the package will become slower as the number of points grows, but generating the simulations shouldn't be too expensive.

Related Solutions

Solved – Calculating the probability of gene list overlap between an RNA seq and a ChIP-chip data set

You are close, with your use of dhyper and phyper, but I don't understand where 0:2 and -1:2 are coming from.

The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.

sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)

These give 0.0078.

dhyper(x, m, n, k) gives the probability of drawing exactly x. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.

phyper(x, m, n, k) gives the probability of getting x or fewer, so phyper(x, m, n, k) is the same as sum(dhyper(0:x, m, n, k)).

The lower.tail=FALSE is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE) is the same as 1-phyper(x, m, n, k), and so is the probability of x+1 or more. [I never remember this and so always have to double check.]
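The identities described above can be checked numerically. This is just a sanity-check sketch using the urn numbers from this answer; the variable names are chosen for illustration only.

```r
m <- 3000   # white balls in the urn
n <- 12000  # black balls in the urn
k <- 400    # sample size
x <- 99

# phyper(x, ...) is the cumulative sum of dhyper(0:x, ...)
lower_cdf <- phyper(x, m, n, k)
lower_sum <- sum(dhyper(0:x, m, n, k))

# lower.tail = FALSE gives P(X > x), i.e. P(X >= x + 1)
upper_cdf <- phyper(x, m, n, k, lower.tail = FALSE)
upper_sum <- sum(dhyper((x + 1):k, m, n, k))

all.equal(lower_cdf, lower_sum)
all.equal(upper_cdf, 1 - lower_cdf)
all.equal(upper_cdf, upper_sum)
```

Each all.equal call compares two ways of computing the same probability up to floating-point tolerance.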

At that stattrek.com site, you want to look at the last row, "Cumulative Probability: P(X $\ge$ 100)," rather than the first row "Hypergeometric Probability: P(X = 100)."

Any particular number that you draw is going to have small probability (in fact, max(dhyper(0:400, 3000, 12000, 400)) gives $\sim$0.050), and getting 101 or 102 or any larger number is even more interesting than 100. The p-value is the probability, if the null hypothesis were true, of getting a result as interesting as or more interesting than what was observed.

[Figure: the hypergeometric distribution in this case.] You can see that it's centered at 80 (20% of 400) and that 100 is pretty far out in the right tail.
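A plot of this distribution can be drawn with base R along the following lines. This is a sketch, not the code behind the original picture; the axis limits and labels are arbitrary choices.

```r
x <- 0:400
p <- dhyper(x, m = 3000, n = 12000, k = 400)

# Location of the distribution: its mode and its mean
mode_x <- x[which.max(p)]
mean_x <- sum(x * p)

# Draw the probability mass function, with a dashed line at the observed 100
if (interactive()) {
  plot(x, p, type = "h", xlim = c(40, 130),
       xlab = "Number of white balls in the sample", ylab = "Probability")
  abline(v = 100, lty = 2)
}
```

The plotting call is guarded by interactive() so that the block can also be sourced from a script without opening a graphics device.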

Solved – Using kriging with very sparse data

From ten points you are going to have 45 (10*(10-1)/2) points in your variogram cloud from the distances between each pair of points. Once the system has binned that, or even without binning, it's going to be dominated by noise, I reckon. Get a plot of the variogram cloud to see what I mean.
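To see where the 45 comes from, here is a small base R sketch that builds a variogram cloud by hand. The coordinates and values are made-up random numbers, not the data from the question, and a dedicated geostatistics package would normally do this for you.

```r
set.seed(42)
n <- 10

# Hypothetical locations and measurements
coords <- cbind(x = runif(n), y = runif(n))
z <- rnorm(n)

# One entry per pair of points: separation distance and semivariance
h <- as.vector(dist(coords))
gamma <- 0.5 * as.vector(dist(z))^2

length(h)  # n * (n - 1) / 2 pairs

if (interactive()) {
  plot(h, gamma, xlab = "Distance between pair", ylab = "Semivariance")
}
```

Both calls to dist() enumerate the pairs in the same order, so h[i] and gamma[i] refer to the same pair of points.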

If autokrige can't fit a nice smooth variogram then it will do what it did, and just go 'heck, I can't work out the correlation with distance with just 10 points, my best guess is just the mean'. It really can't do better.

If you want something to look 'realistic', then you could feed it variogram parameters with a bigger range, that would over-smooth the output. But then you may as well just do inverse-distance weighting if all you want is a pretty picture. The advantage of kriging is that it is realistic. But it rejects your reality and replaces it with its own...
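For reference, inverse-distance weighting is simple enough to write by hand. This is a minimal sketch; the function name idw_predict and the default power of 2 are assumptions made for illustration, and the four data points are made up.

```r
# Predict the value at (x0, y0) as a weighted mean of the observed z,
# with weights proportional to 1 / distance^power
idw_predict <- function(x, y, z, x0, y0, power = 2) {
  d <- sqrt((x - x0)^2 + (y - y0)^2)
  # At an observed location, return the observed value directly
  if (any(d == 0)) return(z[which(d == 0)[1]])
  w <- 1 / d^power
  sum(w * z) / sum(w)
}

# Four hypothetical observations on the corners of a unit square
x <- c(0, 1, 0, 1)
y <- c(0, 0, 1, 1)
z <- c(1, 2, 3, 4)

centre <- idw_predict(x, y, z, 0.5, 0.5)  # equidistant from all corners
corner <- idw_predict(x, y, z, 0, 0)      # coincides with the first observation
```

Unlike kriging, this makes no attempt to estimate how correlation decays with distance, so the power parameter is purely a smoothing choice.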

Suggestions:

Get a plot of the variogram cloud
Get more data :)
Look into bivariate kriging for your case with the two different data sets. I think the theory exists, there may even be code for it...
