[GIS] Stratified Random Sampling with large dataset in ArcGIS

arcgis-10.2, arcgis-desktop, dissolve, large-datasets, merge

I want to create a Stratified Random Sample with a land cover classification map. The purpose of the Stratified Random Sample is to produce an Accuracy Assessment. I am using ArcGIS 10.2 with access to Spatial Analyst and all other tools.

I have six land cover types and I want 300 random points in each of the three classes I am interested in. My first step was converting my raster to polygon. The result gives me over 3 million polygons to work with.

I have tried the Dissolve tool and making multi-part polygons, but some of my classes are too large, which results in the error message "Topology Invalid (Out of Memory)".

I have tried using the Merge tool on the Editor toolbar. Again, because of the large number of polygons in some of my classes, it freezes and crashes ArcMap.

I would like to do the stratified random sample on a raster, rather than converting it to polygon form. Is this possible?

Are there any other tools/programs that I should use that may help combat my large dataset issue? I have consistently run into problems because I simply have too many polygons per land cover type.

Best Answer

Stay in raster format.

If ArcGIS provided a zonal percentile function, you could do this in a couple of easy steps. Since it doesn't, you have to work around that limitation. There are many ways to proceed. The key ideas are:

  1. Determine the proportion of cells in each class represented by the desired sample of 300.

  2. By thresholding a grid of uniform random values, you can identify slightly more than the number of points needed in each class.

  3. Extract these points, then post-process them by retaining only those having the lowest random values within each class.

This procedure obtains a simple random sample, without replacement and independently, of each land class.


Here are some details. Let there be N classes (N = 3 in the question), each with n(i) cells (i = 1, 2, ..., N). Suppose you intend to select k(i) cells at random from class i (k(i) = 300 for the class indexes i = 1, 2, 3 in the question). You will do that by generating a uniform random grid and comparing it to a threshold grid [P], which has the constant value p(i) at each cell in class i, with p(i) somewhere (to be chosen appropriately) between 0 and 1.

The result of this comparison in class i is a random count X(i) having a Binomial(n(i), p(i)) distribution. Its expectation therefore is m(i) = n(i)*p(i), which needs to be close to k(i). However, X(i) can randomly fall below that. Its variance is v(i) = n(i)*p(i)*(1-p(i)). To be really sure that X(i) will be at least k(i), we need a very small lower percentile of the distribution of X(i) to exceed k(i). You can work out what the resulting value of p(i) has to be by means of a statistical calculator, but an effective rule of thumb (for moderately small numbers of classes N) is to ensure that k(i) lies at least several standard deviations below m(i). Letting a reasonable starting value for "several" be L, we obtain the quadratic inequality

k(i) <= m(i) - L*sqrt(n(i)*p(i)*(1-p(i)))

whose approximate solution is

p(i) = (k(i)/n(i)) * (1 + z(i) + z(i)^2/2)

where

z(i) = L/sqrt(k(i)).

In the question, k(i) = 300 and it looks like the n(i) are in the millions. Taking L = 3, say, gives z(i) = 3 / sqrt(300), which is about 1/6 (and z(i)^2/2 is small enough to neglect). This tells us to make p(i) about 1/6 larger than the desired proportion k(i)/n(i); that is, to ask for about 1/6 more candidate points than we ultimately need.
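
As a quick sanity check, here is a minimal Python sketch of that calculation; the n(i) cell counts below are made-up placeholders, not values from the question:

```python
# Minimal sketch of the p(i) calculation; the n(i) cell counts are made-up examples.
from math import sqrt

L = 3                                   # "several" standard deviations of safety
k = 300                                 # desired sample size per class
n = {1: 2.4e6, 2: 3.1e6, 3: 1.8e6}      # hypothetical cell counts per class

z = L / sqrt(k)                         # z(i) = L / sqrt(k(i)), about 1/6 here
for i, n_i in sorted(n.items()):
    p_i = (k / n_i) * (1 + z + z ** 2 / 2)
    print("class %d: p(i) = %.6f, expecting about %.0f candidate cells"
          % (i, p_i, p_i * n_i))
```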

With this (simple) calculation out of the way,

  • Reclassify the land classes i into the values of the p(i) you computed. Call the resulting grid [P].

  • Create a uniform random grid [R].

  • Set the values of [R] to null wherever [R] > [P].
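
A possible arcpy (Spatial Analyst) rendering of these three steps is sketched below; the workspace, raster name, and p(i) values are assumptions, and this is outline code rather than a finished script:

```python
# Sketch of the three raster steps; workspace, raster name and p(i) are assumptions.
import arcpy
from arcpy.sa import Raster, Con, IsNull, CreateRandomRaster, SetNull

arcpy.CheckOutExtension("Spatial")
arcpy.env.workspace = r"C:\work\sampling.gdb"     # hypothetical workspace
arcpy.env.overwriteOutput = True

landcover = Raster("landcover")                   # integer land-cover grid
arcpy.env.snapRaster = landcover                  # keep the random grid aligned
p = {1: 0.000149, 2: 0.000115, 3: 0.000198}       # p(i) from the calculation above

# [P]: threshold grid with the constant value p(i) inside each class of interest
P = None
for cls, p_i in p.items():
    piece = Con(landcover == cls, p_i)            # p(i) inside class, NoData elsewhere
    P = piece if P is None else Con(IsNull(P), piece, P)

# [R]: uniform random values in [0, 1) on the same extent and cell size
R = CreateRandomRaster(42, landcover.meanCellWidth, landcover.extent)

# Null out [R] wherever it exceeds [P]; the surviving cells are the candidates
candidates = SetNull(R > P, R)
candidates.save("candidates")
```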

This last grid will have slightly more than X = k(1) + k(2) + ... + k(N) (= 900 in the question) cells that are not null. At this juncture use your favorite method (there are many that will work) to

  1. combine it with the land class grid and

  2. output all tuples (i, R(i), x coordinate, y coordinate) as a point shapefile.

Post-process the shapefile by retaining only the lowest k(i) values for each class i: that's the desired sample. If you are so extremely unlucky as to not obtain enough points in some class, just repeat the whole process from the beginning. (Consider increasing the value of L when you do so.)
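
Continuing the sketch above, steps 1-2 and the post-processing could go through RasterToPoint plus a pair of arcpy.da cursors; the field names ("grid_code", "CLASS") and sample sizes here are assumptions:

```python
# Sketch of the export and post-processing; field names and paths are assumptions.
import arcpy
from collections import defaultdict

arcpy.CheckOutExtension("Spatial")
arcpy.env.workspace = r"C:\work\sampling.gdb"     # same hypothetical workspace

k = {1: 300, 2: 300, 3: 300}                      # k(i) for the classes of interest

# Steps 1-2: points carrying R(i) in "grid_code", plus the land class in "CLASS"
arcpy.RasterToPoint_conversion("candidates", "cand_pts", "VALUE")
arcpy.sa.ExtractMultiValuesToPoints("cand_pts", [["landcover", "CLASS"]])

# Post-process: keep only the k(i) points with the lowest random values per class
by_class = defaultdict(list)
with arcpy.da.SearchCursor("cand_pts", ["OID@", "grid_code", "CLASS"]) as cur:
    for oid, r_val, cls in cur:
        if cls in k:
            by_class[cls].append((r_val, oid))

keep = set()
for cls, pairs in by_class.items():
    if len(pairs) < k[cls]:
        raise RuntimeError("class %s came up short; rerun with a larger L" % cls)
    pairs.sort()                                  # ascending by random value
    keep.update(oid for _, oid in pairs[:k[cls]])

with arcpy.da.UpdateCursor("cand_pts", ["OID@"]) as cur:
    for (oid,) in cur:
        if oid not in keep:
            cur.deleteRow()
```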


Alternatively, doing the work in R would be simple, because it offers functions that find the lowest k(i) random values in each class directly, rather than having to take several steps and convert things into point shapefile form. The same conceptual workflow is involved.
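
For what it is worth, the same one-pass idea can be sketched with numpy instead of R (reading the class grid via arcpy.RasterToNumPyArray, say); this is an assumed equivalent of that direct approach, not a tested script:

```python
# Assumed numpy equivalent of the direct approach: take the k(i) cells with the
# smallest random values in each class, skipping the threshold/shapefile detour.
import numpy as np

def stratified_sample(classes, wanted, seed=None):
    """Boolean mask selecting wanted[c] random cells from each class c."""
    rng = np.random.RandomState(seed)
    r = rng.random_sample(classes.size)           # one uniform value per cell
    flat = classes.ravel()
    keep = np.zeros(classes.size, dtype=bool)
    for c, k in wanted.items():
        idx = np.flatnonzero(flat == c)
        if idx.size < k:
            raise ValueError("class %s has only %d cells" % (c, idx.size))
        lowest = idx[np.argsort(r[idx])[:k]]      # indexes of the k smallest values
        keep[lowest] = True
    return keep.reshape(classes.shape)

# e.g. mask = stratified_sample(class_array, {1: 300, 2: 300, 3: 300}, seed=42)
```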
