[GIS] Stratified Random Sampling with large dataset in ArcGIS

arcgis-10.2, arcgis-desktop, dissolve, large-datasets, merge

I want to create a Stratified Random Sample with a land cover classification map. The purpose of the Stratified Random Sample is to produce an Accuracy Assessment. I am using ArcGIS 10.2 with access to Spatial Analyst and all other tools.

I have six land cover types and I want 300 random points in each of the three classes I am interested in. My first step was converting my raster to polygon. The result gives me over 3 million polygons to work with.

I have tried the Dissolve tool and making multi-part polygons, but some of my classes are too large, which results in the error message "Topology Invalid (Out of Memory)".

I have tried using the Merge tool on the Editor toolbar. Again, because of the large number of polygons in some of my classes, it freezes and crashes ArcMap.

I would like to do the stratified random sample on a raster, rather than converting it to polygon form. Is this possible?

Are there any other tools/programs that I should use that may help combat my large dataset issue? I have consistently run into problems because I simply have too many polygons per land cover type.

Best Answer

Stay in raster format.

If ArcGIS provided a zonal percentile function, you could do this in a couple of easy steps. Since it doesn't, you have to work around that limitation. There are many ways to proceed. The key ideas are:

  1. Determine the proportion of cells in each class represented by the desired sample of 300.

  2. By thresholding a grid of uniform random values, you can identify slightly more than the number of points needed in each class.

  3. Extract these points, then post-process them by retaining only those having the lowest random values within each class.

This procedure obtains a simple random sample, without replacement and independently, of each land class.


Here are some details. Let there be N classes (N = 3 in the question), each with n(i) cells (i = 1, 2, ..., N). Suppose you intend to select k(i) cells at random from class i (k(i) = 300 for the class indexes i = 1, 2, 3 in the question). You will do that by generating a uniform random grid and comparing it to a threshold grid [P], which has the constant value p(i) at each cell in class i, with p(i) somewhere (to be chosen appropriately) between 0 and 1.

The result of this comparison in class i is a random count X(i) having a Binomial(n(i), p(i)) distribution. Its expectation therefore is m(i) = n(i)*p(i), which needs to be close to k(i). However, X(i) can randomly fall below that. Its variance is v(i) = n(i)*p(i)*(1-p(i)). To be really sure that X(i) will be at least k(i), we need a very small lower percentile of the distribution of X(i) to exceed k(i). You can work out what the resulting value of p(i) has to be by means of a statistical calculator, but an effective rule of thumb (for moderately small numbers of classes N) is to ensure that k(i) lies at least several standard deviations below m(i). Letting a reasonable starting value for "several" be L, we obtain the quadratic inequality

k(i) <= m(i) - L*sqrt(n(i)*p(i)*(1-p(i)))

whose approximate solution is

p(i) = (k(i)/n(i)) * (1 + z(i) + z(i)^2/2)

where

z(i) = L/sqrt(k(i)).

In the question, k(i) = 300 and it looks like the n(i) are in the millions. Taking L = 3, say, gives z(i) = 3 / sqrt(300), which is about 1/6 (and z(i)^2/2 is small enough to neglect). This tells us to make p(i) about 1/6 larger than the desired proportion k(i)/n(i); that is, to ask for about 1/6 more candidate points than we ultimately need.
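
As a quick sanity check, here is a minimal Python sketch of that calculation; the n(i) cell counts below are made-up placeholders, not values from the question:

```python
# Minimal sketch of the p(i) calculation; the n(i) cell counts are made-up examples.
from math import sqrt

L = 3                                   # "several" standard deviations of safety
k = 300                                 # desired sample size per class
n = {1: 2.4e6, 2: 3.1e6, 3: 1.8e6}      # hypothetical cell counts per class

z = L / sqrt(k)                         # z(i) = L / sqrt(k(i)), about 1/6 here
for i, n_i in sorted(n.items()):
    p_i = (k / n_i) * (1 + z + z ** 2 / 2)
    print("class %d: p(i) = %.6f, expecting about %.0f candidate cells"
          % (i, p_i, p_i * n_i))
```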

With this (simple) calculation out of the way,

  • Reclassify the land classes i into the values of the p(i) you computed. Call the resulting grid [P].

  • Create a uniform random grid [R].

  • Set the values of [R] to null wherever [R] > [P].
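
A possible arcpy (Spatial Analyst) rendering of these three steps is sketched below; the workspace, raster name, and p(i) values are assumptions, and this is outline code rather than a finished script:

```python
# Sketch of the three raster steps; workspace, raster name and p(i) are assumptions.
import arcpy
from arcpy.sa import Raster, Con, IsNull, CreateRandomRaster, SetNull

arcpy.CheckOutExtension("Spatial")
arcpy.env.workspace = r"C:\work\sampling.gdb"     # hypothetical workspace
arcpy.env.overwriteOutput = True

landcover = Raster("landcover")                   # integer land-cover grid
arcpy.env.snapRaster = landcover                  # keep the random grid aligned
p = {1: 0.000149, 2: 0.000115, 3: 0.000198}       # p(i) from the calculation above

# [P]: threshold grid with the constant value p(i) inside each class of interest
P = None
for cls, p_i in p.items():
    piece = Con(landcover == cls, p_i)            # p(i) inside class, NoData elsewhere
    P = piece if P is None else Con(IsNull(P), piece, P)

# [R]: uniform random values in [0, 1) on the same extent and cell size
R = CreateRandomRaster(42, landcover.meanCellWidth, landcover.extent)

# Null out [R] wherever it exceeds [P]; the surviving cells are the candidates
candidates = SetNull(R > P, R)
candidates.save("candidates")
```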

This last grid will have slightly more than X = k(1) + k(2) + ... + k(N) (= 900 in the question) cells that are not null. At this juncture use your favorite method (there are many that will work) to

  1. combine it with the land class grid and

  2. output all tuples (i, R(i), x coordinate, y coordinate) as a point shapefile.

Post-process the shapefile by retaining only the lowest k(i) values for each class i: that's the desired sample. If you are so extremely unlucky as to not obtain enough points in some class, just repeat the whole process from the beginning. (Consider increasing the value of L when you do so.)
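
Continuing the sketch above, steps 1-2 and the post-processing could go through RasterToPoint plus a pair of arcpy.da cursors; the field names ("grid_code", "CLASS") and sample sizes here are assumptions:

```python
# Sketch of the export and post-processing; field names and paths are assumptions.
import arcpy
from collections import defaultdict

arcpy.CheckOutExtension("Spatial")
arcpy.env.workspace = r"C:\work\sampling.gdb"     # same hypothetical workspace

k = {1: 300, 2: 300, 3: 300}                      # k(i) for the classes of interest

# Steps 1-2: points carrying R(i) in "grid_code", plus the land class in "CLASS"
arcpy.RasterToPoint_conversion("candidates", "cand_pts", "VALUE")
arcpy.sa.ExtractMultiValuesToPoints("cand_pts", [["landcover", "CLASS"]])

# Post-process: keep only the k(i) points with the lowest random values per class
by_class = defaultdict(list)
with arcpy.da.SearchCursor("cand_pts", ["OID@", "grid_code", "CLASS"]) as cur:
    for oid, r_val, cls in cur:
        if cls in k:
            by_class[cls].append((r_val, oid))

keep = set()
for cls, pairs in by_class.items():
    if len(pairs) < k[cls]:
        raise RuntimeError("class %s came up short; rerun with a larger L" % cls)
    pairs.sort()                                  # ascending by random value
    keep.update(oid for _, oid in pairs[:k[cls]])

with arcpy.da.UpdateCursor("cand_pts", ["OID@"]) as cur:
    for (oid,) in cur:
        if oid not in keep:
            cur.deleteRow()
```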


Alternatively, doing the work in R would be simple, because it offers functions that find the lowest k(i) random values in each class directly, rather than having to take several steps and convert things into point shapefile form. The same conceptual workflow is involved.
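
For what it is worth, the same one-pass idea can be sketched with numpy instead of R (reading the class grid via arcpy.RasterToNumPyArray, say); this is an assumed equivalent of that direct approach, not a tested script:

```python
# Assumed numpy equivalent of the direct approach: take the k(i) cells with the
# smallest random values in each class, skipping the threshold/shapefile detour.
import numpy as np

def stratified_sample(classes, wanted, seed=None):
    """Boolean mask selecting wanted[c] random cells from each class c."""
    rng = np.random.RandomState(seed)
    r = rng.random_sample(classes.size)           # one uniform value per cell
    flat = classes.ravel()
    keep = np.zeros(classes.size, dtype=bool)
    for c, k in wanted.items():
        idx = np.flatnonzero(flat == c)
        if idx.size < k:
            raise ValueError("class %s has only %d cells" % (c, idx.size))
        lowest = idx[np.argsort(r[idx])[:k]]      # indexes of the k smallest values
        keep[lowest] = True
    return keep.reshape(classes.shape)

# e.g. mask = stratified_sample(class_array, {1: 300, 2: 300, 3: 300}, seed=42)
```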
