[GIS] Using sampleRandom() from large raster without NA values in R

rrandomrastersampling

I have a number of vary large rasters which need to be randomly sampled with the return value being a matrix of x, y, and value. The raster package sampleRandom(raster,n, na.rm=TRUE, xy=TRUE) will do this just fine most of the time. When working correctly, this function returns a matrix of non-NA values for n coordinate pairs. When NA values come up in the sample, they are dropped and replaced by a non-NA value.

However, for my rasters (smallest being 4e^7 cells and some having a high percent of NA values), sampleRandom() returns a matrix substantially smaller than n coordinate pairs. Presumably, this due to sampled NA values, not being replaced when they are sampled.

Why does the sampleRandom function return incomplete results on the real-world data example?

Ss @Radar correctly pointed out, the documentation for the raster packages states:
With argument na.rm=TRUE, the returned sample may be smaller than requested

With this, my question becomes; how do I draw a work around this and efficiently draw random sample of n coordinate pairs?

Example 1: this works correctly in retrieving a random sample of n from a larger raster that is cropped and masked by spatial polygons. returns a matrix of 2000 cordiante points.

region1 <- rbind(c(0,0), c(50,0), c(50,50), c(20,20), c(0,0))
region2 <- rbind(c(50,0), c(80,0), c(100,50), c(60,40), c(80,20), c(50,0))
polys <- SpatialPolygons(list(Polygons(list(Polygon(region1)), "region1"),
                        Polygons(list(Polygon(region2)), "region2")))

r <- raster(ncol=1000, nrow=1000)
r[] <- runif(ncell(r),0,1)
extent(r) <- matrix(c(0, 0, 1000, 1000), nrow=2)

r_crop <- crop(r, extent(polys), snap="out", progress='text')
r_mask <- mask(r_crop, polys) 

plot(r_mask)
plot(polys, add=TRUE)

x <- sampleRandom(r_mask,2000, na.rm=TRUE, xy=TRUE)
nrow(x)

>[1] 2000

results of crop and mask for sample data

Example 2: The next example is with real data that consist of a universal raster (geo.r) of 2e^8 cells and a subset of spatial polygons (geo.poly) that contain 1200 polygons and is a smaller extent than geo.r. This code incorrectly results in a matrix of much less than n; depending on the random sample. a few runs produce a matrix of between 3 and 117 non-NA coordinate pairs.

require(maptools)
Prj <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
modeling_areas_SHP <- "C:/.../modeling_areas_dissolve.shp"
geo.polys <- readShapePoly(modeling_areas_SHP, IDvar="area_ID", proj4string=CRS(Prj))
geo.poly <- modeling_areas[modeling_areas$area_ID == i,] #subset the shapefile

geo.r <- raster("C:/.../cost_raster")

geo.r_crop <- crop(geo.r, extent(geo.poly), snap="out", progress='text')
geo.r_mask <- mask(geo.r_crop, geo.poly, progress='text')

plot(geo.r_mask)
plot(geo.poly, add=TRUE)

x <- sampleRandom(geo.r_mask,2000, na.rm=TRUE, xy=TRUE)
nrow(x)

>[1] 117

results of crop and mask for real data

To me at least, the above examples are the same except for the overall size of the rasters and complexity of the polygons; two very important factors. Obviously I cannot provide the real world data because of file sizes, but I hope the code will suffice.

How do I fix this?


I used this work around hack, but it is not super efficient. However, was more efficient than using spsample() from the "sp" package.

micro_sample = 50000
tmp_rand_smple <- data.frame(x = numeric(0), y = numeric(0), layer = numeric(0))
            while(nrow(tmp_rand_smple) < micro_sample){
                tmp_smple <- data.frame(sampleRandom(geo.r_mask,10000, na.rm=TRUE, xy=TRUE)) ### 10k is an arbitrary chunk, loops until > micro_sample
                tmp_rand_smple <- rbind(tmp_rand_smple, tmp_smple)
                tmp_rand_smple <- unique(tmp_rand_smple[c("x", "y", "layer")]) # remove any duplicate coordinate pairs
            }
            tmp_rand_smple <- tmp_rand_smple[1:micro_sample,] # trim to length of micro_sample

Example 3: Here is an example of the above code that can be reproduced with the linked shapefile. On my computer this fails at returning the required number of random sample
https://www.dropbox.com/s/7poaqcxju808arw/riverine_region_1.zip

Prj <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
geo.poly <- readShapePoly("FILE LOCATION:/riverine_region_1", IDvar="area_ID", proj4string=CRS(Prj))  ## set file location

r <- raster(ncol=5202, nrow=8182)
r[] <- runif(ncell(r),0,1)
extent(r) <- matrix(c( 1533500, 592219.7, 1447689, 537662.6), nrow=2)

r_crop <- crop(r, extent(geo.poly), snap="out", progress='text')
r_mask <- mask(r_crop, geo.poly, progress='text') 

plot(r_mask)
plot(geo.poly, add=TRUE)

x <- sampleRandom(r_mask,2000, na.rm=TRUE, xy=TRUE)
nrow(x)

Best Answer

Just pad your desired number of random samples and then sample back down to the correct n. This should account for the occasional NA that are produced and subsequently removed with the na.rm=TRUE argument.

    require(raster)
    # Create example data
    r1 <- raster(ncols=500, nrows=500, xmn=0)
      r1[] <- runif(ncell(r1))
    r2 <- raster(ncols=500, nrows=500, xmn=0)
      r2[] <- runif(ncell(r2))  
    r <- stack(r1,r2)

    # Sample size
    n=50

    # Random sample of raster  
    r.samp <- sampleRandom(r, size=(n+20), na.rm=TRUE, sp=FALSE, asRaster=FALSE) 
      dim( r.samp )[1]

   # Create a random sample of n size to subset r.samp
   #   (works with dataframe, matrix and sp objects)
   r.samp <- r.samp[sample( 1:dim(r.samp)[1], n),]
    dim ( r.samp )[1]

If you can read the raster into memory an approach in sp would be to use rgdal to create a SpatialGridDataFrame the coerce it into a SpatialPointsDataFrame so you can easily remove NA's and end up with a point object of your subsample. You can then sample subsequent rasters using this sp point object. The @data dataframe can be extract and coerced into a matrix for your purposes.

require(sp)
require(rgdal)
require(raster)

n=50 # Number of random samples

# Read raster data using rgdal, results in SpatialGridDataFrame 
r <- readGDAL(system.file("external/test.ag", package="sp")[1])
  class(r)
    spplot(r, "band1")

# Coerce into SpatialPointsDataFrame    
r <- as(r, "SpatialPointsDataFrame")      

# remove NA's   
r@data <- na.omit(r.pts@data)
  plot(r, pch=20)

# Create random sample. Object is a SpatialPointsDataFrame     
r.samp <- r[sample(1:dim(r)[1], n),]
  plot(r.samp, pch=20, col="red", add=TRUE)   
    class(r.samp)

#  Use r.samp sp object for additional sampling 
#    Add extra column and coerce to raster stack
r2 <- readGDAL(system.file("external/test.ag", package="sp")[1])
  r2@data <- data.frame(r2@data, band2=runif(dim(r2)[1]) ) 
    r2 <- stack(r2)

# Extract raster values using r.samp object
r.samp@data <- data.frame(r.samp@data, band2=extract(r2[[2]], r.samp))
  str(r.samp@data)