[GIS] Using sampleRandom() from large raster without NA values in R

rrandomrastersampling

I have a number of vary large rasters which need to be randomly sampled with the return value being a matrix of x, y, and value. The raster package sampleRandom(raster,n, na.rm=TRUE, xy=TRUE) will do this just fine most of the time. When working correctly, this function returns a matrix of non-NA values for n coordinate pairs. When NA values come up in the sample, they are dropped and replaced by a non-NA value.

However, for my rasters (smallest being 4e^7 cells and some having a high percent of NA values), sampleRandom() returns a matrix substantially smaller than n coordinate pairs. Presumably, this due to sampled NA values, not being replaced when they are sampled.

Why does the sampleRandom function return incomplete results on the real-world data example?

Ss @Radar correctly pointed out, the documentation for the raster packages states:
With argument na.rm=TRUE, the returned sample may be smaller than requested

With this, my question becomes; how do I draw a work around this and efficiently draw random sample of n coordinate pairs?

Example 1: this works correctly in retrieving a random sample of n from a larger raster that is cropped and masked by spatial polygons. returns a matrix of 2000 cordiante points.

region1 <- rbind(c(0,0), c(50,0), c(50,50), c(20,20), c(0,0))
region2 <- rbind(c(50,0), c(80,0), c(100,50), c(60,40), c(80,20), c(50,0))
polys <- SpatialPolygons(list(Polygons(list(Polygon(region1)), "region1"),
                        Polygons(list(Polygon(region2)), "region2")))

r <- raster(ncol=1000, nrow=1000)
r[] <- runif(ncell(r),0,1)
extent(r) <- matrix(c(0, 0, 1000, 1000), nrow=2)

r_crop <- crop(r, extent(polys), snap="out", progress='text')
r_mask <- mask(r_crop, polys) 

plot(r_mask)
plot(polys, add=TRUE)

x <- sampleRandom(r_mask,2000, na.rm=TRUE, xy=TRUE)
nrow(x)

>[1] 2000

results of crop and mask for sample data

Example 2: The next example is with real data that consist of a universal raster (geo.r) of 2e^8 cells and a subset of spatial polygons (geo.poly) that contain 1200 polygons and is a smaller extent than geo.r. This code incorrectly results in a matrix of much less than n; depending on the random sample. a few runs produce a matrix of between 3 and 117 non-NA coordinate pairs.

require(maptools)
Prj <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
modeling_areas_SHP <- "C:/.../modeling_areas_dissolve.shp"
geo.polys <- readShapePoly(modeling_areas_SHP, IDvar="area_ID", proj4string=CRS(Prj))
geo.poly <- modeling_areas[modeling_areas$area_ID == i,] #subset the shapefile

geo.r <- raster("C:/.../cost_raster")

geo.r_crop <- crop(geo.r, extent(geo.poly), snap="out", progress='text')
geo.r_mask <- mask(geo.r_crop, geo.poly, progress='text')

plot(geo.r_mask)
plot(geo.poly, add=TRUE)

x <- sampleRandom(geo.r_mask,2000, na.rm=TRUE, xy=TRUE)
nrow(x)

>[1] 117

results of crop and mask for real data

To me at least, the above examples are the same except for the overall size of the rasters and complexity of the polygons; two very important factors. Obviously I cannot provide the real world data because of file sizes, but I hope the code will suffice.

How do I fix this?

I used this work around hack, but it is not super efficient. However, was more efficient than using spsample() from the "sp" package.

micro_sample = 50000
tmp_rand_smple <- data.frame(x = numeric(0), y = numeric(0), layer = numeric(0))
            while(nrow(tmp_rand_smple) < micro_sample){
                tmp_smple <- data.frame(sampleRandom(geo.r_mask,10000, na.rm=TRUE, xy=TRUE)) ### 10k is an arbitrary chunk, loops until > micro_sample
                tmp_rand_smple <- rbind(tmp_rand_smple, tmp_smple)
                tmp_rand_smple <- unique(tmp_rand_smple[c("x", "y", "layer")]) # remove any duplicate coordinate pairs
            }
            tmp_rand_smple <- tmp_rand_smple[1:micro_sample,] # trim to length of micro_sample

Example 3: Here is an example of the above code that can be reproduced with the linked shapefile. On my computer this fails at returning the required number of random sample
https://www.dropbox.com/s/7poaqcxju808arw/riverine_region_1.zip

Prj <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
geo.poly <- readShapePoly("FILE LOCATION:/riverine_region_1", IDvar="area_ID", proj4string=CRS(Prj))  ## set file location

r <- raster(ncol=5202, nrow=8182)
r[] <- runif(ncell(r),0,1)
extent(r) <- matrix(c( 1533500, 592219.7, 1447689, 537662.6), nrow=2)

r_crop <- crop(r, extent(geo.poly), snap="out", progress='text')
r_mask <- mask(r_crop, geo.poly, progress='text') 

plot(r_mask)
plot(geo.poly, add=TRUE)

x <- sampleRandom(r_mask,2000, na.rm=TRUE, xy=TRUE)
nrow(x)

Best Answer

Just pad your desired number of random samples and then sample back down to the correct n. This should account for the occasional NA that are produced and subsequently removed with the na.rm=TRUE argument.

    require(raster)
    # Create example data
    r1 <- raster(ncols=500, nrows=500, xmn=0)
      r1[] <- runif(ncell(r1))
    r2 <- raster(ncols=500, nrows=500, xmn=0)
      r2[] <- runif(ncell(r2))  
    r <- stack(r1,r2)

    # Sample size
    n=50

    # Random sample of raster  
    r.samp <- sampleRandom(r, size=(n+20), na.rm=TRUE, sp=FALSE, asRaster=FALSE) 
      dim( r.samp )[1]

   # Create a random sample of n size to subset r.samp
   #   (works with dataframe, matrix and sp objects)
   r.samp <- r.samp[sample( 1:dim(r.samp)[1], n),]
    dim ( r.samp )[1]

If you can read the raster into memory an approach in sp would be to use rgdal to create a SpatialGridDataFrame the coerce it into a SpatialPointsDataFrame so you can easily remove NA's and end up with a point object of your subsample. You can then sample subsequent rasters using this sp point object. The @data dataframe can be extract and coerced into a matrix for your purposes.

require(sp)
require(rgdal)
require(raster)

n=50 # Number of random samples

# Read raster data using rgdal, results in SpatialGridDataFrame 
r <- readGDAL(system.file("external/test.ag", package="sp")[1])
  class(r)
    spplot(r, "band1")

# Coerce into SpatialPointsDataFrame    
r <- as(r, "SpatialPointsDataFrame")      

# remove NA's   
r@data <- na.omit(r.pts@data)
  plot(r, pch=20)

# Create random sample. Object is a SpatialPointsDataFrame     
r.samp <- r[sample(1:dim(r)[1], n),]
  plot(r.samp, pch=20, col="red", add=TRUE)   
    class(r.samp)

#  Use r.samp sp object for additional sampling 
#    Add extra column and coerce to raster stack
r2 <- readGDAL(system.file("external/test.ag", package="sp")[1])
  r2@data <- data.frame(r2@data, band2=runif(dim(r2)[1]) ) 
    r2 <- stack(r2)

# Extract raster values using r.samp object
r.samp@data <- data.frame(r.samp@data, band2=extract(r2[[2]], r.samp))
  str(r.samp@data)

Related Solutions

[GIS] Sampling random points from large raster file with replacement using R

Try the following code. The idea is to create a set of [row, col] indicators based on the RasterStack dimensions (rows and cols). Than you can easily used these indicators on the stack to subset all values.

The function below sampleStack gets a RasterStack and n number of values to sample, and gives a data frame with [row, col] positions and values extracted by layer.

library(raster)
# Generate raster layers

r <- raster(matrix(rnorm(100, 0, 1), nrow = 10))
for (i in 1:5) {r <- stack(r, raster(matrix(rnorm(100, 0, 1), nrow = 10)))} # run 5 times

sampleStack <- function(r, n) {
  rowSample <- sample(1:r@nrows, size = n, replace = TRUE)
  colSample <- sample(1:r@ncols, size = n, replace = TRUE)
  pairs <- data.frame("rowInd" = rowSample, "colInd" = colSample)
  out <- as.data.frame(cbind(pairs, as.data.frame(t(apply(pairs, MARGIN = 1, function(x) {return(r[x[1],x[2]])})))))
  colnames(out)[3:ncol(out)] <- names(r)
  return(out)
}

# Example Run
sampleStack(r = r, n = 14)

   rowInd colInd   layer.1.1  layer.2.1   layer.1.2  layer.2.2    layer.1    layer.2
1       5      4 -0.09678111  1.6844843  0.82574090  0.5328165  0.66721846  0.1936958
2       4      8  0.64982724  0.5467126  1.59975344  0.2757094  0.94797866 -0.1798319
3       3      4 -1.35927393 -1.2774878  0.77616160  0.1429519  1.10396643  1.1444793
4       7      9  1.45380719 -0.6128730 -0.53011041 -0.3138787 -0.86586255  0.9056694
5       6      2 -0.49808353 -0.1272448 -1.96004940 -0.5663870 -0.07217682 -1.7568981
6       4     10  0.01607546 -0.5113896  1.19713933 -1.6322803 -1.04051134  0.7135125
7       2      4  1.10798593 -0.2610036  0.56009222  2.4618433 -0.44356484  1.0332427
8       3      1  1.70168812  1.7643488 -0.09976064 -0.2386893 -1.04266622  0.2019014
9      10      3 -1.82791351  1.5666126 -1.79275437  0.2946007  0.96467732 -0.6951626
10      6      4  0.54795051  0.1378088 -1.53793046 -0.5989934 -1.64424273 -0.3463153
11      1      2 -0.86287672 -0.2408750 -0.81438516 -2.0200205  1.16523355  0.4052408
12      2      2 -0.47916254 -0.6778470  0.79086436 -0.5692255  0.96205715 -0.5146865
13      5      4 -0.09678111  1.6844843  0.82574090  0.5328165  0.66721846  0.1936958
14      6      1 -1.04973148 -0.3973457 -0.24445969  0.4061588 -1.50143806  0.3896232

[GIS] Randomly generate points using weights from raster

I don't understand the "minimum distance" thing you mentioned.

Here's how to generate 1000 points uniformly within cells but with the number in each cell weighted by the cell value:

Make a test 3x4 raster with some positive random numbers:

> set.seed(12)
> r = raster(matrix(runif(12),3,4))

Get the cell half-width for later:

> hs = res(r)/2

Now work out which cell each of our 1000 points is going in by sampling from the number of cells (12) with replacement, weighted by the value in the cells:

> ptscell = sample(1:12, 1000, prob=r[], replace=TRUE)

Now find the centre of those 1000 cell numbers:

> centres = xyFromCell(r,ptscell)

And generate random uniform points in the cell by using the centre and the half-width/height from earlier:

> pts = cbind(runif(nrow(centres),centres[,1]-hs[1],centres[,1]+hs[1]),runif(nrow(centres),centres[,2]-hs[2],centres[,2]+hs[2]))

Voila:

> plot(r)
> points(pts)

Best Answer

Related Solutions

[GIS] Sampling random points from large raster file with replacement using R

[GIS] Randomly generate points using weights from raster

Related Question