[GIS] Efficiently process large rasters

r raster

I am trying to process some raster layers from Global Forest Change (https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.4.html, data produced by Hansen et al., 2013). However, even though I am working on a workstation with 32 GB of RAM, everything runs extremely slowly.

I am aware that the data I am working with is heavy. Each raster has 40000×40000 cells and, while some of the rasters only weigh around 20 MB, others go as high as 600 MB.

Procedures such as reclassify() and aggregate() take as long as ~20-30 minutes even for the 20 MB layers. Taking into account that I need to download and process some 2×500 tiles… it is going to take ages.

Is there an efficient way to deal with this kind of data? Is there something more efficient than the raster library (which is awesome, by the way) for this particular workflow?

The rasters are downloaded as separate tiles that form a larger grid spanning the whole world, and there is a separate link for each block (e.g.: the first one is https://storage.googleapis.com/earthenginepartners-hansen/GFC-2016-v1.4/Hansen_GFC-2016-v1.4_treecover2000_00N_000E.tif). This is the part of my code that processes each tile:

library(raster)

down_links <- read.table("https://storage.googleapis.com/earthenginepartners-hansen/GFC-2016-v1.4/treecover2000.txt")

x <- 1
temp_tc <- tempfile(fileext = ".tif")
download.file(as.character(down_links[x, ]), destfile = temp_tc, mode = "wb") # down_links is the table of links mentioned above
tc <- raster(temp_tc)

# Reclassify raster: 0-20 % tree cover becomes 0, 20-100 % becomes 1
rcl <- data.frame(from = c(0, 20), to = c(20, 100), becomes = c(0, 1))
tc2 <- reclassify(tc, rcl)

# Calculate cell areas (km² per cell for unprojected lon/lat rasters)
layer_area <- area(tc2)
tc3 <- tc2 * layer_area

# Aggregate raster to a larger pixel size
tc4 <- aggregate(tc3, fact = 32, fun = sum)

unlink(temp_tc) # delete the temporary file only once processing is finished

Best Answer

One approach that helps prevent overloading your RAM when working with large files in the raster package is to write your transformed rasters to file (writeRaster()) and then read them back into the workspace (raster("path")). Where you assign tc2, tc3 and tc4, those objects are held entirely in memory, whereas a raster read from file only loads the data structure: the cell values are not kept in memory but are looked up from disk as needed.
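A minimal sketch of that pattern, applied to the reclassification, area-weighting and aggregation steps from the question (the output file names are illustrative, and tc is assumed to be the tile loaded in the question's code):

library(raster)

# Write each intermediate result to disk, then re-open it so that only the
# header is held in memory and cell values are read from file as needed
rcl <- data.frame(from = c(0, 20), to = c(20, 100), becomes = c(0, 1))
tc2 <- reclassify(tc, rcl)
writeRaster(tc2, filename = "tc2_reclassified.tif", overwrite = TRUE)
tc2 <- raster("tc2_reclassified.tif") # file-backed: values stay on disk

tc3 <- tc2 * area(tc2) # cell areas in km2 for lon/lat rasters
writeRaster(tc3, filename = "tc3_area.tif", overwrite = TRUE)
tc3 <- raster("tc3_area.tif")

tc4 <- aggregate(tc3, fact = 32, fun = sum)
writeRaster(tc4, filename = "tc4_aggregated.tif", overwrite = TRUE)

Many raster functions (reclassify(), aggregate(), ...) also take a filename argument, which writes the result straight to disk and returns a file-backed object in one step.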

However, applying functions like aggregate() to these file-backed objects may still be slow!
