[GIS] How to do regression analysis out-of-memory on a set of large rasters in R

Tags: r, raster

I am trying to do a regression analysis on a set of large rasters (254,004,000 cells each). Ultimately, I want to run something like the following (or a bit more complex, but let's start simple!):

model <- lm(dv ~ iv1 + iv2 + iv3 + …, data = df, na.action = na.exclude)

where "dv" is the values from one raster and "iv1", "iv2" "iv3" … are values from other rasters (up to 10 variables) with the same extent and resolution. It seems I should be able to do this out-of-memory using the Raster package, but I am confused how. Whether I create a brick, stack, or set of individual Raster objects, I cannot figure out how to send the variables to the lm function without using getValues and thus calling everything into memory (mine cannot even handle two variables).

A point in the right direction would be much appreciated!

Best Answer

The help for lm references biglm:

biglm in package biglm for an alternative way to fit linear models to large datasets.

The help pages for biglm indicate this package was developed for precisely such problems. The algorithm it references, AS274, is an updating procedure, allowing a solution based on a subset of the cases (cells) to be modified as additional cases are given.
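As a rough sketch of how that updating procedure can be combined with the raster package (the file and layer names here are assumptions; substitute your own rasters), you can feed biglm one block of cells at a time using blockSize() and getValues(), updating the fit as you go:

library(raster)
library(biglm)

# Hypothetical inputs; replace with your own rasters
s <- stack("dv.tif", "iv1.tif", "iv2.tif", "iv3.tif")
names(s) <- c("dv", "iv1", "iv2", "iv3")

bs  <- blockSize(s)   # row-wise chunks small enough to hold in memory
fit <- NULL
for (i in seq_len(bs$n)) {
  vals <- as.data.frame(getValues(s, row = bs$row[i], nrows = bs$nrows[i]))
  vals <- vals[complete.cases(vals), ]   # biglm does not accept NA values
  if (nrow(vals) == 0) next
  if (is.null(fit)) {
    fit <- biglm(dv ~ iv1 + iv2 + iv3, data = vals)   # initial fit on the first block
  } else {
    fit <- update(fit, vals)                          # update the fit with each new block
  }
}
summary(fit)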

Although this package appears to solve the problem, performing regression on such enormous datasets is (a) likely to be meaningless and (b) ignores opportunities to learn much more about the data. It is almost surely the case that posited relationships among the variables will change from one location to another. Why not capitalize on the size of the dataset, then, and conduct separate regressions within various windows or tiles of the data? For instance, you could tile your raster area into a 10 by 10 grid, reducing the size of each raster to less than three million cells, making in-memory calculations not only feasible but fast. If these regressions produce significantly different results you will have learned much (and will have avoided the error of combining them into one global regression); if they do not produce different results, you already have estimates of the global regression and you can justify regressing the entire dataset if you wish to do that.
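As one possible sketch of that idea (again assuming the stack s defined above, with layers dv, iv1, iv2, iv3), you could crop the stack to each tile of a 10 by 10 grid and fit an ordinary lm within each tile:

n  <- 10                                 # 10 by 10 grid of tiles
xb <- seq(xmin(s), xmax(s), length.out = n + 1)
yb <- seq(ymin(s), ymax(s), length.out = n + 1)

coefs <- list()
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    e    <- extent(xb[i], xb[i + 1], yb[j], yb[j + 1])
    vals <- as.data.frame(getValues(crop(s, e)))
    vals <- vals[complete.cases(vals), ]
    if (nrow(vals) < 30) next            # skip tiles with too few usable cells
    coefs[[paste(i, j, sep = "_")]] <- coef(lm(dv ~ iv1 + iv2 + iv3, data = vals))
  }
}

# How much do the fitted coefficients vary from tile to tile?
summary(do.call(rbind, coefs))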

It would likely be worthwhile to go even further and explore how the regressions vary with tile size. Much could be said about what tile sizes would be appropriate. I will limit my comments to just two simple ones. First, it would be a good idea to focus on sizes that are substantially larger than the longest range of spatial covariance of any of the variables. Second, the tiles need to be small enough to make computation practicable. There may be a wide range of choices between these two extremes.
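One hedged way to explore the effect of tile size is to wrap the per-tile regression above in a function and compare how the coefficients spread out across tiles for a few candidate grid sizes (the sizes 4, 10, and 20 below are only illustrative):

tile_coefs <- function(s, n) {
  xb <- seq(xmin(s), xmax(s), length.out = n + 1)
  yb <- seq(ymin(s), ymax(s), length.out = n + 1)
  out <- list()
  for (i in seq_len(n)) for (j in seq_len(n)) {
    e    <- extent(xb[i], xb[i + 1], yb[j], yb[j + 1])
    vals <- as.data.frame(getValues(crop(s, e)))
    vals <- vals[complete.cases(vals), ]
    if (nrow(vals) >= 30) out[[paste(i, j)]] <- coef(lm(dv ~ iv1 + iv2 + iv3, data = vals))
  }
  do.call(rbind, out)
}

# Spread of each coefficient across tiles, for a few grid sizes
lapply(setNames(c(4, 10, 20), c("4x4", "10x10", "20x20")),
       function(n) apply(tile_coefs(s, n), 2, sd))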