Raster Cell Summary Statistics – Differences Between R, QGIS, and ArcGIS

arcgis-desktopcell-statisticsqgisrraster

I'm new to GIS and I have a question related to the use of different programs (R, QGIS and ArcGIS) to do summary statistics. I've used R and QGIS to sum the cell values of my raster (using respectively the cellStats function from R (as follows) and the Raster layer statistics from the QGIS toolbox) and in both cases I get the same result.

Following is a link to the data I'm using: https://1drv.ms/u/s!Aod2icrdkQPXgSDjgZg3O_NIO71Y

In R, I do (with raster package):

my_raster <- raster("path_to_my_raster") 
sum_raster_cells <-cellStats(my_raster, 'sum')

In QGIS, I do:

However, when I use the arcpy.Statistics_analysis from ArcGIS (as follows), I get a different result. The later actually seems to give a more accurate result than the one that I found with R and QGIS, but I wanted to understand why these results are different. Why isn't the total sum equal when using the same data?

import arcpy 
arcpy.env.workspace = "path_where_gdb_is"
arcpy.Statistics_analysis("GPW_ac_ad1_15","output_path",[["SUM_E_ATOT","SUM"]])

Best Answer

**** Edit, this is an example of why procedural details and a reproducible example are important. It looks like your data was summarized to zones. In R, and plausibly QGIS, you are attempting to sum each pixel and not take the sum of the zones. In R if I use sum(unique(my_raster[!is.na(my_raster)])) I get 7,647,114 however if I just sum the entire raster the result is 12,010,503,977. The use of unique collapses the values to a single value per zone thus, yielding the correct sum. ****

Functionally, a raster is nothing more than a matrix or array (stack of matrices). In R the raster package provides special object classes for rasters along with functions but, you can still treat it as a matrix/array object. Since you can treat a raster as a matrix or vector object (through coercion), it is quite easy to retrieve the full values of a given raster in ways independent of cellStats as to test results. Here we can directly observe the behavior and consistency of derived statistics around raster objects.

Let's add the raster library, create a matrix (rmat) and an associated raster (r) to work with.

library(raster)

( rmat <- matrix(round(runif(100, 0,1500),0), nrow=10, ncol=10) )
( r <- raster(rmat) )
  plot(r)

Now, we can sum the matrix as well as the raster. Since the raster is not large (resulting in it being in memory), the function will decide to not use a sampling approach so, the results should be exact.

sum(rmat)
cellStats(r, sum)

We can also collapse the entire raster into a vector and return the sum;

r[]
sum(r[])

and directly compare it to the original matrix by doing the same (unnecessarily because we just sum the matrix directly).

as.vector(rmat)
sum(as.vector(rmat))

Directly observing the functional behavior of objects gives a clearer understanding of what is going on behind the scenes here. Since it is possible to coerce the data (eg., r[]) into a simple object, such as a vector or matrix, you can have confidence in the results and easily track down issues when they arise.

Related Solutions

Zonal Statistics Tool for Median and Mode Raster Values

I haven't calculated mode values before (most frequently occurring value), but at least for quantile values you could try to use the plugin I coded (LecoS). Although not primarily intended for this use it is able to calculate some "zonal statistics" for polygon overlays.
How to:

Install numpy and scipy for your System and QGIS. On Windows choose the OSGEO4W advanced Install and select python-numpy and python-scipy. On Linux consult your linux package manager/compiler.
Download the experimental(!) LecoS plugin via the plugin downloader. The rest should be pretty self-explaining.

Please pay attention that the plugin is still experimental and calculation might fail sometimes! If you want a clear procedure suited for your needs than i would advise you to look into numpy and also R with the raster and sp packages. I am also happy to see detailed bug reports and i will do the best i can to fix them. However if the calculation fails for you, you better lookout for other tools as well.

[GIS] Use Values from Raster Statistics (e.g. sum of all cells) in Raster Calculator

A calculation like

A*B*C / Sum (A*B*C)

will perform the A*B*C operation twice. To avoid that duplicate effort instead do it in two steps

X = A*B*C
X / Sum(X)

Storing A*B*C has an immediate payoff. To implement the Sum operation, use a zonal summary operator with the entire raster as the zone. That requires placing a constant, non-null value at every non-null cell of the raster. A simple way to accomplish this is to equate the raster with itself, thus:

ZonalSum(X, X==X)

(The syntax for ZonalSum will depend on the platform and the version of the software.)

The full workflow therefore is

X = A*B*C
X / ZonalSum(X, X==X)

Best Answer

Related Solutions

Zonal Statistics Tool for Median and Mode Raster Values

[GIS] Use Values from Raster Statistics (e.g. sum of all cells) in Raster Calculator

Related Question