[GIS] Standard deviation and average over 1000 images with random null values

raster

I have a large collection of raster files that I would like to calculate the standard deviation over. The first issue is that ArcGIS cannot process more than 1000 rasters via Cell Statistics, so I decided to go another route: summing the squared values of all the rasters and then deriving the variance and standard deviation from the sums. The problem is that each raster may have different areas with values, and some areas with null.

four images to be processed

If I were to script an iterative function that sums the rasters, the output would be completely null, since with a large quantity of these maps every cell would eventually encounter a null value, and any null in the sum makes the result null.
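The sum-of-squares approach can avoid this null propagation if a per-cell count of non-null layers is tracked alongside the running sums. A minimal base-R sketch (using plain matrices with made-up values as stand-ins for rasters) of recovering the per-cell variance and standard deviation from counts, sums, and sums of squares:

```r
# Three small "rasters" as plain matrices (hypothetical values),
# with null (NA) cells in different places in each one
set.seed(42)
rasters <- lapply(1:3, function(i) {
  m <- matrix(runif(9), nrow = 3)
  m[sample(9, 2)] <- NA
  m
})

# NA-aware accumulators: a null cell is skipped, not treated as zero
r.n   <- Reduce(`+`, lapply(rasters, function(m) !is.na(m)))                # count of non-null layers
r.sum <- Reduce(`+`, lapply(rasters, function(m) ifelse(is.na(m), 0, m)))   # running sum
r.ssq <- Reduce(`+`, lapply(rasters, function(m) ifelse(is.na(m), 0, m^2))) # running sum of squares

# Per-cell sample variance and standard deviation (valid where r.n >= 2)
r.var <- (r.ssq - r.sum^2 / r.n) / (r.n - 1)
r.sd  <- sqrt(r.var)
```

Because only the counts, sums, and sums of squares are kept, the accumulation scales to thousands of rasters read from disk one at a time.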

I then tried using another program, Spirits (http://spirits.jrc.ec.europa.eu/overview/about/). While this software was able to provide an image like this:

spirits generated std

The problem with this output is that the null values were converted to 0.

I am looking for a method to process a large time series of these images that ignores the null values in its calculations and can process over 1000 images at once. Do you have any recommendations?

Best Answer

You can do this very easily in R using the overlay function in the raster package.

For demonstration purposes I simulate a raster stack object containing all of the rasters. In a real analysis this object would be a pointer to the rasters on disk, and raster would read them in blocks to keep the problem memory safe.

library(raster)
library(rgdal)

# Simulate a 6-layer stack of random rasters
r <- raster(ncols = 100, nrows = 100)
r[] <- runif(ncell(r))
r <- stack(r)
for (i in 2:6) {
  cat("layer", i, "\n")
  r.i <- raster(r)          # new layer with the same extent/resolution
  r.i[] <- runif(ncell(r.i))
  r <- addLayer(r, r.i)
}

You could create a raster stack, from rasters on disk, using the list.files function with a filename pattern to read all rasters in a directory. In this example the "r" object would represent a stack of all tiff rasters in "C:/mydir".

r <- stack(list.files("C:/mydir", pattern = "tif$", full.names = TRUE))

To calculate the standard deviation you would use overlay and pass it the sd function with the na.rm = TRUE argument so that nodata values are ignored.

r.sd <- overlay(r,  fun = sd, na.rm = TRUE) 
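At a single cell, na.rm = TRUE simply drops the null layers before computing the statistic; a quick base-R illustration of the difference:

```r
# Values of one cell across five layers, two of them null
cell <- c(0.2, NA, 0.5, 0.8, NA)

sd(cell)                # NA: a single null makes the whole result null
sd(cell, na.rm = TRUE)  # 0.3: the nulls are dropped, sd of 0.2, 0.5, 0.8
```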

Keep in mind that the mean and standard deviation assume a Gaussian distribution; for skewed distributions they are no longer meaningful moments. For a spatial dataset this large, skewed "at pixel" distributions are possible and, depending on the data, you could also fall into the "taking the mean of a mean" trap. If you need to take the mean of a derived mean (e.g., a mean of monthly precipitation means), you would use the harmonic mean and not the arithmetic mean.
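Base R has no built-in harmonic mean, so a minimal helper (the function name is my own, for illustration) shows how it is computed:

```r
# Harmonic mean: reciprocal of the arithmetic mean of the reciprocals;
# defined for strictly positive values (helper name is illustrative)
harmonic.mean <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  length(x) / sum(1 / x)
}

x <- c(2, 4, 8)
harmonic.mean(x)  # 24/7, about 3.43, always <= the arithmetic mean (~4.67)
```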

For a distribution-independent measure of variation, akin to the standard deviation or variance, you could use the median absolute deviation from the median (MAD), with the median for the central tendency. The mad function in R adjusts the coefficient for asymptotically normal consistency.

r.mad <- overlay(r,  fun = mad, na.rm = TRUE)   
r.median <- overlay(r,  fun = median, na.rm = TRUE) 
plot(r.mad)
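The consistency adjustment mentioned above is a multiplicative constant (1.4826 by default) applied to the raw median absolute deviation, so that for normally distributed data the MAD estimates the standard deviation; this is easy to verify in base R:

```r
x <- c(1, 2, 3, 5, 9, 40)  # small skewed sample with an outlier

median(abs(x - median(x)))  # 2.5, the raw (unscaled) MAD
mad(x, constant = 1)        # 2.5, the same thing
mad(x)                      # 1.4826 * 2.5 = 3.7065, comparable to sd under normality
```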

If you read the function's help, invoked with ?overlay, you will see that one of the arguments is "filename". If you specify this argument, a raster will be written to disk. The format can be defined through additional arguments specifying format, bit-type, etc., or just by the file extension (the easy way).

r.mad <- overlay(r, fun = mad, na.rm = TRUE, filename = "C:/mydir/raster_mad.tif")