QGIS – How to Determine Which Raster Attribute Covers the Highest Area of a Polygon

qgisrasterzonal statistics

I'm relatively new to GIS and QGIS but I have a problem that I'm not certain how to solve. I searched here for relevant information but found none. (Or, I used incorrect search terms.) My problem is thus:

I have a raster file of crop data determined from satellite imagery. Pixel size is 30 meter x 30 meter.

I have a shapefile with over 270,000 polygons of varying sizes.

I can get general 'zonal statistics' through QGIS but these are relatively meaningless: count, mean, and sum don't help me in determining the predominant–by area–crop type within a given polygon.

Ideally, I'd like to be able to determine which crop type covers the highest area in every polygon. As an example, in the image below, the parcel in the center of the image has at least six crop types within it's boundaries. However, clearly there is one predominant crop type. How can I associate that single crop type with that parcel? Is this possible?

An example image

I'm using QGIS version 2.4.0 on Windows 7.

If there is an R solution, I would enjoy hearing that, too.

Best Answer

Here is an R solution that uses the mode of the observed distribution.

Create some example data

require(raster)
  r <- raster(ncol=36, nrow=18)
    r[] <- sample(1:3, ncell(r), replace=TRUE)
      polys <- SpatialPolygons(list(Polygons(list(Polygon(rbind(c(-180,-20),c(-160,5),
                 c(-60, 0),c(-160,-60),c(-180,-20)))), 1), 
                 Polygons(list(Polygon(rbind(c(80,0),c(100,60),c(120,0),
                 c(120,-55),c(80,0)))), 2)))
    polys <- SpatialPolygonsDataFrame(polys, data.frame(row.names=c("1","2"),
                                      ID=1:2))
plot(r)
  plot(polys, col="red", add=TRUE)

Function to calculate peak mode of distribution. This could alternatively, be done by fitting a spline to x. The density approach is just a nice shortcut.

dmode <- function(x) {
  den <- density(x, kernel=c("gaussian"))
    ( den$x[den$y==max(den$y)] )   
} 

Here we extract values for all polygons and calculate mode value using the dmode function. Note; the base R function "mode" returns the storage type of the object class (e.g., numeric, character) and not the distributional mode. Since the mode is a function of the peak distribution we need to round it back to a whole number so it corresponds to a "real" value in the observed distribution. If the raster data were floating point or you wanted a true hinge point, then this would not be necessary.

( ep <- extract(r, polys) )
  ep <- lapply(sp, function(x) x[!is.na(x)]) # remove NA values
    ( x <- round(unlist(lapply(ep, FUN=dmode)),digits=0) )

Since you have a large number of polygons you may want to implement a memory safe for loop that processes one polygon at a time. However, this may be slow.

x <- vector()
  for(i in 1:nrow(polys)){
    p <- polys[i,]
      v <- extract(r, p)
        v <- lapply(v, function(x) x[!is.na(x)]) 
          x <- append(x, round(unlist(lapply(v, FUN=dmode)),digits=0) )
    }
      cat("Mode(s):", x,  "\n")

You can then join the resulting "mode" values back to each polygon and plot results.

polys@data$MajClass <- x 
cols <- ifelse(polys@data$MajClass == 1, "red",
          ifelse(polys@data$MajClass == 2, "green",
            ifelse(polys@data$MajClass == 3, "blue", NA)))
  plot(r, legend=FALSE)
    plot(polys, col=cols, add=TRUE)
      legend("bottomright", legend=c(unique(x)), fill=cols)

I would add, another efficient way to calculate frequencies is the table function. Let's create a vector of 1,2,3, the table function will then return counts of each value in the vector.

x <- c(1,1,1,2,2,3,3,3,3,3)
 table(x)

We can then return the class name associated with the most frequent value.

names(which.max(table(x)))

In the context of the problem at hand, we can use lapply function to the extracted raster values.

( ep <- extract(r, polys) )
  ep <- lapply(sp, function(x) x[!is.na(x)])
    ( x <- unlist(lapply(ep, FUN=function(x) { as.numeric(names(which.max(table(x)))) })) ) 

An exact frequency "count" approach will give a more precise answer. Let's take an example where extreme values are existing in the same polygon and compare mode and count.

x=c(1,1,1,2,2,255,255,255,255,255)
  round(dmode(x),digits=0)
    names(which.max(table(x)))
Related Question