[GIS] Using supervised vs unsupervised classification in identification of region of interest

land-cover, machine learning, remote sensing

I was introduced to machine learning and remote sensing recently.
My task was to classify satellite images into vegetation and non-vegetation.
We were introduced to two approaches.

1. Supervised learning, where we had WKT or GeoJSON files made from ground truth. These files contained polygons that were used to train the model. The satellite images came from the WorldView-3 satellite sensor.
2. Unsupervised classification, where the pixels were classified based on NDVI values using clustering models such as K-means and Fuzzy C-means. The satellite images came from Landsat 8.

All of this was virtually spoon-fed, and I took code samples from here and there, but I still fail to understand which method is used where, specifically in the context of crop forecasting.

What is the advantage of collecting ground truth when we can use unsupervised learning to classify the images?

If it is about accuracy, are there any specific examples of how ground truth improves accuracy in crop forecasting?

Best Answer

Both supervised and unsupervised classification methods require some degree of knowledge of the area of interest. The most important considerations are 1) the quality of the spectral data to which the classification algorithm is applied and 2) the level of class detail required.

Unsupervised classification algorithms require the analyst to assign labels and combine classes after the fact into useful information classes (e.g. forest, agricultural, water, etc.). In many cases, this after-the-fact labelling of spectral clusters is difficult or impossible because the clusters contain assemblages of mixed land cover types. Generally speaking, unsupervised classification is useful for quickly assigning labels to uncomplicated, broad land cover classes (water, vegetation/non-vegetation, forested/non-forested, etc.). Furthermore, unsupervised classification may reduce analyst bias.
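As a rough illustration of that workflow, here is a minimal base R sketch: cluster pixels on NDVI with k-means, then label the clusters after the fact. The reflectance values are synthetic and invented for the example (they are not real Landsat 8 data), and the rule "higher mean NDVI means vegetation" is the analyst's interpretation, not something the algorithm knows.

```r
# Synthetic red and near-infrared reflectance for 200 pixels (assumed values):
# 100 vegetated pixels followed by 100 bare pixels.
set.seed(42)
red <- c(runif(100, 0.03, 0.08), runif(100, 0.15, 0.30))
nir <- c(runif(100, 0.40, 0.60), runif(100, 0.20, 0.35))

# NDVI = (NIR - Red) / (NIR + Red)
ndvi <- (nir - red) / (nir + red)

# Unsupervised step: k-means finds two spectral clusters, no labels supplied
km <- kmeans(ndvi, centers = 2, nstart = 10)

# After-the-fact labelling by the analyst: the cluster with the higher
# mean NDVI is interpreted as vegetation
veg_cluster <- which.max(km$centers[, 1])
label <- ifelse(km$cluster == veg_cluster, "vegetation", "non-vegetation")

table(label)
```

With only two broad, well-separated classes the labelling step is trivial. With many clusters of mixed land cover, that same step is where the difficulty described above shows up.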

Supervised classification allows the analyst to fine-tune the information classes, often down to much finer subcategories such as species-level classes. Training data are collected in the field with high-accuracy GPS devices or expertly selected on screen. Consider, for example, classifying percent crop damage in corn fields. A supervised approach is well suited to this type of problem because you can directly measure the percent damage in the field and use these data to train the classification algorithm. Applying the same training data to the output of an unsupervised classification would likely yield more error, because the spectral clusters would contain more mixed pixels than classes produced by the supervised approach. Similarly, collecting crop species training data in the field is preferable to expertly selecting pixels on screen, as it is often very difficult to tell visually which crops are growing.
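To make the role of training data concrete, here is a minimal base R sketch of one simple supervised method, a minimum-distance-to-means classifier. It is chosen for brevity, not because the answer prescribes it. The class names, band values, and the helper functions `make_pixels()` and `classify()` are all hypothetical.

```r
set.seed(1)

# Hypothetical training pixels (red, NIR reflectance) with field-collected labels
make_pixels <- function(n, red_mean, nir_mean, label) {
  data.frame(red = rnorm(n, red_mean, 0.01),
             nir = rnorm(n, nir_mean, 0.01),
             class = label)
}
train <- rbind(make_pixels(50, 0.05, 0.50, "crop"),
               make_pixels(50, 0.20, 0.30, "bare soil"),
               make_pixels(50, 0.03, 0.04, "water"))

# Training: one mean spectral signature per ground-truth class
signatures <- aggregate(cbind(red, nir) ~ class, data = train, FUN = mean)

# Prediction: assign each pixel to the class with the nearest signature
classify <- function(pixels, signatures) {
  bands <- as.matrix(signatures[, c("red", "nir")])
  px <- as.matrix(pixels[, c("red", "nir")])
  nearest <- apply(px, 1, function(p) {
    # squared Euclidean distance from this pixel to every class signature
    which.min(colSums((t(bands) - p)^2))
  })
  as.character(signatures$class[nearest])
}

new_pixels <- data.frame(red = c(0.04, 0.21, 0.03),
                         nir = c(0.52, 0.28, 0.03))
classify(new_pixels, signatures)
```

The classes exist because someone defined and measured them in the field. The classifier only learns their spectral signatures, which is why the quality of the ground truth directly limits the quality of the map.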

I highly recommend reviewing research from Dr. Russell Congalton, who has produced many landmark accuracy assessment papers pertaining to remote sensing classification approaches. Here are some references to get you started:

Congalton, R. G. (1991). A review of assessing the accuracy of classifications of remotely sensed data. Remote Sensing of Environment, 37(1), 35-46.
Congalton, R. G., & Green, K. (2008). Assessing the accuracy of remotely sensed data: Principles and practices. CRC Press.
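Ground truth is also what makes accuracy assessment possible in the first place. Here is a minimal base R sketch of the error matrix (confusion matrix) summaries commonly used for this: overall, producer's, and user's accuracy, plus kappa. The counts are invented for illustration and do not come from any real study.

```r
# Hypothetical reference (ground truth) and classified (map) labels for 100 sites
classes    <- c("vegetation", "non-vegetation")
reference  <- factor(rep(classes, times = c(60, 40)), levels = classes)
classified <- factor(c(rep(classes, times = c(50, 10)),   # the 60 vegetation sites
                       rep(classes, times = c(5, 35))),   # the 40 non-vegetation sites
                     levels = classes)

# Error matrix: rows = classified (map) labels, columns = reference labels
error_matrix <- table(classified = classified, reference = reference)

n <- sum(error_matrix)
overall_accuracy   <- sum(diag(error_matrix)) / n
producers_accuracy <- diag(error_matrix) / colSums(error_matrix)  # 1 - omission error
users_accuracy     <- diag(error_matrix) / rowSums(error_matrix)  # 1 - commission error

# Kappa: agreement beyond what would be expected by chance
chance <- sum(rowSums(error_matrix) * colSums(error_matrix)) / n^2
kappa  <- (overall_accuracy - chance) / (1 - chance)

error_matrix
round(c(overall = overall_accuracy, kappa = kappa), 3)
```

Without independent reference data there is nothing to put in the columns of this matrix, so a purely unsupervised result cannot be given an accuracy figure at all.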

Related Solutions

[GIS] Computing unsupervised random forest classification in R

Random Forests in unlabeled (unsupervised) mode does not return explicit classes but rather something analogous to scaled multivariate distances, based on node proximities. Without the proximity matrix, you do not have a usable unlabeled model. And yes, for large problems, even with a sparse matrix, the very nature of the approach causes the proximity matrix to become huge. This may very well be why you have not seen published approaches using Random Forests in unsupervised remote sensing.

One approach I have seen for deriving and testing clusters from the proximities is to use a modified K-means on the proximity matrix. Alternatively, you may be able to trick the imputation function in the yaImpute package, using its random forests option, into performing a matrix imputation that would return something analogous to a k nearest neighbor (kNN) result, which could then be assigned to clusters based on a similarity matrix.

It is nowhere near as straightforward as you are thinking, and I would encourage you to research this approach before jumping in with both feet.
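For orientation only, here is a minimal sketch of the proximity-based idea on a small table, assuming the randomForest package is installed. It swaps in Ward hierarchical clustering for the modified K-means mentioned above, purely because `hclust()` ships with base R. The iris data and the choice of three clusters are arbitrary.

```r
library(randomForest)

set.seed(123)
x <- iris[, 1:4]   # predictors only; the species column is deliberately dropped

# With no response supplied, randomForest() runs in unsupervised mode
rf <- randomForest(x = x, ntree = 500, proximity = TRUE)

# The proximity matrix is n x n, which is what becomes prohibitive on rasters
dim(rf$proximity)

# Convert proximities (similarities in 0..1) into dissimilarities
d <- as.dist(1 - rf$proximity)

# Swapped-in clustering step: Ward hierarchical clustering, not modified K-means
hc <- hclust(d, method = "ward.D2")
clusters <- cutree(hc, k = 3)

table(clusters)
```

For 150 rows the proximity matrix is trivial, but it grows with the square of the number of observations, which is the scaling problem described above.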

Edit (12/14/2018): A few versions ago I added an unsupervised random forests function to the rfUtilities package. I would not recommend it for large data such as rasters, but it is a useful clustering method. Here is a simple example.

library(rfUtilities)
library(sp)

# Meuse river heavy-metal dataset shipped with sp; drop incomplete rows
data(meuse)
meuse <- na.omit(meuse)

# Cluster the observations into n groups with unsupervised random forests
n <- 6
clust.meuse <- rf.unsupervised(meuse, n = n, proximity = TRUE,
                               silhouettes = TRUE)
( meuse$k <- clust.meuse$k )

# Project the random forest distances into n dimensions for plotting
mds <- stats::cmdscale(clust.meuse$distances, eig = TRUE, k = n)
colnames(mds$points) <- paste("Dim", 1:n)

# One colour per cluster, indexed by cluster id
mds.col <- rainbow(n)[clust.meuse$k]
plot(mds$points[, 1:2], col = mds.col, pch = 20)
pairs(mds$points, col = mds.col, pch = 20)

# Promote to a SpatialPointsDataFrame and map the clusters
coordinates(meuse) <- ~x+y
plot(meuse, col = mds.col, pch = 19)
box()
