How to measure similarity of SpatialLines objects

Tags: geometry, r, similarity

I created two SpatialLines objects in R (see figure).

These objects were created this way:

library(sp)
# x and y are numeric coordinate vectors of equal length (not shown here)
xy <- cbind(x,y)
xy.sp = sp::SpatialPoints(xy)
spl1 <- sp::SpatialLines(list(Lines(Line(xy.sp), ID="a")))
# spl2 was created the same way from a second set of coordinates

Now I want to conclude that these are the same line, just rotated and flipped, and that the difference between them is therefore equal to 0 (i.e. the shapes are equal).

To do that, one can use the maptools package and rotate line #1, e.g.:

spl180 <- maptools::elide(spl1, rotate=180)

Each rotated line must then be checked against line #2 using the rgeos package, e.g.:

hdist <- rgeos::gDistance(spl180, spl2, byid=FALSE, hausdorff=TRUE)

However, this is a very computationally expensive way to match SpatialLines objects, especially when the number of objects is around 1000.
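Spelled out, the brute-force search might look like the following sketch (hypothetical code: it assumes spl1 and spl2 exist, that both lines have already been shifted to a common origin, and that a 1-degree rotation step is acceptable):

angles <- seq(0, 359, by = 1)                         # candidate rotations, in degrees
hdists <- sapply(angles, function(a) {
  spl.rot <- maptools::elide(spl1, rotate = a)        # rotate line #1
  rgeos::gDistance(spl.rot, spl2, hausdorff = TRUE)   # Hausdorff distance to line #2
})
min(hdists)  # small only if some rotation matches; reflections would need a second, mirrored pass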

Is there any clever way to do this job?

P.S. Moreover, the approach described above does not guarantee that all possible rotations and flips are tested.

P.S.2. If line #1 is scaled (zoomed out) with respect to line #2, the difference between lines #1 and #2 must still be equal to 0.

UPDATE: (figure added)

Best Answer

Any truly general-purpose effective method will standardize the representations of the shapes so that they will not change upon rotation, translation, reflection, or trivial changes in internal representation.

One way to do this is to list each connected shape as an alternating sequence of edge lengths and (signed) angles, starting from one end. (The shape should be "clean" in the sense of having no zero-length edges or straight angles.) To make this invariant under reflection, negate all the angles if the first nonzero one is negative.

(Because any connected polyline of n vertices will have n-1 edges separated by n-2 angles, I have found it convenient in the R code below to use a data structure consisting of two arrays, one for the edge lengths ($lengths) and the other for the angles ($angles). A line segment will have no angles at all, so it's important to handle zero-length arrays in such a data structure.)
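For instance, applying the standardize function from the code below to the L-shaped polyline (0,0) → (2,0) → (2,1), stored as a 2-by-n coordinate matrix, should give (up to floating-point rounding; a worked illustration, not part of the original listing):

standardize(matrix(c(0,0, 2,0, 2,1), nrow=2))
# $lengths
# [1] 2 1
#
# $angles
# [1] 1.570796   (the single 90-degree turn, made positive by the sign convention)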

Such representations can be ordered lexicographically. Some allowance should be made for floating-point errors accumulated during the standardization process. An elegant procedure would estimate those errors as a function of the original coordinates. In the solution below, a simpler method is used in which two lengths are considered equal when they differ by a very small amount on a relative basis. Angles may differ only by a very small amount on an absolute basis.

To make them invariant under reversal of the underlying orientation, choose the lexicographically earliest representation between that of the polyline and its reversal.

To handle multi-part polylines, arrange their components in lexicographic order.
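A minimal sketch of that step, assuming a multi-part polyline arrives as a list of 2-by-n coordinate matrices (the code below handles single-part lines only, so this helper is hypothetical):

standardize.multipart <- function(parts) {
  std <- lapply(parts, standardize)                 # standardize each component
  # Ordering on the first edge length stands in for a full lexicographic
  # comparison of the (lengths, angles) sequences.
  std[order(sapply(std, function(s) s$lengths[1]))]
}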

To find the equivalence classes under Euclidean transformations, then:

  • Create standardized representations of the shapes.

  • Perform a lexicographic sort of the standardized representations.

  • Make a pass through the sorted order to identify sequences of equal representations.

The computation time is O(n*log(n)*N), where n is the number of features and N is the largest number of vertices in any feature. This is efficient.

It's probably worth mentioning in passing that a preliminary grouping based on easily calculated invariant geometric properties, such as polyline length, center, and moments about that center, often can be applied to streamline the entire process. One only needs to find subgroups of congruent features within each such preliminary group. The full method given here would be needed only for shapes that otherwise are so remarkably similar that such simple invariants still do not distinguish them. Simple features constructed from raster data might have such characteristics, for instance. However, the solution given here is so efficient anyway that, if one is going to go to the effort of implementing it at all, it might work just fine all by itself.
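For example, a preliminary screen on total polyline length alone might look like this (a sketch that assumes the shapes are stored as 2-by-n coordinate matrices, as in the code below):

total.length <- function(xy) {
  d <- xy[ ,-1, drop=FALSE] - xy[ ,-dim(xy)[2], drop=FALSE]  # edge vectors
  sum(sqrt(colSums(d^2)))                                    # sum of edge lengths
}
# Test for congruence only within groups of (nearly) equal total length.
groups <- split(seq_along(shapes), signif(sapply(shapes, total.length), 8))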


Example

The left-hand figure shows five polylines plus 15 more that were obtained from them via random translation, rotation, reflection, and reversal of internal orientation (which is not visible). The right-hand figure colors them according to their Euclidean equivalence class: all figures of a common color are congruent; figures of different colors are not.

Figure

R code follows. When the inputs were increased to 500 shapes plus 500 extra (congruent) shapes, with a mean of 100 vertices per shape, the execution time on this machine was 3 seconds.

This code is incomplete: because R does not have a native lexicographic sort, and I did not feel like coding one from scratch, I simply perform the sorting on the first edge length of each standardized shape. That will be fine for the random shapes created here, but for production work a full lexicographic sort should be implemented. The function order.shape would be the only one affected by this change. Its input is a list of standardized shapes s and its output is the sequence of indexes into s that would sort it.
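One possible drop-in replacement, offered only as a sketch (it is not part of the listing that follows), pads each shape's (lengths, angles) key to a common length and lets order compare successive columns:

order.shape.lex <- function(s) {
  keys <- lapply(s, function(u) c(u$lengths, u$angles))             # per-shape sort key
  m <- max(sapply(keys, length))
  keys <- lapply(keys, function(k) c(k, rep(-Inf, m - length(k))))  # pad shorter keys
  do.call(order, as.data.frame(do.call(rbind, keys)))               # sort column by column
}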

#
# Create random shapes.
#
n.shapes <- 5      # Unique shapes, up to congruence
n.shapes.new <- 15 # Additional congruent shapes to generate
p.mean <- 5        # Expected number of vertices per shape
set.seed(17)       # Create a reproducible starting point
shape.random <- function(n) matrix(rnorm(2*n), nrow=2, ncol=n)
shapes <- lapply(2+rpois(n.shapes, p.mean-2), shape.random)
#
# Randomly move them around.
#
move.random <- function(xy) {
  a <- runif(1, 0, 2*pi)                # random rotation angle
  reflection <- sign(runif(1, -1, 1))   # reflect with probability 1/2
  translation <- runif(2, -8, 8)        # random shift
  m <- matrix(c(cos(a), sin(a), -sin(a), cos(a)), 2, 2) %*%
    matrix(c(reflection, 0, 0, 1), 2, 2)
  m <- m %*% xy + translation
  if (runif(1, -1, 1) < 0) m <- m[ ,dim(m)[2]:1]  # reverse internal orientation half the time
  return (m)
}
i <- sample(length(shapes), n.shapes.new, replace=TRUE)
shapes <- c(shapes, lapply(i, function(j) move.random(shapes[[j]])))
#
# Plot the shapes.
#
range.shapes <- c(min(sapply(shapes, min)), max(sapply(shapes, max)))
palette(gray.colors(length(shapes)))
par(mfrow=c(1,2))
plot(range.shapes, range.shapes, type="n",asp=1, bty="n", xlab="", ylab="")
invisible(lapply(1:length(shapes), function(i) lines(t(shapes[[i]]), col=i, lwd=2)))
#
# Standardize the shape description.
#
standardize <- function(xy) {
  n <- dim(xy)[2]
  vectors <- xy[ ,-1, drop=FALSE] - xy[ ,-n, drop=FALSE]  # edge vectors
  lengths <- sqrt(colSums(vectors^2))                     # edge lengths
  if (which.min(lengths - rev(lengths))*2 < n) {          # heuristically pick a canonical direction
    lengths <- rev(lengths)
    vectors <- vectors[, (n-1):1]
  }
  if (n > 2) {
    vectors <- vectors / rbind(lengths, lengths)          # unit edge vectors
    perps <- rbind(-vectors[2, ], vectors[1, ])           # their perpendiculars
    angles <- sapply(1:(n-2), function(i) {               # signed angles between consecutive edges
      cosine <- sum(vectors[, i+1] * vectors[, i])
      sine <- sum(perps[, i+1] * vectors[, i])
      atan2(sine, cosine)
    })
    i <- min(which(angles != 0))
    angles <- sign(angles[i]) * angles                    # reflection invariance: first nonzero angle > 0
  } else angles <- numeric(0)                             # a bare segment has no angles
  list(lengths=lengths, angles=angles)
}
shapes.std <- lapply(shapes, standardize)
#
# Sort lexicographically.  (Not implemented: see the text.)
#
order.shape <- function(s) {
  order(sapply(s, function(s) s$lengths[1]))
}
i <- order.shape(shapes.std)
#
# Group.
#
equal.shape <- function(s.0, s.1) {
  same.length <- function(a,b) abs(a-b) <= (a+b) * 1e-8                  # relative tolerance
  same.angle <- function(a,b) min(abs(a-b), abs(abs(a-b)-2*pi)) < 1e-11  # absolute tolerance, allowing 2*pi wrap-around
  r <- function(u) {                    # standardized representation of the reversed polyline
    a <- u$angles
    if (length(a) > 0) {
      a <- rev(u$angles)
      i <- min(which(a != 0))
      a <- sign(a[i]) * a               # re-apply the reflection sign convention
    }
    list(lengths=rev(u$lengths), angles=a)
  }
  e <- function(u, v) {                 # componentwise comparison within tolerance
    if (length(u$lengths) != length(v$lengths)) return (FALSE)
    all(mapply(same.length, u$lengths, v$lengths)) &&
      all(mapply(same.angle, u$angles, v$angles))
    }
  e(s.0, s.1) || e(r(s.0), s.1)         # equal either directly or after reversal
}
g <- rep(1, length(shapes.std))
for (j in 2:length(i)) {
  i.0 <- i[j-1]
  i.1 <- i[j]
  if (equal.shape(shapes.std[[i.0]], shapes.std[[i.1]])) 
    g[j] <- g[j-1] else g[j] <- g[j-1]+1
}
palette(rainbow(max(g)))
plot(range.shapes, range.shapes, type="n",asp=1, bty="n", xlab="", ylab="")
invisible(lapply(1:length(i), function(j) lines(t(shapes[[i[j]]]), col=g[j], lwd=2)))