Table Joins – How to Join a Table to a Shapefile with Non-Matching IDs and Names (Similar Strings)?

attribute-joinsdata-framersimilaritytable

I'm having an annoying problem that I'm trying to find an automated solution for. The shorthand version is that I have a shapefile and a table of created data for regions within countries. The created data table does NOT have any sort of standardized GIDs/admin codes to match to shapefiles, and the region names are not exact matches either. Let's take a closer look; here's my dummy data frame + shapefile.

library(rgdal)

#load in shapefile
arm <- readOGR("D:/Country-Shapefiles/ARM_adm_shp", layer = "ARM_adm1")

#create dummy data frame
id <- c(100:110)
name <- c("Aragatsotn", "Ararat", "Armavir", "Gaghark'unik'", "Kotayk", "Lorri", 
          "Shirak", "Syunik'", "Tavush", "Vayots' Dzor", "Yerevan City")
value <- runif(11, 0.0, 1.0)
df <- data.frame(id, name, value)

So what I have is a table with seemingly random IDs, region names, and a value to be plotted with a choropleth map. Looks like this:

> df
    id          name     value
1  100    Aragatsotn 0.6923852
2  101        Ararat 0.5762024
3  102       Armavir 0.4688358
4  103 Gaghark'unik' 0.4702253
5  104        Kotayk 0.9347992
6  105         Lorri 0.1937813
7  106        Shirak 0.5162604
8  107       Syunik' 0.4332389
9  108        Tavush 0.9889513
10 109  Vayots' Dzor 0.2182024
11 110  Yerevan City 0.5791886

Looking at the shapefile attributes of interest, we've got this:

> arm@data[c("ID_1", "NAME_1")]

       ID_1      NAME_1
    0     1  Aragatsotn
    1     2      Ararat
    2     3     Armavir
    3     4      Erevan
    4     5 Gegharkunik
    5     6      Kotayk
    6     7        Lori
    7     8      Shirak
    8     9      Syunik
    9    10      Tavush
    10   11 Vayots Dzor

Ideally, df would include some sort of matching admin IDs to join to the shapefile. Whoever created the data that I'm using did not follow these conventions, unfortunately. Alternatively, it'd be great to match the region names themselves…but as you can see, there are slight variations in each name.

Matching by hand is always a backup solution, but who wants to take the time to do that? 😉 But really, even barring laziness, the project that I'm working on will be mapping dozens and dozens of different countries, so I'm looking for an automated solution that can do everything without having to do anything by hand. Is this possible? Can I somehow match these -almost- region names to the shapefiles?

Sidenote: I'm looking into grepl for partial string matches per this post, but I'm not sure if this is a potential solution because I'll need to draw from the column names rather than inputting each region name by hand.

EDIT: When I match the IDs by hand, what I've done is create a new column in my data frame and adding the exact matching terms from the shapefile. Unfortunately, because of the peculiarities of the data, the order of the names don't match up either, so this still requires some manual input. I'm hoping for some sort of completely automated solution (if it's even possible).

Best Answer

I would go for stringdist package which has implemented many algorithms to calculate the partial similarity (distance) of strings including Jaro-winkler. Here is a fast solution for you:

  #df to be joined
  id <- c(100:111)
  name <- c("Aragatsotn", "Ararat", "Armavir", "Gaghark'unik'", "Kotayk", "Lorri", 
            "Shirak", "Syunik'", "Tavush", "Vayots' Dzor", "Yerevan City","Aragatsotn")
  value <- runif(12, 0.0, 1.0)
  df <- data.frame(id, name, value)

  #create shape data df
  shpNames <- c("Aragatsotn",
               "Ararat",
               "Armavir",
               "Erevan",
               "Gegharkunik",
               "Kotayk",
               "Lori",
               "Shirak",
               "Syunik",
               "Tavush",
               "VayotsDzor")
  arm.data  <- data.frame(ID_1=1:11,NAME_1=shpNames)

  #simple match (only testing)
  match(df$name,arm.data$NAME_1)
  #simple merge (testing)
  merge(arm.data,df,by.x="NAME_1",by.y="name",all.x=TRUE)

  #partial match using stringdist package
  library("stringdist")
  am<-amatch(arm.data$NAME_1,df$name,maxDist = 3)
  b<-data.frame()
  for (i in 1:dim(arm.data)[1]) {
      b<-rbind(b,data.frame(arm.data[i,],df[am[i],]))
  }
  b

it outputs:

ID_1      NAME_1  id          name     value
1     1  Aragatsotn 100    Aragatsotn 0.8510984
2     2      Ararat 101        Ararat 0.3004329
3     3     Armavir 102       Armavir 0.9258740
4     4      Erevan  NA          <NA>        NA
5     5 Gegharkunik 103 Gaghark'unik' 0.9935353
6     6      Kotayk 104        Kotayk 0.6025050
7     7        Lori 105         Lorri 0.9577662
8     8      Shirak 106        Shirak 0.6346550
9     9      Syunik 107       Syunik' 0.6531175
10   10      Tavush 108        Tavush 0.9726032
11   11  VayotsDzor 109  Vayots' Dzor 0.3457315

You can play with maxDist parameter of amatch method. Although 3 works best with your sample data!