R – How to Extract Non-Spatial Data from Imported KML File Using R

kmlrrgdalsp

I have a KML file which was created using Google's My Maps.

The original file can be downloaded here:
Google My Maps

Using R, I can import this using the "readOGR" function of the rgdal library
This brings the KML file in as a SpatialPointsDataFrame (SPDF) – which i am calling asf52

![RStudio Data Pane

In this SPDF, the spatial data is contained under @coords and is readily extracted into a dataframe using code like

df  <- data.frame(asf52@coords[,1:2])

However, I am struggling to come up with a way to neatly extract the the non-spatial data – contained under @data$Description – and turn it into a dataframe with a column for each variable.

Best Answer

You don't need to call data.frame() around the extract - the @data slot already is a data.frame. Just do

  df <- asf52@data

to pull out a copy. That said, you may be better served by using the newer sf library for this task:

library(sf)
ob_kml <- file.path(getwd(), 'Outbreaks 56 (OIE).kml')    

There is more than one layer in your KML - list them with e.g.

st_layers(ob_kml)

Use read_sf() with the layers argument to choose your point data specifically and read it in. read_sf() defaults to stringsAsFactors = FALSE which may be preferable.

asf_c <- read_sf(ob_kml, layer = 'ASF in China.xlsx')

To get a plain dataframe, just drop the geometry as follows:

asf_c_df <- st_set_geometry(asf_c, NULL)

EDIT: I see your secondary issue now; it looks like neither sf nor sp look at the <ExtendedData> tags that hold the attribute data you want (open the KML in Notepad++ if you want to see what I mean). QGIS does detect and import them as separate attribute columns, so @Jella's advice is sound. I'm not sure if the issue here lies with sf/sp or GDAL, but it may be worth raising an issue of the sf github page.

In the meantime, your instinct to go with tidyr functions is sound, its just a little tricky to get a clean separation. The following looks pretty good:

asf_c_df <- st_set_geometry(asf_c, NULL) %>%
  # remove duplicate <br> tags
  dplyr::mutate(Description = gsub('<br><br>', '<br>', Description)) %>%
  # split on <br>
  tidyr::separate(., col = Description, 
                  into = c('Date', 'Province', 'City', 'County', 'Location', 
                           'Total_herd_size', 'Affected_animals', 'Deaths',
                           'Culled', 'Latitude', 'Longitude', 'Source'), 
                  sep = '<br>') %>%
  # ditch the key: part of key: value
  dplyr::mutate_all(., funs(gsub('^.*: ', '', .))) %>%
  # data type fixes 
  dplyr::mutate_at(vars(7:10), as.integer) %>%
  dplyr::mutate_at(vars(11,12), as.numeric) %>%
  # bonus points: proper dates. First, fix September, then cast to Date datatype
  dplyr::mutate(Date = gsub('Sept', 'Sep', Date),
                Date = as.POSIXct(Date, format = '%b %d, %Y')) %>%
  # double bonus! proper NA for missing data
  dplyr::mutate_if(is.character, funs(ifelse(. == '', NA, .)))