Solved – Showing spatial and temporal correlation on maps

data visualizationpcarregressionspatial

I have data for a network of weather stations across the United States. This gives me a data frame that contains date, latitude, longitude, and some measured value. Assume that data are collected once per day and driven by regional-scale weather (no, we are not going to get into that discussion).

I'd like to show graphically how simultaneously-measured values are correlated across time and space. My goal is to show the regional homogeneity (or lack thereof) of the value that is being investigated.

Data set

To start with, I took a group of stations in the region of Massachusetts and Maine. I selected sites by latitude and longitude from an index file that is available on NOAA's FTP site.

enter image description here

Straight away you see one problem: there are lots of sites that have similar identifiers or are very close. FWIW, I identify them using both the USAF and WBAN codes. Looking deeper in to the metadata I saw that they have different coordinates and elevations, and data stop at one site then start at another. So, because I don't know any better, I have to treat them as separate stations. This means the data contains pairs of stations that are very close to each other.

Preliminary Analysis

I tried grouping the data by calendar month and then calculating the ordinary least squares regression between different pairs of data. I then plot the correlation between all pairs as a line connecting the stations (below). The line color shows the value of R2 from the OLS fit. The figure then shows how the 30+ data points from January, February, etc. are correlated between different stations in the area of interest.

correlation between daily data during each calendar month

I've written the underlying codes so that the daily mean is only calculated if there are data points every 6-hour period, so data should be comparable across sites.

Problems

Unfortunately, there is simply too much data to make sense of on one plot. That can't be fixed by reducing the size of the lines.

I've tried plotting the correlations between the nearest neighbors in the region, but that turns into a mess very quickly. The facets below show the network without correlation values, using $k$ nearest neighbors from a subset of the stations. This figure was just to test the concept.
enter image description here

The network appears to be too complex, so I think I need to figure out a way to reduce the complexity, or apply some kind of spatial kernel.

I am also not sure what is the most appropriate metric to show correlation, but for the intended (non-technical) audience, the correlation coefficient from OLS might just be the simplest to explain. I may need to present some other information like the gradient or standard error as well.

Questions

I'm learning my way into this field and R at the same time, and would appreciate suggestions on:

  1. What's the more formal name for what I'm trying to do? Are there some helpful terms that would let me find more literature? My searches are drawing blanks for what must be a common application.
  2. Are there more appropriate methods to show the correlation between multiple data sets separated in space?
  3. … in particular, methods that are easy to show results from visually?
  4. Are any of these implemented in R?
  5. Do any of these approaches lend themselves to automation?

Best Answer

I think there are a few options for showing this type of data:

The first option would be to conduct an "Empirical Orthogonal Functions Analysis" (EOF) (also referred to as "Principal Component Analysis" (PCA) in non-climate circles). For your case, this should be conducted on a correlation matrix of your data locations. For example, your data matrix dat would be your spatial locations in the column dimension, and the measured parameter in the rows; So, your data matrix will consist of time series for each location. The prcomp() function will allow you to obtain the principal components, or dominant modes of correlation, relating to this field:

res <- prcomp(dat, retx = TRUE, center = TRUE, scale = TRUE) # center and scale should be "TRUE" for an analysis of dominant correlation modes)
#res$x and res$rotation will contain the PC modes in the temporal and spatial dimension, respectively.

The second option would be to create maps that show correlation relative to an individual location of interest:

C <- cor(dat)
#C[,n] would be the correlation values between the nth location (e.g. dat[,n]) and all other locations. 

EDIT: additional example

While the following example doesn't use gappy data, you could apply the same analysis to a data field following interpolation with DINEOF (http://menugget.blogspot.de/2012/10/dineof-data-interpolating-empirical.html). The example below uses a subset of monthly anomaly sea level pressure data from the following data set (http://www.esrl.noaa.gov/psd/gcos_wgsp/Gridded/data.hadslp2.html):

library(sinkr) # https://github.com/marchtaylor/sinkr

# load data
data(slp)

grd <- slp$grid
time <- slp$date
field <- slp$field

# make anomaly dataset
slp.anom <- fieldAnomaly(field, time)

# EOF/PCA of SLP anom
P <- prcomp(slp.anom, center = TRUE, scale. = TRUE)

expl.var <- P$sdev^2 / sum(P$sdev^2) # explained variance
cum.expl.var <- cumsum(expl.var) # cumulative explained variance
plot(cum.expl.var)

Map the leading EOF mode

# make interpolation
require(akima)
require(maps)

eof.num <- 1
F1 <- interp(x=grd$lon, y=grd$lat, z=P$rotation[,eof.num]) # interpolated spatial EOF mode


png(paste0("EOF_mode", eof.num, ".png"), width=7, height=6, units="in", res=400)
op <- par(ps=10) #settings before layout
layout(matrix(c(1,2), nrow=2, ncol=1, byrow=TRUE), heights=c(4,2), widths=7)
#layout.show(2) # run to see layout; comment out to prevent plotting during .pdf
par(cex=1) # layout has the tendency change par()$cex, so this step is important for control

par(mar=c(4,4,1,1)) # I usually set my margins before each plot
pal <- jetPal
image(F1, col=pal(100))
map("world", add=TRUE, lwd=2)
contour(F1, add=TRUE, col="white")
box()

par(mar=c(4,4,1,1)) # I usually set my margins before each plot
plot(time, P$x[,eof.num], t="l", lwd=1, ylab="", xlab="")
plotRegionCol()
abline(h=0, lwd=2, col=8)
abline(h=seq(par()$yaxp[1], par()$yaxp[2], len=par()$yaxp[3]+1), col="white", lty=3)
abline(v=seq.Date(as.Date("1800-01-01"), as.Date("2100-01-01"), by="10 years"), col="white", lty=3)
box()
lines(time, P$x[,eof.num])
mtext(paste0("EOF ", eof.num, " [expl.var = ", round(expl.var[eof.num]*100), "%]"), side=3, line=1) 

par(op)
dev.off() # closes device

enter image description here

Create correlation map

loc <- c(-90, 0)
target <- which(grd$lon==loc[1] & grd$lat==loc[2])
COR <- cor(slp.anom)
F1 <- interp(x=grd$lon, y=grd$lat, z=COR[,target]) # interpolated spatial EOF mode


png(paste0("Correlation_map", "_lon", loc[1], "_lat", loc[2], ".png"), width=7, height=5, units="in", res=400)

op <- par(ps=10) #settings before layout
layout(matrix(c(1,2), nrow=2, ncol=1, byrow=TRUE), heights=c(4,1), widths=7)
#layout.show(2) # run to see layout; comment out to prevent plotting during .pdf
par(cex=1) # layout has the tendency change par()$cex, so this step is important for control

par(mar=c(4,4,1,1)) # I usually set my margins before each plot
pal <- colorRampPalette(c("blue", "cyan", "yellow", "red", "yellow", "cyan", "blue"))
ncolors <- 100
breaks <- seq(-1,1,,ncolors+1)
image(F1, col=pal(ncolors), breaks=breaks)
map("world", add=TRUE, lwd=2)
contour(F1, add=TRUE, col="white")
box()

par(mar=c(4,4,0,1)) # I usually set my margins before each plot
imageScale(F1, col=pal(ncolors), breaks=breaks, axis.pos = 1)
mtext("Correlation [R]", side=1, line=2.5)
box()

par(op)

dev.off() # closes device

enter image description here

Related Question