I have data for a network of weather stations across the United States. This gives me a data frame that contains date, latitude, longitude, and some measured value. Assume that data are collected once per day and driven by regional-scale weather (no, we are not going to get into that discussion).
I'd like to show graphically how simultaneously-measured values are correlated across time and space. My goal is to show the regional homogeneity (or lack thereof) of the value that is being investigated.
Data set
To start with, I took a group of stations in the region of Massachusetts and Maine. I selected sites by latitude and longitude from an index file that is available on NOAA's FTP site.
Straight away you see one problem: there are lots of sites that have similar identifiers or are very close. FWIW, I identify them using both the USAF and WBAN codes. Looking deeper in to the metadata I saw that they have different coordinates and elevations, and data stop at one site then start at another. So, because I don't know any better, I have to treat them as separate stations. This means the data contains pairs of stations that are very close to each other.
Preliminary Analysis
I tried grouping the data by calendar month and then calculating the ordinary least squares regression between different pairs of data. I then plot the correlation between all pairs as a line connecting the stations (below). The line color shows the value of R2 from the OLS fit. The figure then shows how the 30+ data points from January, February, etc. are correlated between different stations in the area of interest.
I've written the underlying codes so that the daily mean is only calculated if there are data points every 6-hour period, so data should be comparable across sites.
Problems
Unfortunately, there is simply too much data to make sense of on one plot. That can't be fixed by reducing the size of the lines.
I've tried plotting the correlations between the nearest neighbors in the region, but that turns into a mess very quickly. The facets below show the network without correlation values, using $k$ nearest neighbors from a subset of the stations. This figure was just to test the concept.
The network appears to be too complex, so I think I need to figure out a way to reduce the complexity, or apply some kind of spatial kernel.
I am also not sure what is the most appropriate metric to show correlation, but for the intended (non-technical) audience, the correlation coefficient from OLS might just be the simplest to explain. I may need to present some other information like the gradient or standard error as well.
Questions
I'm learning my way into this field and R at the same time, and would appreciate suggestions on:
- What's the more formal name for what I'm trying to do? Are there some helpful terms that would let me find more literature? My searches are drawing blanks for what must be a common application.
- Are there more appropriate methods to show the correlation between multiple data sets separated in space?
- … in particular, methods that are easy to show results from visually?
- Are any of these implemented in R?
- Do any of these approaches lend themselves to automation?
Best Answer
I think there are a few options for showing this type of data:
The first option would be to conduct an "Empirical Orthogonal Functions Analysis" (EOF) (also referred to as "Principal Component Analysis" (PCA) in non-climate circles). For your case, this should be conducted on a correlation matrix of your data locations. For example, your data matrix
dat
would be your spatial locations in the column dimension, and the measured parameter in the rows; So, your data matrix will consist of time series for each location. Theprcomp()
function will allow you to obtain the principal components, or dominant modes of correlation, relating to this field:The second option would be to create maps that show correlation relative to an individual location of interest:
EDIT: additional example
While the following example doesn't use gappy data, you could apply the same analysis to a data field following interpolation with DINEOF (http://menugget.blogspot.de/2012/10/dineof-data-interpolating-empirical.html). The example below uses a subset of monthly anomaly sea level pressure data from the following data set (http://www.esrl.noaa.gov/psd/gcos_wgsp/Gridded/data.hadslp2.html):
Map the leading EOF mode
Create correlation map