Solved – How to model time-series temperature data at multiple sites as a function of data at one site

Tags: multivariate analysis, regression, spatio-temporal, time series

I am new to time series analysis, and would appreciate any suggestions on how best to approach the following time-series regression problem: I have hourly temperature measurements at approximately 20 locations across one site over three years, along with static ancillary information (slope, elevation, aspect, canopy cover). The site is several hectares in size, and the temperature recording devices are spread across the site along a couple of transects, at ~20-50 m intervals. About 1 km away, I have hourly data from a weather station, which also provides measurements of wind speed, wind direction, humidity, solar illumination, etc.

I would like to be able to predict the temperature (min, max, mean) at the site (in general) using only the data from the weather station; it is in place semi-permanently, whereas the temperature recorders at the site were only in place for 3 years. So in essence I have multiple independent variables (temperature, humidity, wind, etc.) at one location (the weather station), but a single dependent variable (temperature) at multiple locations, each of which also has several time-invariant attributes: slope, elevation, aspect, etc.

I am most interested in predicting the daily lows and highs at the site in general, rather than hourly temperatures at each temperature recording location in the site, although those hourly predictions would certainly be of value.

My initial approach has been to compute the daily mean, minimum, and maximum of the temperatures at the site, and use these as dependent variables in simple linear regressions, with the measurements available at the weather station as independent variables. This works reasonably well (R² > 0.50 with 2 predictors), but seems too simplistic for many reasons, and I imagine there must be more sophisticated (and powerful) ways to do this.
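For concreteness, the daily-aggregation approach described above might look roughly like the following in base R. This is only a sketch on simulated stand-in data: the data frames `station` and `site`, every column name, the helper `daily_stat`, and the choice of predictors are hypothetical, not taken from the real data set.

```r
# Simulated hourly data; all names and numbers below are made up for illustration.
set.seed(1)
n_days <- 60
time <- seq(as.POSIXct("2020-06-01 00:00:00", tz = "UTC"),
            by = "hour", length.out = n_days * 24)
hour <- as.numeric(format(time, "%H"))
daily_level <- rep(cumsum(rnorm(n_days)), each = 24)   # slow day-to-day drift

station <- data.frame(
  time     = time,
  temp     = 15 + daily_level + 6 * sin(2 * pi * (hour - 9) / 24) +
             rnorm(length(time), sd = 0.5),
  humidity = 60 - 2 * daily_level + rnorm(length(time), sd = 3)
)
# Site temperature = mean over all loggers, simulated as a damped copy of the station.
site <- data.frame(
  time = time,
  temp = 1 + 0.8 * station$temp + rnorm(length(time), sd = 0.7)
)

station$date <- as.Date(station$time, tz = "UTC")
site$date    <- as.Date(site$time, tz = "UTC")

# Collapse one hourly column to one value per day.
daily_stat <- function(df, column, fun, new_name) {
  out <- aggregate(df[[column]], by = list(date = df$date), FUN = fun)
  names(out)[2] <- new_name
  out
}

daily <- Reduce(
  function(a, b) merge(a, b, by = "date"),
  list(daily_stat(site,    "temp",     min,  "site_tmin"),
       daily_stat(station, "temp",     min,  "stn_tmin"),
       daily_stat(station, "humidity", mean, "stn_humidity"))
)

# Simple linear regression of the site's daily minimum on two station predictors.
fit <- lm(site_tmin ~ stn_tmin + stn_humidity, data = daily)
summary(fit)$r.squared
```

The same pattern applies to daily maxima or means by swapping `min` for `max` or `mean` in the calls to `daily_stat`.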

For one, I'm not doing anything explicit about the time-series nature of the daily values in the regression, and although the min or average temp from one day to the next may not be as correlated as it is from one hour to the next, I wonder about issues with the independence of these daily data (or certainly hourly, if I were trying to predict hourly temperatures). Second, due to concerns with having multiple somewhat-correlated temperature measurements across the site (they are much more similar among themselves than any are to the weather station data), I am simply using the mean or min or max of all measurements across the site, versus including the data from each individual measurement location directly. But this also prevents me from using the time-invariant ancillary information from each temperature measurement location (slope, elevation, aspect, canopy cover), which presumably will explain a good part of the differences in temperatures between locations at the site. Third, due to concerns with the regression being dominated by the very strong diurnal cycle in temperatures, I'm only looking at daily values instead of hourly.

Any suggestions on better ways to go about this (especially in R), or where to start looking, would be most appreciated! I realize there are a lot of R packages that deal with time series, but I'm having trouble finding the best place to start with this type of problem, as none of the examples I've seen really seem to reflect the situation I'm trying to model here.

Update: thinking about this a bit more, it is not clear to me whether time-series models are really appropriate here, because I am not interested in predicting what will happen at some specific future point in time. Rather, I'm simply interested in how temperatures at the site are related to temperatures (and other environmental variables) at the weather station. I thought that perhaps time-series analysis would be of value because I was concerned that successive temperature measurements might not be sufficiently independent. Certainly, one hour's temperature depends a great deal on the previous hour, but the dependence is weaker for daily data. In either case, is the time-correlation/non-independence of time-series data a valid concern that should be addressed if one is not interested in a time-series prediction?
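One way to probe the independence concern is to look at the autocorrelation of the regression residuals, and to compare the ordinary fit with one that allows AR(1) errors. The sketch below does this on simulated daily data using `nlme::gls`; the variable names, the simulated coefficients, and the AR(1) error structure are all assumptions chosen for illustration, and a real series may call for a different correlation structure.

```r
library(nlme)  # recommended package, ships with standard R installations

# Simulated daily data with autocorrelated errors; all values are made up.
set.seed(42)
n <- 200
stn_temp <- 15 + as.numeric(arima.sim(list(ar = 0.7), n = n, sd = 2))
err      <- as.numeric(arima.sim(list(ar = 0.6), n = n, sd = 1))
d <- data.frame(day       = seq_len(n),
                stn_temp  = stn_temp,
                site_temp = 2 + 0.8 * stn_temp + err)

# Ordinary least squares, which assumes independent errors.
ols <- lm(site_temp ~ stn_temp, data = d)

# Lag-1 autocorrelation of the OLS residuals (element 1 of $acf is lag 0).
r1 <- acf(resid(ols), plot = FALSE)$acf[2]

# Generalized least squares with AR(1) errors ordered by day.
ar1 <- gls(site_temp ~ stn_temp, data = d,
           correlation = corAR1(form = ~ day))

# Estimated AR(1) parameter and the two sets of standard errors.
phi    <- coef(ar1$modelStruct$corStruct, unconstrained = FALSE)
se_ols <- summary(ols)$coefficients["stn_temp", "Std. Error"]
se_gls <- summary(ar1)$tTable["stn_temp", "Std.Error"]
c(lag1_resid_acf = r1, phi = unname(phi), se_ols = se_ols, se_gls = se_gls)
```

If the residual autocorrelation is close to zero, the plain regression is probably adequate for describing the relationship. If it is large, the standard errors and p-values from `lm` should be treated with caution even when no forecasting is intended.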

Best Answer

You may want to examine the `gam` package in R (generalized additive models), as it can be adapted to do some (or all) of what you are looking for. The original paper (Hastie & Tibshirani, 1986) is available via open access if you're up for reading it.

Essentially, you model a single dependent variable as an additive combination of 'smooth' functions of the predictors. One typical use is to take time series and their lags as predictors, smooth these inputs, and then fit the GAM.
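A minimal sketch of such a model is shown below. Note that it uses the `mgcv` package (which ships with R) rather than the `gam` package, so the function names and arguments follow `mgcv`. The data are simulated, and the variable names, the one-day lag, and the choice of smooth terms are assumptions made for illustration only.

```r
library(mgcv)  # recommended package, ships with standard R installations

# Simulated daily data; the relationships below are invented for the example.
set.seed(7)
n <- 300
stn_temp  <- 15 + 8 * sin(2 * pi * seq_len(n) / 365) + rnorm(n, sd = 2)
stn_wind  <- rgamma(n, shape = 2, rate = 0.5)
site_tmax <- 5 + 0.9 * stn_temp - 1.5 * log1p(stn_wind) + rnorm(n, sd = 1)

d <- data.frame(
  site_tmax     = site_tmax,
  stn_temp      = stn_temp,
  stn_wind      = stn_wind,
  stn_temp_lag1 = c(NA, head(stn_temp, -1))  # previous day's station temperature
)
d <- na.omit(d)  # the first day has no lagged value

# Additive model: one smooth function per predictor, including the lagged term.
fit <- gam(site_tmax ~ s(stn_temp) + s(stn_temp_lag1) + s(stn_wind),
           data = d, method = "REML")
summary(fit)

# Predictions for station conditions (here simply the first five rows).
pred <- predict(fit, newdata = d[1:5, ])
pred
```

Calling `plot(fit, pages = 1)` draws the estimated smooth for each predictor, which helps to judge whether a simple linear term would have been sufficient.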

This method has been used extensively to estimate daily mortality as a function of smoothed environmental time series, especially pollutants. It's not open access, but Dominici et al. (2000) is a superb reference, and "Statistical Methods for Environmental Epidemiology with R" is an excellent book on how to use R to do this type of analysis.
