Spatial Autocorrelation – Including Latitude and Longitude in a GAM to Account for Spatial Autocorrelation

Tags: autocorrelation, generalized-additive-model, modeling, r, spatial

I have produced generalized additive models for deforestation. To account for spatial autocorrelation, I have included latitude and longitude as a smoothed interaction term (i.e. s(x,y)).

I've based this on reading many papers where the authors say 'to account for spatial autocorrelation, coordinates of points were included as smoothed terms' but these have never explained why this actually accounts for it. It's quite frustrating. I've read all the books I can find on GAMs in the hope of finding an answer, but most (e.g. Generalized Additive Models, an Introduction with R, S.N. Wood) just touch on the subject without explaining.

I'd really appreciate it if someone could explain WHY the inclusion of latitude and longitude accounts for spatial autocorrelation, and what 'accounting' for it really means – is it simply enough to include it in the model, or should you compare a model with s(x,y) in and a model without? And does the deviance explained by the term indicate the extent of spatial autocorrelation?

Best Answer

The main issue in any statistical model is the set of assumptions that underlie the inference procedure. In the sort of model you describe, the residuals are assumed independent. If the observations have some spatial dependence and this is not modelled in the systematic part of the model, the residuals from that model will also exhibit spatial dependence; in other words, they will be spatially autocorrelated. Such dependence would invalidate the theory that produces p-values from test statistics in the GAM, for example; you can't trust the p-values because they were computed assuming independence.

You have two main options for handling such data; i) model the spatial dependence in the systematic part of the model, or ii) relax the assumption of independence and estimate the correlation between residuals.

i) is what is being attempted by including a smooth of the spatial locations in the model. ii) requires estimation of the correlation matrix of the residuals, often during model fitting, using a procedure like generalised least squares. How well either of these approaches deals with the spatial dependence will depend upon the nature and complexity of the spatial dependence and how easily it can be modelled.
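As a rough illustration of option ii), the sketch below fits a generalised least squares model with an exponential spatial correlation structure using nlme::gls(). The data are simulated and every name in it (dat, resp, z, the range 0.2) is an assumption made for the example, not something from the question.

```r
library(nlme)  # recommended package that ships with R; provides gls() and corExp()

set.seed(1)
n <- 150
# Hypothetical data: coordinates x, y and one covariate z
dat <- data.frame(x = runif(n), y = runif(n), z = rnorm(n))

# Simulate spatially correlated errors from an exponential correlation function
d   <- as.matrix(dist(dat[, c("x", "y")]))
R   <- exp(-d / 0.2)
err <- drop(t(chol(R)) %*% rnorm(n))
dat$resp <- 1 + 0.5 * dat$z + err

# Same mean structure, with and without an estimated residual correlation
fit_iid <- gls(resp ~ z, data = dat, method = "REML")
fit_exp <- gls(resp ~ z, data = dat, method = "REML",
               correlation = corExp(form = ~ x + y))

# Compare the independence model with the spatially correlated one
anova(fit_iid, fit_exp)
```

Here the mean structure is left alone and the dependence is absorbed by the estimated correlation matrix, which is the opposite trade-off to adding s(x, y).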

In summary, if you can model the spatial dependence between observations then the residuals are more likely to be independent random variables and therefore not violate the assumptions of any inferential procedure.
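One way to see what "accounting for" the dependence means in practice is to compare the residual autocorrelation of a model with and without the spatial smooth. The following is a minimal sketch on simulated data; the helper morans_i() is a hypothetical hand-rolled Moran's I with inverse-distance weights, written here only so that the example needs nothing beyond mgcv.

```r
library(mgcv)

set.seed(42)
n <- 300
dat <- data.frame(x = runif(n), y = runif(n))
# Simulated response: a smooth spatial pattern plus independent noise
dat$resp <- sin(2 * pi * dat$x) * cos(2 * pi * dat$y) + rnorm(n, sd = 0.3)

# Moran's I using inverse-distance weights (diagonal set to zero)
morans_i <- function(r, coords) {
  w <- 1 / as.matrix(dist(coords))
  diag(w) <- 0
  r <- r - mean(r)
  (length(r) / sum(w)) * sum(w * outer(r, r)) / sum(r^2)
}

fit_null    <- gam(resp ~ 1, data = dat, method = "REML")
fit_spatial <- gam(resp ~ s(x, y), data = dat, method = "REML")

coords    <- dat[, c("x", "y")]
i_null    <- morans_i(residuals(fit_null), coords)
i_spatial <- morans_i(residuals(fit_spatial), coords)

# Residual Moran's I for each model
c(without_sxy = i_null, with_sxy = i_spatial)
```

If the smooth has captured the spatial pattern, the residual Moran's I of the second model should sit much closer to its null expectation of -1/(n - 1) than that of the first.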

Related Solutions

Solved – ML model selection for prediction of latitude and longitude

Here are my suggestions regarding your problem:

Yes, I bet your problem is non-linear, but I recommend trying linear models first. First, this gives you a baseline: your final model should outperform the linear one. Second, running linear models and inspecting their weights sometimes provides intuition about the problem. Finally, there truly are cases where linear models outperform non-linear ones, depending on the amount of data, the number of features, and the problem domain.
In other words, try the simpler idea first. For example, considering spatial autocorrelation is a good idea, but you don't have to if your model works well without it.
By "prediction of latitude and longitude" I assume you are solving a regression problem where the output is two real, bounded values.
- A first attempt would be to treat latitude and longitude separately, using two unrelated models to predict each of them.
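The "two unrelated linear models" baseline could look something like the sketch below. The features f1 and f2, the simulated coordinates and the train/test split are all made up for illustration; nothing here comes from the original question's data.

```r
set.seed(7)
n <- 200
# Hypothetical features and simulated targets
dat <- data.frame(f1 = rnorm(n), f2 = rnorm(n))
dat$lat <- 40  + 0.8 * dat$f1 - 0.3 * dat$f2 + rnorm(n, sd = 0.1)
dat$lon <- -75 + 0.2 * dat$f1 + 0.6 * dat$f2 + rnorm(n, sd = 0.1)

train <- dat[1:150, ]
test  <- dat[151:200, ]

# One independent linear model per coordinate
fit_lat <- lm(lat ~ f1 + f2, data = train)
fit_lon <- lm(lon ~ f1 + f2, data = train)

pred <- cbind(lat = predict(fit_lat, newdata = test),
              lon = predict(fit_lon, newdata = test))

# Hold-out RMSE per coordinate: the baseline any later model should beat
rmse <- sqrt(colMeans((pred - as.matrix(test[, c("lat", "lon")]))^2))
rmse
```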

Autocorrelation – Conducting Beta Regression with Residual Spatial Auto-Correlation in R using Generalized Additive Model (MGCV)

We can think of our observations as arising from some distribution with a mean structure component and a covariance component. Essentially we have

$$y = \boldsymbol{X\beta} + \mathbf{Zb} + \epsilon$$

where $\mathbf{X}$ and $\mathbf{Z}$ are design matrices of fixed and random effects respectively and $\epsilon$ is the unexplained variation.

We can model spatial or temporal autocorrelation by including in our model something that accounts for the spatial or temporal separation of the observations. We can do this either in the fixed/random effects part of the model or in the covariance structure of the model.

For example, in a simple linear regression model assuming independent observations we have

$$y \sim \mathcal{N}(\boldsymbol{X\beta}, \sigma_{\epsilon}^2\mathbf{I})$$

where $\mathbf{I}$ is an identity matrix (hence the i.i.d. assumption). We might proceed to include a spatial or temporal correlation effect via $\mathbf{Zb}$, using basis functions, such that our model becomes

$$y \sim \mathcal{N}(\boldsymbol{X\beta} + \mathbf{Zb}, \sigma_{\epsilon}^2\mathbf{I})$$

This is known as the first-order form.

Alternatively, we can include the spatial or temporal correlation in the covariance function of the random effects; in that case our model might be

$$y = \boldsymbol{X\beta} + \boldsymbol{\eta} + \epsilon$$

where $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma_{\epsilon}^2\mathbf{I})$ and $\boldsymbol{\eta} \sim \mathcal{N}(0, \sigma_{\alpha}^2\mathbf{R})$ where the random effect $\boldsymbol{\eta}$ results in correlated errors, resulting in

$$y \sim \mathcal{N}(\boldsymbol{X\beta}, \sigma_{\epsilon}^2\mathbf{I} + \sigma_{\alpha}^2\mathbf{R})$$

Here $\mathbf{R}$ is specified via a correlation function such as the exponential correlation function you mention.

This is known as the second-order form. The second-order form might be more common, especially in spatial statistics with the ecological and environmental sciences, but the first-order form is useful too.
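For readers who want the intermediate step, the marginal covariance in the second-order form follows from treating $\boldsymbol{\eta}$ and $\boldsymbol{\epsilon}$ as independent (an assumption that is implicit in the formulation above). The exponential correlation function is written here in one common parameterisation, with $d_{ij}$ the distance between observations $i$ and $j$ and $r$ a range parameter; both symbols are introduced only for this illustration.

$$\operatorname{E}(y) = \boldsymbol{X\beta}, \qquad \operatorname{Var}(y) = \operatorname{Var}(\boldsymbol{\eta}) + \operatorname{Var}(\boldsymbol{\epsilon}) = \sigma_{\alpha}^2\mathbf{R} + \sigma_{\epsilon}^2\mathbf{I}, \qquad R_{ij} = \exp\left(-\frac{d_{ij}}{r}\right)$$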

These two forms can be equivalent for some models, where the basis functions in $\boldsymbol{Z}$ can be derived from the correlation function or matrix $\boldsymbol{R}$.
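As a small worked example of that equivalence, the sketch below builds basis functions from the eigendecomposition of an exponential correlation matrix, so that $\mathbf{Zb}$ with $\mathbf{b} \sim \mathcal{N}(0, \sigma_{\alpha}^2\mathbf{I})$ has covariance $\sigma_{\alpha}^2\mathbf{R}$. The coordinates, the range of 0.3 and the rank k = 10 are arbitrary choices for illustration.

```r
set.seed(3)
n <- 50
coords <- cbind(runif(n), runif(n))   # hypothetical locations
d <- as.matrix(dist(coords))
R <- exp(-d / 0.3)                    # exponential correlation, assumed range 0.3

# Eigendecomposition R = Q L Q'; take Z = Q L^(1/2) so that Z Z' = R
e <- eigen(R, symmetric = TRUE)
Z <- e$vectors %*% diag(sqrt(pmax(e$values, 0)))

# Reconstruction error; should be close to machine precision
recon_err <- max(abs(Z %*% t(Z) - R))
recon_err

# Keeping only the leading k columns gives a reduced-rank basis
k  <- 10
Zk <- Z[, 1:k]
dim(Zk)
```

Truncating to the first few columns is what turns the full second-order model into a cheaper, approximate first-order one.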

The above was cribbed from sections of Hefley et al (submitted), which is available on arXiv as a preprint.

(A third form might be $y \sim \mathcal{N}(\boldsymbol{X\beta}, \sigma_{\epsilon}^2\mathbf{R})$, which is what gamm() with correlation = corFOO() produces.)

As to why I mention this, I believe you can achieve what you want with gam() via a first-order form model derived from the spline-equivalent of kriging.

For this you would use the following model in R:

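library(mgcv)  # provides gam(), the "gp" smooth basis and the betar() family

# Beta regression with a Gaussian process smooth of location (m = 2: power exponential)
mod <- gam(y ~ x1 + x2 + s(latitude, longitude, bs = "gp", m = 2),
           family = betar(link = "logit"),
           data = data)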

Implicit here is that the splines will be treated as random effects (with the components of any penalty null space as fixed effects), but for non-general-family functions you could request this explicitly via method = "REML" or "ML". With m = 2, this selects a power exponential correlation function with range $r$ estimated from the data according to the method of Kammann and Wand (2003):

$$\hat{r} = \max_{1 \leq i, j \leq n} \left\lVert x_i - x_j \right\rVert$$

and power = 1. If you want to specify the range and/or the power, then you need to supply a vector to m: m = c(2, 100, 1) would be a power exponential function with range parameter 100 and power 1. Other values of m (or of its first element when specified in vector form) give different correlation functions, including the spherical and three Matérn covariance functions.
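To make the vector form of m concrete, here is a hedged sketch on simulated data. The data frame, the coordinate scale of 0 to 100, the precision phi = 20 and the covariate effects are all invented for the example; only the s(..., bs = "gp", m = c(2, 100, 1)) term and the betar() family come from the text above.

```r
library(mgcv)

set.seed(5)
n <- 200
# Hypothetical locations on a 100 x 100 grid of units, plus one covariate
dat <- data.frame(latitude  = runif(n, 0, 100),
                  longitude = runif(n, 0, 100),
                  x1        = rnorm(n))

# Simulate a beta-distributed response with a smooth spatial signal
eta <- 0.5 * dat$x1 + sin(dat$latitude / 20) + cos(dat$longitude / 25)
mu  <- plogis(eta)
phi <- 20
dat$y <- rbeta(n, mu * phi, (1 - mu) * phi)

# Power exponential GP smooth with range fixed at 100 and power 1
mod_gp <- gam(y ~ x1 + s(latitude, longitude, bs = "gp", m = c(2, 100, 1)),
              family = betar(link = "logit"),
              data = dat, method = "REML")

summary(mod_gp)
```

Changing the second element of m and comparing the REML scores of the fits is one informal way to explore how sensitive the model is to the chosen range.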

The assumption now is that given x and y and the random effects $\boldsymbol{Z}$ (given by the Gaussian process spline == kriging) and any model parameters, the residuals are i.i.d. Whether this is the case will depend on how flexible the Gaussian process (kriging) part of the model is.

With this method I don't think you can specify a nugget, and you have to manually specify the range parameter unless you want it to be taken as the largest separation between any two points in the sample. The detail for the implementation is in ?mgcv:::smooth.construct.gp.smooth.spec

You can read more about the first-order and second-order forms in a paper by Hefley et al (submitted).

I will also add that in practice, what you've already done using a thin-plate spline for location is also a first-order form model and hence you might not be able to do much or any better with the GP spline I mention above. Information in Hefley et al (submitted) might direct you at alternative ways to approach this model, perhaps using Bayesian methods where you might have more control of exactly how the spatial structure can be specified.

Hefley, T. J., Broms, K. M., Brost, B. M., Buderman, F. E., Kay, S. L., Scharf, H. R., … Hooten, M. B. (2016, June 17). The basis function approach for modeling autocorrelation in ecological data. arXiv [stat.AP]. Retrieved from http://arxiv.org/abs/1606.05658

Kammann, E. E., & Wand, M. P. (2003). Geoadditive models. Journal of the Royal Statistical Society. Series C, Applied Statistics, 52(1), 1–18. http://doi.org/10.1111/1467-9876.00385
