Autocorrelation – Conducting Beta Regression with Residual Spatial Auto-Correlation in R using Generalized Additive Model (MGCV)

autocorrelationbeta distributiongeneralized-additive-modelmgcvspatial

I have data on the interval (0,1), that I model using beta regression with the betareg package in R. This works well. However, I would like to account for some level of spatial auto-correlation of the residuals. I have done this in the past for linear models using the gls function of the nlme package in R.

The closest I can get to this is using the gam function within the mgcv package for R. I can specify a beta distribution and fit an extra term that is a spline function of latitude and longitude. Code here:

library(mgcv)
mod <- gam(y ~ x1 + x2 + s(latitude, longitude), 
           family = betar(link='logit'), 
           data = data)

However, this fits y to the spatial relationship, rather than the residuals. How can I go about fitting the residuals? I can do this by adding a correlation structure using the gamm from the mgcv package, and adding an uninformative random effect, however I'm pretty sure this does not play well with beta regression. All of the output I have received is non-sensical (i.e. (-) r2 values, compared to output before spatial autocorrelation was modeled I obtained r2 of 0.20-0.30. It also gives a warning that the "extended family" of distributions in mgcv are not compatible with gamm. So, its not surprising this does not work.

mod <- gamm(y ~ x1 + x2,  
            random = list(rando=~1), 
            family = betar(link='logit'), 
            correlation =  corExp(form = ~longitude + latitude, nugget=T),
            data = data)

Is there any way to account for auto-correlation in a beta-distributed response variable? I imagine this may be hard, as the fitted magnitude of residuals are not always as large for one side of the distribution as the other.

Best Answer

We can think of our observations as arising from some distribution with a mean structure component and a covariance component. Essentially we have

$$y = \boldsymbol{X\beta} + \mathbf{Zb} + \epsilon$$

where $\mathbf{X}$ and $\mathbf{Z}$ are design matrices of fixed and random effects respectively and $\epsilon$ is the unexplained variation.

We can model spatial or temporal autocorrelation by including in our model something that accounts for the spatial or temporal separation of the observations. We can do this either in the fixed/random effects part of the model or in the covariance structure of model.

For example, in a simple linear regression model assuming independent observations we have

$$y \sim \mathcal{N}(\boldsymbol{X\beta}, \sigma_{\epsilon}^2\mathbf{I})$$

where $\mathbf{I}$ is an identity matrix (hence the i.i.d. assumption). We might proceed to include a spatial or temporal correlation effect via $\mathbf{Zb}$, using basis functions, such that our model becomes

$$y \sim \mathcal{N}(\boldsymbol{X\beta} + \mathbf{Zb}, \sigma_{\epsilon}^2\mathbf{I})$$

This is known as the first-order form.

Alternatively we include the spatial or temporal correlation in the covariance function of the random effects, in that case our model might be

$$y = \boldsymbol{X\beta} + \boldsymbol{\eta} + \epsilon$$

where $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma_{\epsilon}^2\mathbf{I})$ and $\boldsymbol{\eta} \sim \mathcal{N}(0, \sigma_{\alpha}^2\mathbf{R})$ where the random effect $\boldsymbol{\eta}$ results in correlated errors, resulting in

$$y \sim \mathcal{N}(\mathbf{X\beta}, \sigma_{\epsilon}^2\mathbf{I} + \sigma_{\alpha}^2\mathbf{R})$$

Here $\mathbf{R}$ is specified via a correlation function such as the exponential correlation function you mentions.

This is known as the second-order form. The second-order form might be more common, especially in spatial statistics with the ecological and environmental sciences, but the first-order form is useful too.

These two forms can be equivalent for some models, where the basis functions in $\boldsymbol{Z}$ can be derived from the correlation function or matrix $\boldsymbol{R}$.

The above was cribbed from sections of Hefley et al (submitted), which is available on arXiv as a preprint.

(A third form might be $y \sim \mathcal{N}(\boldsymbol{X\beta}, \sigma_{\epsilon}^2\mathbf{R})$, which is what gamm() with correlation = corFOO() produces.)

As to why I mention this, I believe you can achieve what you want with gam() via a first-order form model derived from the spline-equivalent of kriging.

For this you would use the following model in R:

mod <- gam(y ~ x1 + x2 + s(latitude, longitude, bs = "gp", m = 2), 
           family = betar(link='logit'), 
           data = data)

Implicit here is that the splines will be treated as random effects (with the components of any penalty null space as fixed effects) but for none-general-family functions you could request this via method = "REML" or "ML". With m = 2, this selects a power exponential correlation function with range $r$ estimated from the data according to the method of Kammann and Wand (2003):

$$\hat{r} = \max_{1 \leq i, j \leq n} \left\lVert x_i - x_j \right\rVert$$

and power = 1. If you want to specify the range and or the power, then you need to supply a vector to m: m = c(2, 100, 1) would be a power exponential function with range parameter 100 and power 1. Other values of the m (or the first element when specified in vector form give different correlation function including spherical and three Matern covariance functions).

The assumption now is that given x and y and the random effects $\boldsymbol{Z}$ (given by the Gaussian process spline == kriging) and any model parameters, the residuals are i.i.d. Whether this is the case will depend on how flexible the Guassian process (kriging) part of the model is.

With this method I don't think specify a nugget and you have to manually specify the range parameter unless you want it to be taken as the largest separation between any two points in the sample. The detail for the implementation is in ?mgcv:::smooth.construct.gp.smooth.spec

You can read more about the first-order and second-order forms in a paper by Hefley et al (submitted).

I will also add that in practice, what you've already done using a thin-plate spline for location is also a first-order form model and hence you might not be able to do much or any better with the GP spline I mention above. Information in Hefley et al (submitted) might direct you at alternative ways to approach this model, perhaps using Bayesian methods where you might have more control of exactly how the spatial structure can be specified.

Hefley, T. J., Broms, K. M., Brost, B. M., Buderman, F. E., Kay, S. L., Scharf, H. R., … Hooten, M. B. (2016, June 17). The basis function approach for modeling autocorrelation in ecological data. arXiv [stat.AP]. Retrieved from http://arxiv.org/abs/1606.05658

Kammann, E. E., & Wand, M. P. (2003). Geoadditive models. Journal of the Royal Statistical Society. Series C, Applied Statistics, 52(1), 1–18. http://doi.org/10.1111/1467-9876.00385