[GIS] Accounting for spatial autocorrelation in rasters

autocorrelation, r, raster

Background:

I have a set of fine resolution rasters for the Florida peninsula. These rasters are going to be used in species distribution modeling. Many of these rasters represent variables that are clearly going to be related, so prior to using them in the model I'm performing a global PCA to generate new rasters. All of my rasters are of continuous data (no binary data or categorical variables).

However, I think I can make the case that it doesn't make sense to perform a global PCA for such a large area where the values of the variables are going to vary dramatically from one part of the state to another. So, I was considering going with a geographically weighted PCA (gwpca) instead, and was thinking that it might make sense to base the radius of the gwpca on the distance of spatial autocorrelation. Ideally, I would determine the distance individually for each pixel, but I don't know of any software that can do this, any paper or website that outlines the process, or even if it's computationally viable if I were to try and code my own solution. Alternatively, the backup plan is to come up with a single estimate and use it across the raster. This spatial autocorrelation distance would also be used to thin out high concentrations of points in my point data that is going to be used for generating the model.

Question:

How do I determine the distance at which spatial autocorrelation occurs? Should I do it on a per pixel, per site, and/or per raster basis?

What I've done so far:

Most of my work is done in R. I also use tools like GDAL and GRASS. I'm on Linux, so I cannot use ESRI GIS products.

I can take a raster and find the global Moran's I. I can also subset it using any given point and a radius, and use this to generate semivariograms and correlograms. I have a basic understanding of sill, nugget, and range, but I am not aware of how I could automate the process of determining the distance over which spatial autocorrelation is a factor.

Best Answer

The autocorrelation of covariates is not a problem in itself. What may be a problem is correlation between covariates at the locations of your observations. In that case, there will be identifiability issues when estimating the parameters of your SDM.
What people usually do is test for correlation between covariates at the observation points. When two covariates are correlated, you can combine them with PCA as you plan to do, or keep only one of the two (the one that makes more biological sense a priori).
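As a rough illustration of that check, here is a minimal Python sketch (in R, `cor()` on the extracted values does the same job). The covariate names, the values, and the 0.7 cutoff are all assumptions made up for the example, not something taken from your data.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlated_pairs(covariates, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    The 0.7 default is an arbitrary illustrative cutoff.
    """
    flagged = []
    for a, b in combinations(sorted(covariates), 2):
        r = pearson(covariates[a], covariates[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical covariate values extracted at six observation points.
at_points = {
    "elevation":   [2.0, 5.0, 9.0, 14.0, 20.0, 31.0],
    "temperature": [24.1, 23.8, 23.5, 23.0, 22.4, 21.6],
    "rainfall":    [130.0, 90.0, 150.0, 110.0, 95.0, 140.0],
}

flagged = correlated_pairs(at_points)
print(flagged)
```

Only the pairs that come back flagged need a decision (combine them, or keep one); the others can go into the model as they are.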
What I do is test a model with the first covariate, then one with the second, and then one with both together. The test is a k-fold cross-validation, so useless additional covariates are not retained: if the correlation between the covariates is too high, only the better one is kept. But if each of them brings a little information, you may want to keep both, even if they are correlated. Environmental covariates always show some correlation, because temperature is linked to altitude, rain is linked to wind, and so on.
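To make the comparison concrete, here is a small Python sketch of the idea on simulated data. It is not the procedure from my package: it assumes a plain linear model, 5 interleaved folds, and RMSE as the score, and the covariates `x1` and `x2` are invented, with `x2` being a noisy copy of `x1`.

```python
import random

def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(beta, row):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))

def cv_rmse(rows, y, k=5):
    """Mean out-of-fold RMSE of a linear model, using k interleaved folds."""
    n = len(y)
    fold_scores = []
    for start in range(k):
        test = list(range(start, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        beta = fit_ols([rows[i] for i in train], [y[i] for i in train])
        sq = [(y[i] - predict(beta, rows[i])) ** 2 for i in test]
        fold_scores.append((sum(sq) / len(sq)) ** 0.5)
    return sum(fold_scores) / k

# Simulated data: x2 is a noisy copy of x1, and the response depends on x1 only.
rng = random.Random(0)
x1 = [rng.uniform(0, 10) for _ in range(100)]
x2 = [v + rng.gauss(0, 0.5) for v in x1]
y = [2.0 * v + rng.gauss(0, 0.3) for v in x1]

candidates = {
    "x1": [(a,) for a in x1],
    "x2": [(b,) for b in x2],
    "x1+x2": list(zip(x1, x2)),
}
scores = {name: cv_rmse(rows, y) for name, rows in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Because the score is computed on held-out folds, adding the redundant covariate should not buy much here, while the model built on the noisy copy alone should score clearly worse.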
Even if you have high-resolution covariates, I would recommend also aggregating them to coarser resolutions and testing combinations of all covariates, at both high and lower resolution. Different resolutions capture effects at different scales, and you may be surprised by the results.
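Producing a lower-resolution version of a covariate comes down to averaging blocks of cells (in R, the raster package's `aggregate()` does this for you). The Python sketch below only shows the principle on a tiny made-up grid; using `None` for no-data cells and a block mean as the summary are assumptions of the example.

```python
def aggregate_mean(grid, factor):
    """Coarsen a 2D grid by averaging non-overlapping factor x factor blocks.

    None marks a no-data cell and is left out of the average. Partial blocks
    at the right and bottom edges are averaged over the cells they contain.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    out = []
    for r0 in range(0, n_rows, factor):
        out_row = []
        for c0 in range(0, n_cols, factor):
            cells = [grid[r][c]
                     for r in range(r0, min(r0 + factor, n_rows))
                     for c in range(c0, min(c0 + factor, n_cols))
                     if grid[r][c] is not None]
            out_row.append(sum(cells) / len(cells) if cells else None)
        out.append(out_row)
    return out

# Hypothetical 4 x 4 covariate grid with one no-data cell.
fine = [
    [1.0, 3.0, 5.0, 7.0],
    [1.0, 3.0, 5.0, 7.0],
    [2.0, 4.0, None, 8.0],
    [2.0, 4.0, 6.0, 8.0],
]
coarse = aggregate_mean(fine, 2)
print(coarse)
```

Each coarse layer can then be treated as one more candidate covariate alongside the original fine-resolution one.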
By the way, the coincidence (but no correlation...) is that I just released an R package on GitHub that may help you do that. I present it on my website, and the "SDM_Selection" vignette shows my own way of doing SDM and covariate selection: https://statnmap.com/sdmselect-package-species-distribution-modelling/

Edit

The model will be built on your point dataset. If there are identifiability or multicollinearity issues, it will be because of your dataset, not the external data you do not use for modeling.

You can also see it the other way around: some environmental variables may be completely unrelated in general, but your sampling plan is biased. For example, imagine that every time you made observations in the forest it was a rainy month, and every time you went to the swamp it was a sunny month. There will then be a correlation between cover type and monthly rainfall. At the same latitude and altitude, let's say there is little chance of a general correlation between these two covariates, but in your dataset they will be correlated. Similarly, you may find a correlation between covariates globally but not in your dataset, because your sampling plan counterbalanced it. Because the correlation is not 100%, there is a chance that some combination of sampling positions shows no correlation between the covariates.
Thus, you need to verify this correlation inside your dataset.
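The forest/swamp example can be reproduced with a short simulation. This Python sketch uses entirely made-up numbers: cover type and monthly rainfall are drawn independently, and the biased plan (forest only in wet months, swamp only in dry months, split at 150 mm) is an assumption chosen to exaggerate the effect.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rng = random.Random(42)
n = 2000

# Hypothetical site-month combinations: the two covariates are independent.
cover = [rng.randint(0, 1) for _ in range(n)]   # 0 = forest, 1 = swamp
rain = [rng.uniform(0, 300) for _ in range(n)]  # monthly rainfall, mm

r_all = pearson(cover, rain)

# Biased sampling plan: forest visited in wet months, swamp in dry months.
sampled = [i for i in range(n)
           if (cover[i] == 0 and rain[i] > 150)
           or (cover[i] == 1 and rain[i] <= 150)]
r_sample = pearson([cover[i] for i in sampled], [rain[i] for i in sampled])

print(round(r_all, 3), round(r_sample, 3))
```

Across all combinations the correlation stays close to zero, while in the biased sample it is strongly negative, which is why the check has to be run on the values at your observation points.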

There is a recent blog post about multicollinearity: https://datascienceplus.com/multicollinearity-in-r/
