R – How to Run a Geographically Weighted Logistic Regression

geographically-weighted-regressionlogistic regressionpolygonr

I have the following data set containing socioeconomic variables:

> glimpse(df)
Rows: 730,099
Columns: 9
$ id     <int> 25500, 25501, 25502, 25503, 25504, 25505, 25506, 25507, 25508, 25509, 25510, 25511, 25512, 25513, 25514, 25515, 25516, 255…
$ Prov   <fct> Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al Hoceïma, Al…
$ Age    <dbl> 45, 15, 65, 55, 35, 75, 45, 25, 55, 40, 35, 50, 70, 20, 30, 75, 50, 35, 50, 40, 35, 70, 70, 35, 35, 75, 55, 35, 25, 50, 30…
$ Edu    <dbl> 5, 4, 0, 0, 0, 0, 0, 16, 0, 5, 4, 0, 0, 6, 0, 0, 0, 0, 0, 0, 9, 0, 0, 14, 4, 0, 0, 3, 5, 0, 9, 0, 0, 0, 15, 0, 0, 0, 14, 3…
$ Kids   <int> 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 3, 4, 1, 0, 2, 0, 3, 2, 2, 1, 0, 5, 0, 2, 0, 0, 0, 1, 0, 2, 2, 1, 0, 5, 0, 0, 3, 0, 0, 1, 0,…
$ Mil    <fct> Urban, Rural, Rural, Rural, Rural, Rural, Rural, Rural, Rural, Rural, Rural, Rural, Urban, Urban, Rural, Urban, Urban, Rur…
$ DSize  <dbl+lbl> 2, 1, 4, 4, 1, 1, 3, 2, 3, 4, 1, 6, 3, 2, 5, 3, 4, 2, 4, 2, 3, 6, 3, 2, 4, 4, 3, 2, 2, 4, 1, 4, 6, 4, 5, 5, 4, 2, 5, 3…
$ taille <dbl> 2, 5, 5, 7, 5, 1, 2, 1, 4, 1, 5, 9, 5, 1, 5, 3, 7, 4, 6, 7, 2, 14, 3, 4, 1, 2, 9, 3, 3, 12, 4, 7, 2, 9, 2, 6, 5, 2, 4, 3, …
$ EP     <fct> 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,…

The variable Prov indicates the 72 provinces of the country, it looks like this:

> levels(df$Prov)
 [1] "Agadir-Ida-Ou-Tanane"     "Al Haouz"                 "Al Hoceïma"               "Meknès"                   "Azilal"                  
 [6] "Béni Mellal"              "Benslimane"               "Berkane"                  "Berrechid"                "Boujdour"                
[11] "Boulemane"                "Casablanca"               "Chefchaouen"              "Chichaoua"                "Chtouka-Ait Baha"        
[16] "Driouch"                  "El Hajeb"                 "El Jadida"                "El Kelâa des Sraghna"     "Errachidia"              
[21] "Essaouira"                "Fahs-Anjra"               "Fès"                      "Figuig"                   "Fquih Ben Salah"         
[26] "Guelmim"                  "Guercif"                  "Ifrane"                   "Inezgane-Ait Melloul"     "Jerada"                  
[31] "Kénitra"                  "Khémisset"                "Khénifra"                 "Khouribga"                "Laâyoune"                
[36] "Larache"                  "Marrakech"                "Médiouna"                 "Midelt"                   "Mohammadia"              
[41] "Nador"                    "Nouaceur"                 "Ouarzazate"               "Ouezzane"                 "Oujda-Angad"             
[46] "Rabat"                    "Rehamna"                  "Safi"                     "Salé"                     "Sefrou"                  
[51] "Settat"                   "Sidi Bennour"             "Sidi Ifni"                "Sidi Kacem"               "Sidi Slimane"            
[56] "Skhirate-Témara"          "Tanger-Assilah"           "Taounate"                 "Taourirt"                 "Taroudannt"              
[61] "Tata"                     "Taza"                     "Tétouan"                  "M'Diq-Fnideq"             "Tinghir"                 
[66] "Tiznit"                   "Youssoufia"               "Zagora"                   "Moulay Yacoub"            "Tan-Tan / Assa-Zag"      
[71] "Es-Semara / Tarfaya"      "Oued Ed Dahab / Aousserd"

and I have another shapefile containing the location polygons for each province, it looks like this:

> map_3
Simple feature collection with 72 features and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -17.10496 ymin: 20.7715 xmax: -0.9987581 ymax: 35.92243
Geodetic CRS:  WGS 84
# A tibble: 72 × 2
   PROV                                                                                                     geometry
 * <chr>                                                                                               <POLYGON [°]>
 1 Agadir-Ida-Ou-Tanane ((-9.504189 30.3498, -9.503555 30.34989, -9.503493 30.34991, -9.502746 30.35016, -9.50251...
 2 Al Haouz             ((-7.344661 31.67321, -7.344881 31.67344, -7.345 31.6736, -7.346125 31.67503, -7.346162 3...
 3 Al Hoceïma           ((-3.926041 35.26096, -3.926289 35.26113, -3.926532 35.26142, -3.926776 35.26171, -3.9270...
 4 Azilal               ((-5.908287 32.30598, -5.908415 32.3063, -5.913961 32.32005, -5.915668 32.32325, -5.92306...
 5 Benslimane           ((-7.195987 33.18768, -7.191716 33.18865, -7.169873 33.19075, -7.164904 33.19066, -7.1557...
 6 Berkane              ((-2.313083 34.58991, -2.312776 34.58996, -2.312381 34.59001, -2.308861 34.59052, -2.3085...
 7 Berrechid            ((-7.250729 33.21856, -7.249378 33.22028, -7.246216 33.22429, -7.245882 33.22472, -7.2454...
 8 Boujdour             ((-12.54495 24.46466, -12.49897 24.47041, -12.4321 24.48361, -12.38604 24.49452, -12.3777...
 9 Boulemane            ((-4.202123 32.58081, -4.200447 32.58095, -4.183846 32.58234, -4.17291 32.58488, -4.15931...
10 Casablanca           ((-7.669198 33.49706, -7.668895 33.49699, -7.66875 33.49705, -7.668693 33.49707, -7.66854...
# ℹ 62 more rows
# ℹ Use `print(n = ...)` to see more rows

When I want to run a simple logistic regression I simply run this command and it gives me the results I want:

> model <- glm(EP~Age+Edu+Kids+Mil+DSize+taille,family = binomial(link = "logit"), data = df)
> summary(model)

Call:
glm(formula = EP ~ Age + Edu + Kids + Mil + DSize + taille, family = binomial(link = "logit"), 
    data = df)

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -0.0040682  0.0165088   -0.246    0.805    
Age         -0.0060011  0.0002926  -20.511   <2e-16 ***
Edu         -0.1165767  0.0010738 -108.562   <2e-16 ***
Kids         0.0927567  0.0038183   24.293   <2e-16 ***
MilRural     1.5594703  0.0078742  198.048   <2e-16 ***
DSize       -0.4697581  0.0034187 -137.409   <2e-16 ***
taille      -0.1368148  0.0026621  -51.393   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 613335  on 719910  degrees of freedom
Residual deviance: 497842  on 719904  degrees of freedom
  (10188 observations deleted due to missingness)
AIC: 497856

Number of Fisher Scoring iterations: 6

Now I want to run a Geographically Weighted Logistic Regression, and for that I checked the GWModel package manual, and found the function ggwr.basic, which has an option for logistic regression.

DM<-gw.dist(dp.locat=coordinates(londonhp))
bw.f2 <- bw.ggwr(BATH2~FLOORSZ,data=londonhp, dMat=DM,family ="binomial")
res.binomial<-ggwr.basic(BATH2~FLOORSZ, bw=bw.f2,data=londonhp, dMat=DM,
              family ="binomial")

But in their example, the location variable is a point (X Y coordinates), whereas in my case the locations are polygons, also they have one observation per location (they use the LondonHP data set), whereas I have more than 730,000 observations scattered across 72 provinces.

I was able to deal with location variables being polygons not points:

> head(map_3)
Simple feature collection with 6 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -9.890198 ymin: 30.34965 xmax: -3.802499 ymax: 35.26228
Geodetic CRS:  WGS 84
                  PROV         y      x.x      x.y                       geometry
1 Agadir-Ida-Ou-Tanane  9.809457 46.26182 5.899067 POLYGON ((-9.504189 30.3498...
2             Al Haouz 25.419273 47.62713 2.061563 POLYGON ((-7.344661 31.6732...
3           Al Hoceïma 16.325224 47.36490 2.993311 POLYGON ((-3.926041 35.2609...
4               Azilal 31.601461 48.64625 2.291929 POLYGON ((-5.908287 32.3059...
5          Béni Mellal 17.411668 48.87454 4.155205 POLYGON ((-6.047234 32.7797...
6           Benslimane 18.095435 47.83364 4.523673 POLYGON ((-7.195987 33.1876...

> spdf <- as(map_3, "Spatial")
> DM <- gw.dist(dp.locat = coordinates(spdf)) 
> bw.f2 <- bw.gwr(y~x.x+x.y,data=spdf, dMat=DM)
> res.binomial<-gwr.basic(y~x.x+x.y, bw=bw.f2,data=spdf, dMat=DM)
> summary(res.binomial)
              Length Class                    Mode
GW.arguments  11     -none-                   list
GW.diagnostic  8     -none-                   list
lm            14     lm                       list
SDF           72     SpatialPolygonsDataFrame S4  
timings        5     -none-                   list
this.call      5     -none-                   call
Ftests         0     -none-                   list

But I still can't find anything online similar to my problem, all the resources I found either use proportions as dependent variables, or assign a Yes/No to a location, but nothing like my problem.

Therefore, how can I run a Geographically Weighted Logistic Regression model on my data? I just gave the GWModel package as an example; if you know a different package or function that can get the job done, please feel free to use it.

Best Answer

I am going to go ahead a speak up here and apologize in advance for being a wet blanket. First, I find it quite dubious to use a logistic model for GWR (as with the method itself but, that is another discussion). It would be far too easy to end up with a homogeneous response on the y-side of the equation. GWR can be though of as a spatial kernel type regression so, fits a series of localized regressions that are then integrated to a global estimate, sort of. Since no first-order effect is fit to the data it is completely reliant on the local second-order fits. If, within a local kernel, all observations are the same value a model cannot be fit. This is commonly called the collinearity problem which is incorrect, it is related simply to data homogeneity. In standard GWR models this most often occurs in the design matrix but, would be considered particularly problematic if it occurred in the dependent variable.

Second, a GWR model should only ever be used when spatial nonstationarity (second-order spatial autocorrelation) is observed in the data. With a binomial process, this is a highly doubtful outcome and would be very difficult to quantify in the first place. Even with point data that has high first-order autocorrelation, I do not see GWR as a valid model because the first-order trend is global in nature whereas GWR is a local statistic, with no way of addressing the first-order trend (spatial or aspatial) that would result in a relevant representation of spatial process. I would also point out a logical fallacy, if you have high autocorrelation in binomial data, that would benefit from a GWR, you would also have considerable local homogeneity (look at Join-Counts literature on autocorrelation in binomial processes) which would most certainly result in homogeneous values within a give local fit. This very well may not be the case for multinomial data but, at that point, with most spatial-type data/inference, you are essentially dealing with a discrete/nominal process which is another can of worms in regard to spatial process.

Third, you simply cannot apply a GWR to polygon data. The neighbor contingency just does not work out in your favor. With point data, the assumption is that it is a independent, discrete spatial observation driven by some sort of distance relationship. However, with polygon data it is a demarcation of a discrete area that is non-uniform. The distance relationship is predicated by the size of the polygon and would be arbitrary. This is an influence that simply cannot be accounted for in the GWR modeling framework. The way that it is addressing nonstationarity just would not work for this type of data. Given enough thought you may be able to somehow use Nth-order neighbor contingency, along with area weights, in the local fits. However, GWR has been shown to be such an unstable model that I cannot imagine that this type of modification would be statistically justified.

Honestly, it is looking like you are wanting a mixed effects model with your random effect controlling for province or, if you do not want to control for province and are concerned over autocorrelation, an autoregressive model fit with an MCMC. You could also use a spatially lagged dependent variable as a random effect in a mixed effects model, which gets you around the issue of a lagged variable being correlated with anything in the design matrix (which is what necessitates the MCMC with autoregressive models).

Related Question