Solved – Why does this logistic GAM fit so poorly

Tags: generalized-additive-model, loess, logistic, mgcv, r

I am trying to create a logistic regression model with mgcv::gam
with what I think is a simple decision boundary, but the model
I build performs very poorly. A local regression model built
using locfit::locfit on the same data finds the boundary very easily.
I want to add additional parametric regressors to my real-life model, so
I do not want to switch to a purely local regression.

I want to understand why the GAM is having trouble fitting the data,
and whether there are ways of specifying the smooths that
perform better.

Here's a simplified, reproducible example:

The ground truth is 1 if the point lies within the unit circle and 0 if it lies outside,

i.e. z = 1 if sqrt(x^2 + y^2) <= 1, 0 otherwise

The observed data is noisy, with both false positives and false negatives

Construct a logistic regression to predict whether a point
is inside the circle or not, based on the point's Cartesian
coordinates.

Local regression can find the boundary well (50% probability contour
is very close to the unit circle), but a logistic GAM consistently
overestimates the size of the circle for the same probability band.

library(ggplot2)
library(locfit)
library(mgcv)
library(plotrix)

set.seed(0)
radius <- 1 # actual boundary
n <- 10000 # data points
jit <- 0.5 # noise factor

# Simulate random data, add polar coordinates
df <- data.frame(x=runif(n,-3,3), y=runif(n,-3,3))
df$r <- with(df, sqrt(x^2+y^2))
df$theta <- with(df, atan2(y, x)) # atan2 keeps the quadrant; atan(y/x) folds angles into (-pi/2, pi/2)

# Noisy indicator for inside the boundary
df$inside <- with(df, ifelse(r < radius + runif(nrow(df),-jit,jit),1,0))

# Plot data, shows ragged edge
(ggplot(df, aes(x=x, y=y, color=inside)) + geom_point() + coord_fixed() + xlim(-4,4) + ylim(-4,4))

### Model boundary condition using x,y coordinates

### local regression finds the boundary pretty accurately
m.locfit <- locfit(inside ~ lp(x,y, nn=0.3), data=df, family="binomial")
plot(m.locfit, asp=1, xlim=c(-2,-2,2,2))
draw.circle(0,0,1, border="red")

### But GAM fits very poorly, also tried with fx=TRUE but didn't help
m.gam <- gam(inside ~ s(x,y), data=df, family=binomial)
plot(m.gam, trans=plogis, se=FALSE, rug=FALSE)
draw.circle(0,0,1, border="red")

### gam.check doesn't indicate a problem with the model itself
gam.check(m.gam)

Method: UBRE   Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [5.41668e-10,5.41668e-10]
(score -0.815746 & scale 1).
Hessian positive definite, eigenvalue range [0.0002169789,0.0002169789].

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

           k'    edf k-index p-value
s(x,y) 29.000 13.795   0.973    0.08

#### Try using polar coordinates

### Again, locfit works well
m.locfit2 <- locfit(inside ~ lp(r, nn=0.3), data=df, family="binomial")
plot(m.locfit2)
abline(v=1, col="red")

### But GAM misses again
m.gam2 <- gam(inside ~ s(r, k=50), data=df, family=binomial)
plot(m.gam2, se=FALSE, rug=FALSE, trans=plogis)
abline(v=1, col="red")

### Can also plot gam on link scale for alternate view
plot(m.gam2, se=FALSE, rug=FALSE)
abline(v=1, col="red")

gam.check(m.gam2)

Method: UBRE   Optimizer: outer newton
full convergence after 4 iterations.
Gradient range [-3.29203e-08,-3.29203e-08]
(score -0.8240065 & scale 1).
Hessian positive definite, eigenvalue range [7.290233e-05,7.290233e-05].

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

         k'    edf k-index p-value
s(r) 49.000 10.537   0.979    0.06

Best Answer

You are ignoring the model intercept when evaluating the model fit. The plot method shows the fitted spline, but the model includes a parametric constant term, just like the intercept in a standard logistic regression model.

Instead, use the predict() method to predict from the fitted model over a grid of locations covering the range of the data. For example:

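# Tensor-product smooth of x and y, with REML smoothness selection
m.gam <- gam(inside ~ te(x, y), data = df, family = binomial, method = "REML")
# 100 evenly spaced values along each axis
locs <- with(df,
             data.frame(x = seq(min(x), max(x), length.out = 100),
                        y = seq(min(y), max(y), length.out = 100)))
# All 100 x 100 combinations of x and y
pred <- expand.grid(locs)
# Fitted probabilities; these include the model intercept
pred <- transform(pred,
                  fitted = predict(m.gam, newdata = pred, type = "response"))
contour(locs$x, locs$y, matrix(pred$fitted, ncol = 100))
draw.circle(0, 0, 1, border = "red")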

which gives a contour plot of the fitted probabilities with the unit circle drawn in red.

Using a te() smoother seems to do a bit better than s(). I also used method = "REML", in case the problem here was smoothness selection: the objective function in GCV/UBRE-based selection can become flat, and those methods can then undersmooth.
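The same point can be checked on the polar-coordinate model from the question. The following is a minimal sketch, not part of the original answer: it re-simulates a smaller version of the data (n = 2000 is an arbitrary choice to keep it quick), fits s(r), and compares the curve you get from the centred smooth alone with the one that includes the intercept. The names grid, smooth_only and r50_* are made up for illustration.

```r
library(mgcv)

# Re-simulate a smaller version of the question's data (assumption: n = 2000)
set.seed(0)
n <- 2000
df <- data.frame(x = runif(n, -3, 3), y = runif(n, -3, 3))
df$r <- with(df, sqrt(x^2 + y^2))
df$inside <- as.integer(df$r < 1 + runif(n, -0.5, 0.5))

m <- gam(inside ~ s(r), data = df, family = binomial, method = "REML")

grid <- data.frame(r = seq(0, 3, length.out = 301))

# Centred smooth only: this is what plot.gam() draws by default
smooth_only <- as.numeric(predict(m, newdata = grid, type = "terms")[, "s(r)"])
# Parametric intercept, which the default plot leaves out
intercept <- unname(coef(m)["(Intercept)"])
# Full linear predictor = intercept + smooth
full_link <- as.numeric(predict(m, newdata = grid, type = "link"))

p_smooth <- plogis(smooth_only)  # misleading "probability" without the intercept
p_full   <- plogis(full_link)    # actual fitted probability

# Grid value of r closest to the 50% crossing for each curve
r50_smooth <- grid$r[which.min(abs(p_smooth - 0.5))]
r50_full   <- grid$r[which.min(abs(p_full - 0.5))]

plot(grid$r, p_full, type = "l", ylim = c(0, 1),
     xlab = "r", ylab = "P(inside)")
lines(grid$r, p_smooth, lty = 2)   # dashed: smooth without intercept
abline(v = 1, col = "red")
abline(h = 0.5, lty = 3)

# plot.gam() can also add the intercept itself via its shift argument
plot(m, trans = plogis, shift = intercept, se = FALSE, rug = FALSE)
abline(v = 1, col = "red")
```

Most of the simulated points lie outside the circle, so I would expect the intercept to be negative. Leaving it out then pushes the apparent 50% crossing outward, which matches the "circle looks too big" symptom in the question.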

Related Solutions

Solved – When and why would you not want to use a GAM

You could take this to the extreme and ask why we wouldn't use a non-parametric model like $k$-NN regression. Actually, the opposite question, "Why would anyone use KNN for regression?", has been asked, and you can check it for a more detailed discussion. You can also make the question broader and ask why we wouldn't use more complicated models instead of simpler ones. For example, why would anyone use logistic or linear regression if they could use a neural network?

The two main reasons for preferring simple models are:

Interpretability. Simple models like linear regression are directly interpretable, which is not necessarily the case for more complicated models. This may be desirable in some disciplines (e.g. medicine), and even required by law in others (e.g. finance).
Overfitting. More complicated models are more prone to overfitting, especially with small sample sizes. A complicated model may simply memorize the training dataset and fail to generalize.

As noted in the comments, this also seems to be discussed in the following thread: When to use a GAM vs GLM.

As a side note, using a model that is linear in its parameters is not that big a constraint. You can easily extend a linear model with polynomial components to model complex relationships, and this may even outperform neural networks in some cases (see Cheng et al, 2018 [arXiv:1806.0685]).
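As a small illustration of that last point, here is a sketch using the circle data from the question above (re-simulated with n = 2000, which is my choice, not the original's). A plain logistic regression is still linear in its parameters once x^2 and y^2 are added, yet it can represent a circular boundary. The names m_lin, m_quad and r50 are hypothetical, and the radius calculation is only a rough approximation that ignores the (small) linear terms.

```r
# Simulated circle data, as in the question but smaller (assumption: n = 2000)
set.seed(0)
n <- 2000
df <- data.frame(x = runif(n, -3, 3), y = runif(n, -3, 3))
df$inside <- as.integer(sqrt(df$x^2 + df$y^2) < 1 + runif(n, -0.5, 0.5))

# Linear terms only: the decision boundary is a straight line
m_lin <- glm(inside ~ x + y, data = df, family = binomial)

# Adding squared terms keeps the model linear in its parameters.
# glm() may warn about fitted probabilities of 0 or 1 far from the circle.
m_quad <- glm(inside ~ x + y + I(x^2) + I(y^2), data = df, family = binomial)

# The quadratic model is nested above the linear one, so its deviance is lower
c(linear = deviance(m_lin), quadratic = deviance(m_quad))

# Rough implied radius of the 50% contour, ignoring the linear terms:
# b0 + b2 * r^2 = 0  =>  r = sqrt(-b0 / b2)
b <- coef(m_quad)
b2 <- mean(c(b[["I(x^2)"]], b[["I(y^2)"]]))
r50 <- sqrt(-b[["(Intercept)"]] / b2)
r50
```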

r – Investigating Why GAM Fits Change Significantly with Random Effects in mgcv

You might be better off using the tw() family to fit a Tweedie model, at least then you'll have a proper likelihood and can use things like AIC etc to compare fits.

You're generating your confidence intervals incorrectly. That they include negative values should have been a big warning sign, as negative values simply aren't plausible for this response. You should use predict(..., type = "link", se.fit = TRUE), compute the confidence interval on the link scale, and then transform the fitted values and the upper and lower limits of the interval back to the response scale using the inverse of the link function (here you would just apply exp() to the fitted values and to the upper and lower limits).
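A minimal sketch of that recipe follows. I don't have the asker's data, so this simulates a positive response with mgcv's rTweedie() and fits a tw() model with its default log link. The data frame dat, the variable names and the simulation settings are all assumptions made for illustration.

```r
library(mgcv)

# Simulated stand-in data (assumption: not the asker's ratios_wide)
set.seed(1)
n <- 200
dat <- data.frame(time = runif(n, 0, 10))
mu <- exp(0.5 + 0.3 * sin(dat$time))
dat$y <- rTweedie(mu, p = 1.5, phi = 1)

m <- gam(y ~ s(time), data = dat, family = tw(), method = "REML")

newd <- data.frame(time = seq(0, 10, length.out = 100))

# Predict on the link scale, with standard errors
p <- predict(m, newdata = newd, type = "link", se.fit = TRUE)

# Inverse link taken from the fitted model (exp() for the log link)
ilink <- family(m)$linkinv
crit <- qnorm(0.975)  # approximate 95% pointwise interval

# Build the interval on the link scale, then back-transform all three
newd$fitted <- ilink(as.numeric(p$fit))
newd$lower  <- ilink(as.numeric(p$fit - crit * p$se.fit))
newd$upper  <- ilink(as.numeric(p$fit + crit * p$se.fit))

# With a log link the back-transformed limits cannot be negative
range(newd$lower)
```

Taking the inverse link from family(m) rather than hard-coding exp() means the same code keeps working if the link function is changed.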

If you have 94 subjects and only 123 observations, that implies that for most of the subjects you have a single observation. By including the random effect you are soaking up (modelling) some of the variation in the response that is due to individuals (subjects) but because variation between subjects is pretty much all you have in your data set (you have very little data to inform the within-Subject part of the model) there's little left for the effect of time to model. Hence the flat fitted lines.

Try plotting your data faceted by Subject:

ggplot(ratios_wide, aes(x = time, y = ratio_SII)) +
  geom_point() +
  facet_wrap( ~ id)

and see if there is much in the way of change over time in those plots. That should help you understand why you get such different time smooths when you include or exclude the random effect.

Also, the name of your response variable, ratio_SII, implies you've divided your original response variable by another value. If you did, can you explain what the original data and the thing you divided it by represent? This is a mistake I often see. People start with a count response but need to normalise it by some other variable (to account for effort or some such), so they divide the nice integer count data by this value. They then have to model a continuous variable, which leads them to things like quasipoisson models, when they could have just stuck with the original counts, fitted a Poisson model, and used an offset to account for the thing they need to normalise by.
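To make the offset suggestion concrete, here is a hedged sketch on simulated data. The variables count, effort and time are invented for the example and may not correspond to anything in your data set.

```r
library(mgcv)

# Simulated counts whose expectation scales with sampling effort (assumption)
set.seed(2)
n <- 300
dat <- data.frame(time = runif(n, 0, 10), effort = runif(n, 1, 20))
dat$count <- rpois(n, lambda = dat$effort * exp(-1 + 0.15 * dat$time))

# Model the raw counts; offset(log(effort)) makes the smooth describe
# the log rate per unit of effort instead of the log count
m <- gam(count ~ s(time) + offset(log(effort)),
         data = dat, family = poisson, method = "REML")

# An offset in the formula is used at prediction time, so newdata needs
# an effort column; effort = 1 gives the rate per unit effort
newd1 <- data.frame(time = seq(0, 10, length.out = 50), effort = 1)
rate <- as.numeric(predict(m, newdata = newd1, type = "response"))

# Doubling effort doubles the expected count
newd2 <- transform(newd1, effort = 2)
expected2 <- as.numeric(predict(m, newdata = newd2, type = "response"))
head(cbind(rate, expected2))
```

This keeps the response as integer counts, so the Poisson likelihood (and AIC-type comparisons) stays valid, which is not the case after dividing the counts by effort.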
