I am trying to create a logistic regression model with mgcv::gam
with what I think is a simple decision boundary, but the model
I build performs very poorly. A local regression model built
using locfit::locfit on the same data finds the boundary very easily.
I want to add additional parametric regressors to my real-life model, so
I do not want to switch to a purely local regression.
I want to understand why GAM is having trouble fitting the data,
and whether there was ways of specifying the smooths that
can perform better.
Here's a simplified, reproducible example:
Ground truth is 1 = point lies within the unit circle, 0 if outside
e.g. z = 1 if sqrt(x^2 + y^2) <= 1, 0 otherwise
The observed data is noisy, with both false positives and false negatives
Construct a logistic regression to predict whether a point
is inside the circle or not, based on the point's Cartesian
coordinates.
Local regression can find the boundary well (50% probability contour
is very close to the unit circle), but a logistic GAM consistently
overestimates the size of the circle for the same probability band.
library(ggplot2)
library(locfit)
library(mgcv)
library(plotrix)
set.seed(0)
radius <- 1 # actual boundary
n <- 10000 # data points
jit <- 0.5 # noise factor
# Simulate random data, add polar coordinates
df <- data.frame(x=runif(n,-3,3), y=runif(n,-3,3))
df$r <- with(df, sqrt(x^2+y^2))
df$theta <- with(df, atan(y/x))
# Noisy indicator for inside the boundary
df$inside <- with(df, ifelse(r < radius + runif(nrow(df),-jit,jit),1,0))
# Plot data, shows ragged edge
(ggplot(df, aes(x=x, y=y, color=inside)) + geom_point() + coord_fixed() + xlim(-4,4) + ylim(-4,4))
### Model boundary condition using x,y coordinates
### local regression finds the boundary pretty accurately
m.locfit <- locfit(inside ~ lp(x,y, nn=0.3), data=df, family="binomial")
plot(m.locfit, asp=1, xlim=c(-2,-2,2,2))
draw.circle(0,0,1, border="red")
### But GAM fits very poorly, also tried with fx=TRUE but didn't help
m.gam <- gam(inside ~ s(x,y), data=df, family=binomial)
plot(m.gam, trans=plogis, se=FALSE, rug=FALSE)
draw.circle(0,0,1, border="red")
### gam.check doesn't indicate a problem with the model itself
gam.check(m.gam)
Method: UBRE Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [5.41668e-10,5.41668e-10]
(score -0.815746 & scale 1).
Hessian positive definite, eigenvalue range [0.0002169789,0.0002169789].
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(x,y) 29.000 13.795 0.973 0.08
#### Try using polar coordinates
### Again, locfit works well
m.locfit2 <- locfit(inside ~ lp(r, nn=0.3), data=df, family="binomial")
plot(m.locfit2)
abline(v=1, col="red")
### But GAM misses again
m.gam2 <- gam(inside ~ s(r, k=50), data=df, family=binomial)
plot(m.gam2, se=FALSE, rug=FALSE, trans=plogis)
abline(v=1, col="red")
### Can also plot gam on link scale for alternate view
plot(m.gam2, se=FALSE, rug=FALSE)
abline(v=1, col="red")
gam.check(m.gam2)
Method: UBRE Optimizer: outer newton
full convergence after 4 iterations.
Gradient range [-3.29203e-08,-3.29203e-08]
(score -0.8240065 & scale 1).
Hessian positive definite, eigenvalue range [7.290233e-05,7.290233e-05].
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(r) 49.000 10.537 0.979 0.06
Best Answer
You are ignoring the model intercept when evaluating the model fit. The
plot
method shows the fitted spline, but the model includes a parametric constant term, just like the intercept in a standard logistic regression model.Instead, predict from the fitted model using the
predict()
method for locations on a grid of locations over the interval. For example:which gives
Using a
te()
smoother seems to do a bit better thans()
and I usedmethod = "REML"
as this can help with situations where the objective function in GCV/UBRE-based selection can become flat (and hence these methods can undersmooth), in case that was the problem here.