Solved – How to tune smoothing in mgcv GAM model

Tags: mgcv, r, smoothing

I am trying to figure out how to control the smoothing parameters in an mgcv::gam model.

I have a binomial variable I am trying to model as primarily a function of x and y coordinates on a fixed grid, plus some other variables with more minor influences. In the past I have constructed a reasonably good local regression model using package locfit and just the (x,y) values.

However, I want to try incorporating the other variables into the model, and it looked like generalized additive models (GAM) were a good possibility. After looking at packages gam and mgcv, both of which have a gam() function, I opted for the latter since a number of comments in mailing list threads seem to recommend it. One downside is that it doesn't seem to support a local regression smoother like loess or locfit.

To start, I just wanted to try to replicate approximately the locfit model, using just (x,y) coordinates. I tried with both regular and tensor product smooths:

my.gam.te <- gam(z ~ te(x, y), 
      family=binomial(logit), data=my.data, 
      scale = -1)  

my.gam.s  <- gam(z ~  s(x, y), 
      family=binomial(logit), data=my.data, 
      scale = -1)

However, plotting the predictions from the model, they are much much more smoothed compared to the locfit model. So I've been trying to tune the model to not oversmooth as much. I've tried adjusting the parameters sp and k, but it's not clear to me how they affect the smoothing. In locfit, the nn parameter controls the span of the neighborhood used, with smaller values allowing for less smoothing and more "wiggling", which helps to capture some areas on the grid where the probability of the binomial outcomes changes rapidly. How would I go about setting up the gam model to enable it to behave similarly?

Best Answer

The k argument effectively sets the dimensionality of the smoothing matrix (the basis dimension) for each term. gam() is using a GCV or UBRE score to select an optimal amount of smoothness, but it can only work within the dimensionality of the smoothing matrix. By default, te() smooths use k = 5 for each margin, i.e. 5^2 = 25 for 2d surfaces. I forget what it is for s() so check the documentation. The current advice from Simon Wood, author of mgcv, is that if the degree of smoothness selected by the model is at or close to the limit of the dimensionality imposed by the value used for k, you should increase k and refit the model to see if a more complex model is selected from the higher dimensional smoothing matrix.
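As a rough sketch of that "increase k and refit" check: the data below are simulated purely for illustration (the grid, the bump-shaped probability surface and the object names m.k5, m.k10 are all assumptions, not the asker's data), and the fits use gam()'s default smoothness selection.

library(mgcv)

set.seed(1)
# Hypothetical stand-in for my.data: a binary outcome on a 40 x 40 grid
my.data <- expand.grid(x = seq(0, 1, length.out = 40),
                       y = seq(0, 1, length.out = 40))
# Assumed true surface: low background probability with one sharp local bump
p <- plogis(-1 + 4 * exp(-((my.data$x - 0.3)^2 + (my.data$y - 0.7)^2) / 0.01))
my.data$z <- rbinom(nrow(my.data), size = 1, prob = p)

# Default marginal basis: k = 5 per margin, 5^2 = 25 in total
m.k5  <- gam(z ~ te(x, y),         family = binomial, data = my.data)
# Larger marginal basis: k = 10 per margin, 10^2 = 100 in total
m.k10 <- gam(z ~ te(x, y, k = 10), family = binomial, data = my.data)

# Effective degrees of freedom used by the smooth in each fit.
# One basis function is lost to the identifiability constraint, so the
# upper limits are 24 and 99 respectively.
edf.k5  <- summary(m.k5)$edf
edf.k10 <- summary(m.k10)$edf
c(k5 = edf.k5, k10 = edf.k10)

If the edf from the default fit sits close to its limit of 24 and the k = 10 fit uses noticeably more, the default basis was probably restricting the surface.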

I don't know how locfit works, but you do need something that stops you from fitting a surface so complex that it is not supported by the data. GCV and UBRE, or (RE)ML if you choose to use them [you can't as you set scale = -1], are trying to do just that. In other words, you could fit very local features of the data, but are you fitting the noise in the sample of data you collected or are you fitting the mean of the probability distribution? gam() may be telling you something about what can be estimated from your data, assuming that you've sorted out the basis dimensionality (above).

Another thing to look at is that the smoothers you are currently using are global in the sense that the smoothness selected is applied over the entire range of the smooth. Adaptive smoothers can spend the allotted smoothness "allowance" in parts of the data where the response is changing rapidly. gam() has capabilities for using adaptive smoothers.

See ?smooth.terms and ?adaptive.smooth to see what can be fitted using gam(). te() can combine most of these smoothers, but check the docs for which can and can't be included in tensor products; as far as I recall, ?smooth.terms describes the adaptive "ad" basis as not suitable as a component of a tensor product. To try to capture the finer local scale in the parts of the data where the response is varying quickly, you could instead use an adaptive smoother basis directly, e.g. s(x, y, bs = "ad").
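To illustrate the difference between a global and an adaptive smoother, here is a minimal one-dimensional sketch. It uses a Gaussian response and a made-up test function (flat, then rapidly oscillating) rather than the binomial grid data from the question, so treat it as a demonstration of the idea only; k = 40 and m = 5 are the values shown in ?adaptive.smooth.

library(mgcv)

set.seed(3)
n <- 500
x <- sort(runif(n))
# Assumed truth: flat for most of the range, then rapidly changing
f <- ifelse(x < 0.6, 0, sin(40 * (x - 0.6)))
dat <- data.frame(x = x, z = f + rnorm(n, sd = 0.3))

# Global smoother: a single smoothing parameter for the whole range of x
m.global   <- gam(z ~ s(x, k = 40), data = dat, method = "REML")
# Adaptive smoother: the penalty is allowed to vary along x
m.adaptive <- gam(z ~ s(x, bs = "ad", k = 40, m = 5), data = dat, method = "REML")

# Number of smoothing parameters estimated by each model
length(m.global$sp)
length(m.adaptive$sp)

# Compare the two fits
AIC(m.global, m.adaptive)

The adaptive fit estimates several smoothing parameters instead of one, which is also why it costs more to compute.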

I should add, that you can get R to estimate a model with a fixed set of degrees of freedom used by a smooth term, using the fx = TRUE argument to s() and te(). Basically, set k to be what you want and fx = TRUE and gam() will just fit a regression spline of fixed degrees of freedom not a penalised regression spline.
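A small sketch of the fx = TRUE option, again on simulated data (the data frame dat, the true surface and the choice k = 20 are assumptions for illustration only):

library(mgcv)

set.seed(2)
# Hypothetical binary data on scattered (x, y) locations
dat <- data.frame(x = runif(1000), y = runif(1000))
dat$z <- rbinom(1000, size = 1,
                prob = plogis(sin(2 * pi * dat$x) + cos(2 * pi * dat$y)))

# Penalised regression spline: gam() chooses the amount of smoothing
m.pen <- gam(z ~ s(x, y, k = 20), family = binomial, data = dat)
# Unpenalised regression spline: degrees of freedom fixed by k
m.fix <- gam(z ~ s(x, y, k = 20, fx = TRUE), family = binomial, data = dat)

summary(m.pen)$edf  # selected by the smoothness criterion, at most 19
summary(m.fix)$edf  # fixed at k - 1 = 19 (one is lost to the centring constraint)

With fx = TRUE nothing protects you from overfitting, so k becomes the only tuning knob, somewhat like choosing the span by hand.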

Related Solutions

Solved – Why does this logistic GAM fit so poorly

You are ignoring the model intercept when evaluating the model fit. The plot method shows the fitted spline, but the model includes a parametric constant term, just like the intercept in a standard logistic regression model.

Instead, use the predict() method to predict from the fitted model over a grid of locations covering the range of the data. For example:

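library(mgcv)     # gam(), te()
library(plotrix)  # draw.circle()

# Fit the surface with REML smoothness selection
m.gam <- gam(inside ~ te(x, y), data = df, family = binomial, method = "REML")
# 100 evenly spaced values along each axis, expanded to a 100 x 100 grid
locs <- with(df,
             data.frame(x = seq(min(x), max(x), length.out = 100),
                        y = seq(min(y), max(y), length.out = 100)))
pred <- expand.grid(locs)
# Predict on the response (probability) scale; this includes the intercept
pred <- transform(pred,
                  fitted = predict(m.gam, newdata = pred, type = "response"))
contour(locs$x, locs$y, matrix(pred$fitted, ncol = 100))
draw.circle(0, 0, 1, border = "red")  # reference unit circle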

which gives a contour plot of the fitted probabilities with the unit circle overlaid in red (figure not reproduced here).

Using a te() smoother seems to do a bit better than s() and I used method = "REML" as this can help with situations where the objective function in GCV/UBRE-based selection can become flat (and hence these methods can undersmooth), in case that was the problem here.
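The point about the intercept can be checked numerically. The sketch below simulates noisy "inside the unit circle" data (the data frame df and its generating process are assumptions, not the original poster's data) and shows that the linear predictor is the centred smooth, which is what plot() draws, plus the intercept.

library(mgcv)

set.seed(6)
# Hypothetical data: probability of "inside" is high within the unit circle
df <- data.frame(x = runif(800, -2, 2), y = runif(800, -2, 2))
df$inside <- rbinom(800, size = 1,
                    prob = plogis(4 * (1 - (df$x^2 + df$y^2))))

m <- gam(inside ~ te(x, y), data = df, family = binomial, method = "REML")

# Contribution of the (centred) smooth on the link scale, without the intercept
terms <- predict(m, type = "terms")
# Full linear predictor on the link scale
link <- predict(m, type = "link")
intercept <- coef(m)[["(Intercept)"]]

# Smooth plus intercept reproduces the linear predictor
max(abs(as.numeric(rowSums(terms) + intercept) - as.numeric(link)))

# Applying the inverse logit gives the fitted probabilities
prob <- plogis(as.numeric(link))

Judging the fit from the smooth alone therefore shifts everything by the intercept on the logit scale.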

Solved – How to properly smooth a 2d map

I do wonder if there is a big problem here at all? If you look at the very white pixels in your upper figure, they are surrounded by points that are less intensely-coloured (?) white. What you have here are data and the fitted spline is balancing the highly successful point with the less successful ones in the same spatial location.

A basis dimensionality of 15 sounds far too few for a 2-d thin plate or tensor product spline. By this, are you referring to the k in s(x, y, k = 15) or te(x, y, k = 15), or did you set k to something else? With te(x, y, k = 15), k applies to each marginal basis, so you actually have 15*15 as the basis dimension, which would seem more reasonable as a starting point: the smoothness selection will reduce this via the wiggliness penalty. With s(x, y, k = 15), the basis dimension really is 15.

As you have both known 0s and 1s, I would start, if using mgcv, with the actual coordinates of the shots (not aggregated to pixels):

m <- gam(scored ~ te(x, y), data = myDF, family = binomial, method = "REML")

and then check the resulting fit for adequacy, especially via gam.check(). This will tell you whether the basis dimension for the te() term was high enough or not: in the printed output, a k-index close to 1 (or above) together with a high p-value suggests it was, whereas a k-index well below 1 with a low p-value, especially when the edf is close to k', suggests it was not. You can also assess the model diagnostic plots but some of these may be less than useful as the response is binary.

If gam.check() suggests too low a basis dimension (the default for te() is $5^2 = 25$), you should increase k.
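As a sketch of that workflow, k.check() returns the same basis-dimension table that gam.check() prints, without the diagnostic plots. The data frame myDF below is simulated (locations, true surface and sample size are assumptions), so the numbers mean nothing beyond showing the mechanics.

library(mgcv)

set.seed(4)
# Hypothetical shot data: binary outcome at scattered (x, y) locations
myDF <- data.frame(x = runif(1500), y = runif(1500))
myDF$scored <- rbinom(1500, size = 1,
                      prob = plogis(2 * sin(3 * pi * myDF$x) * cos(3 * pi * myDF$y)))

# Default te() basis: k' = 24 after the identifiability constraint
m <- gam(scored ~ te(x, y), data = myDF, family = binomial, method = "REML")
kc <- k.check(m)
kc  # columns: k', edf, k-index, p-value

# If edf is close to k' and the k-index is low, refit with a larger basis
m2 <- gam(scored ~ te(x, y, k = 10), data = myDF, family = binomial,
          method = "REML")
kc2 <- k.check(m2)
kc2

The p-value comes from a randomisation test, so it can vary slightly from run to run.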

It might be that you need an adaptive smoother: you can do s(x, y, bs = "ad") (the "ad" basis is a bivariate smooth in its own right rather than a te() marginal), but this will impact the computational efficiency a lot, as it is effectively going to fit several different GAMs to sections of the data.

If you are on Linux you could consider setting the nthreads option via gam.control to a number equal to the available CPU cores (or thereabouts) in the gam() call: control = gam.control(nthreads = 4). This will utilise multiple threads in some parts of the computation and can for some models greatly decrease the compute time.

You could also look at bam() as a drop-in replacement for gam(), which is designed to work on big data sets.
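A minimal sketch of both options on simulated data (myDF, the true surface and nthreads = 2 are assumptions; whether threading helps depends on the model and on how R and mgcv were built):

library(mgcv)

set.seed(5)
n <- 5000
# Hypothetical shot data
myDF <- data.frame(x = runif(n), y = runif(n))
myDF$scored <- rbinom(n, size = 1,
                      prob = plogis(2 * sin(2 * pi * myDF$x) * cos(2 * pi * myDF$y)))

# gam() with a request for two threads in the parts of the fit that support it
m.gam <- gam(scored ~ te(x, y), data = myDF, family = binomial,
             method = "REML", control = gam.control(nthreads = 2))

# bam() takes the same formula, data and family; fREML is its fast REML method
m.bam <- bam(scored ~ te(x, y), data = myDF, family = binomial,
             method = "fREML")

# The two sets of fitted probabilities should agree closely
agreement <- cor(fitted(m.gam), fitted(m.bam))
agreement

On a data set this small the difference in run time is negligible; the benefit of bam() shows up with much larger n.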
