Solved – How to tune smoothing in mgcv GAM model

mgcvrsmoothing

I am trying to figure out how to control the smoothing parameters in an mgcv::gam model.

I have a binomial variable I am trying to model as primarily a function of x and y coordinates on a fixed grid, plus some other variables with more minor influences. In the past I have constructed a reasonably good local regression model using package locfit and just the (x,y) values.

However, I want to try incorporating the other variables into the model, and it looked like generalized additive models (GAM) were a good possibility. After looking at packages gam and mgcv, both of which have a GAM function, I opted for the latter since a number of comments in mailing list threads seem to recommend it. One downside is that it doesn't seem to support a local regression smoother like loess or locfit.

To start, I just wanted to try to replicate approximately the locfit model, using just (x,y) coordinates. I tried with both regular and tensor product smooths:

my.gam.te <- gam(z ~ te(x, y), 
      family=binomial(logit), data=my.data, 
      scale = -1)  

my.gam.s  <- gam(z ~  s(x, y), 
      family=binomial(logit), data=my.data, 
      scale = -1)

However, plotting the predictions from the model, they are much much more smoothed compared to the locfit model. So I've been trying to tune the model to not oversmooth as much. I've tried adjusting the parameters sp and k, but it's not clear to me how they affect the smoothing. In locfit, the nn parameter controls the span of the neighborhood used, with smaller values allowing for less smoothing and more "wiggling", which helps to capture some areas on the grid where the probability of the binomial outcomes changes rapidly. How would I go about setting up the gam model to enable it to behave similarly?

Best Answer

The k argument effectively sets up the dimensionality of the smoothing matrix for each term. gam() is using a GCV or UBRE score to select an optimal amount of smoothness, but it can only work within the dimensionality of the smoothing matrix. By default, te() smooths have k = 5^2 for 2d surfaces. I forget what it is for s() so check the documents. The current advice from Simon Wood, author of mgcv, is that if the degree of smoothness selected by the model is at or close to the limit of the dimensionality imposed by the value used for k, you should increase k and refit the model to see if a more complex model is selected from the higher dimensional smoothing matrix.

However, I don't know how locfit works, but you do need to have something the stops you from fitting too complex a surface (GCV and UBRE, or (RE)ML if you choose to use them [you can't as you set scale = -1], are trying to do just that), that is not supported by the data. In other words, you could fit very local features of the data but are you fitting the noise in the sample of data you collected or are you fitting the mean of the probability distribution? gam() may be telling you something about what can be estimated from your data, assuming that you've sorted out the basis dimensionality (above).

Another thing to look at is that the smoothers you are currently using are global in the sense that the smoothness selected is applied over the entire range of the smooth. Adaptive smoothers can spend the allotted smoothness "allowance" in parts of the data where the response is changing rapidly. gam() has capabilities for using adaptive smoothers.

See ?smooth.terms and ?adaptive.smooth to see what can be fitted using gam(). te() can combine most if not all of these smoothers (check the docs for which can and can't be included in tensor products) so you could use an adaptive smoother basis to try to capture the finer local scale in the parts of the data where the response is varying quickly.

I should add, that you can get R to estimate a model with a fixed set of degrees of freedom used by a smooth term, using the fx = TRUE argument to s() and te(). Basically, set k to be what you want and fx = TRUE and gam() will just fit a regression spline of fixed degrees of freedom not a penalised regression spline.

Related Question