MGCV in R – Choosing the Optimal Value of k in gam() Function

mgcv

This post (link below) alludes to setting the basis dimension to -1 (k = -1) as automatically choosing the number of knots via Generalized Cross Validation (GCV) in R's mgcv package:

Selecting knots for a GAM

except that k is NOT the number of knots.

I've only come across this specific instance of setting k = -1.

Is there really any benefit to doing so?

The reason I ask this is because I am generating hundreds of simulated datasets, and it would be impossible (impractical) to find the optimal k via gam.check() for each dataset.

So, I'd like to know the merits of simply setting k = -1 with method = "GCV.Cp",to avoid having to check histograms and QQplots for each my simulated data sets.

Of course, setting k to an arbitrarily high value is computationally costly, as the right k is only meant to capture true variation in the data.

Any advice would be most welcomed.

Best Answer

There is some confusion here and in the answer by @Ira S in that linked post. The default value of the argument k is -1. This indicates that the default number of basis functions be computed for the specified basis type (the default is thin plate splines but you can ask for others via the bs argument). So for a univariate thin plate spline you will get 10 basis functions by default because k = -1 implies a default of 10, and in reality you get 9 basis functions as the constant basis function, which is confounded with the model intercept term, gets removed from the basis by the application of a sum-to-zero identifiabilty constraint.

Given a basis expansion, mgcv::gam() will fit the required model using penalised likelihood to estimate parameters for the basis functions and the intercept and any other parametric terms, conditional upon a smoothness parameter, and also estimate the smoothness parameter which is what actually chooses the complexity (wiggliness) of the final fitted smooth function.

mgcv::gam() can use GCV, or REML, or ML to estimate the coefficients and smoothness parameter(s) of the model. It will do this estimation for you whatever value you pass to k. You can only stop it doing this smoothness selection by adding the argument fx = TRUE to the s() call for each smooth.

With mgcv::gam() the main issue you face is to set the initial basis size. You don't need to choose knot locations with the thin plate splines (there is a knot at each unique data value, and then a low-rank version of the full basis expansion with k basis functions is found) and for most of the less exotic bases, knot placement typically makes little to no difference on the fitted model.

You want to set k to be large, as large as you can afford given the amount of data you have, but you don't want it to be too large as it takes a lot more computational effort to work with all those basis functions especially if many/most of them will be penalised away to zero in the resulting model fit.

So, in your case, I would set k to be some large enough value that the expected wiggliness of the true function is accommodated. If you have lots of data and can bear the computational burden, you can effectively set this as high as you you want.

Assuming you have the correct model specified, the penalty should deal with the extra wiggliness.

I have found GCV to be a little bit more robust to model mis-specification for some models I have fitted, I prefer to use REML for smoothness selection and this will become the default in a future version of mgcv so I recommend that you use that, not GCV.

Related Question