mgcv uses a thin plate spline basis as the default basis for its smooth terms. To be honest, it likely makes little difference in many applications which of these bases you choose, though in some situations, or with very large data sets, other basis types might be used to good effect. Thin plate splines tend to have better RMSE performance than the other three you mention but are more computationally expensive to set up. Unless you have a reason to use the P or B spline bases, use thin plate splines; if you have a lot of data, consider the cubic regression spline basis instead.
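For concreteness, here is a minimal sketch of how you select a basis via the `bs` argument to `s()`; `gamSim()` simply simulates placeholder example data:

```r
library(mgcv)

set.seed(1)
dat <- gamSim(1, n = 400, verbose = FALSE)  # simulated example data

m_tp <- gam(y ~ s(x2), data = dat)             # thin plate spline (default, bs = "tp")
m_cr <- gam(y ~ s(x2, bs = "cr"), data = dat)  # cubic regression spline
m_ps <- gam(y ~ s(x2, bs = "ps"), data = dat)  # P-spline
m_bs <- gam(y ~ s(x2, bs = "bs"), data = dat)  # B-spline
```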
`k` doesn't set the number of knots, at least not in the default thin plate spline basis. What `k` does is set the dimensionality of the basis expansion; you'll end up with `k - 1` basis functions. In mgcv, Simon Wood uses a trick to reduce the rank of the basis. IIRC, in the usual thin plate spline basis there is a knot at each data location, but this is wasteful, as once you've set up this large basis you end up using far fewer degrees of freedom in the fitted function. What Simon does is eigendecompose the matrix of basis functions and choose the eigenvectors of the decomposition corresponding to the `k - 1` largest eigenvalues. This has the effect of concentrating the main wiggliness "information" of the full basis in a reduced-rank form.
The choice of `k` is important and the default is arbitrary and something you want to check (see `gam.check()`), but the critical observation is that you want to set `k` to be large enough to contain the envisioned dimensionality of the underlying function you are trying to recover from the data. In practice, one tends to fit with a modest `k` given the data set size and then use `gam.check()` on the resulting model to check if `k` was large enough. If it wasn't, increase `k` and refit. Rinse and repeat...
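A minimal sketch of that loop, with placeholder simulated data; in the `gam.check()` output, a `k-index` well below 1 with a small p-value (and `edf` close to `k'`) suggests `k` was set too low:

```r
library(mgcv)

set.seed(1)
dat <- gamSim(1, n = 400, verbose = FALSE)

m1 <- gam(y ~ s(x2, k = 5), data = dat, method = "REML")
gam.check(m1)  # k-index << 1 with edf near k' => increase k

m2 <- gam(y ~ s(x2, k = 20), data = dat, method = "REML")  # refit with larger k
gam.check(m2)
```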
You are most likely going to want to fit the model using REML (or ML) smoothness selection via `method = "REML"` or `method = "ML"`: this treats the model as a mixed effects one, with the wiggly parts of the spline bases being treated as special random effects terms. Simon Wood has shown that REML (or ML) selection performs better than GCV, which can undersmooth in situations where the objective function is flat around the optimal smoothness parameter value.
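Note that mgcv's default criterion is `"GCV.Cp"`, so REML has to be requested explicitly:

```r
library(mgcv)

set.seed(1)
dat <- gamSim(1, n = 400, verbose = FALSE)

m_gcv  <- gam(y ~ s(x2), data = dat)                   # default: GCV smoothness selection
m_reml <- gam(y ~ s(x2), data = dat, method = "REML")  # usually the safer choice
```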
The ridge penalty mentioned by @generic_user is taken care of for you, so you can ignore this part of setting up the model.
It's not flawed, per se, just pointless. I can think of (and have experienced) situations where the specific type of basis (of the two you mention) can result in markedly different fits. However, where this has happened to me, it has usually been solved by increasing `k` for one of the bases, or because the wrong model was fitted. In these instances, trivial differences in the basis used were magnified by the real problem (needing a larger basis dimension, fitting the right model), not by any fundamental difference in the performance of the individual bases.
In most situations you are going to see trivial differences in the fits of models fitted with different bases (among the standard bases), and these are going to result in trivial differences in AIC. In most cases AIC is therefore going to tell you that the model fits are equivalent.
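For example, reusing the fits from the first sketch above:

```r
AIC(m_tp, m_cr, m_ps)  # typically near-identical AIC across standard bases
```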
If you are planning on using SCAM models, I might suggest you use P splines in the GAMs, as the splines in the scam package are all based on P splines.
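For instance (a sketch; `bs = "mpi"` is one of scam's shape-constrained P-spline bases, here monotone increasing):

```r
library(mgcv)
library(scam)

set.seed(1)
dat <- gamSim(1, n = 400, verbose = FALSE)

m_gam  <- gam(y ~ s(x2, bs = "ps"), data = dat, method = "REML")  # unconstrained P-spline
m_scam <- scam(y ~ s(x2, bs = "mpi"), data = dat)                 # monotone increasing P-spline
```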
Best Answer
You could do what you want for linear terms using the `paraPen` argument to `gam()`, which allows penalties on parametric terms. However, why not treat the linear terms as low-degree smooths (say `k = 3`) and let the double penalty work on them too?

For the categorical terms, I'd just leave them alone; I'm not sure it is possible to apply a group penalty to categories using `paraPen`. For something like `year`, it is highly unlikely that it will have a zero effect (all years exactly the same). I'd be inclined to either:

- treat `year` as categorical and just leave it alone penalty-wise, so you control for between-year differences in the expectation of the response, or
- model it as a smooth via `s(year)`.
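A hedged sketch of these options; the data and the variables `x1`, `x2`, and `year` are placeholders, and the 1 × 1 matrix passed to `paraPen` is simply a ridge penalty on the single linear coefficient:

```r
library(mgcv)

## Placeholder data: two continuous covariates and ten years
set.seed(1)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n),
                  year = rep(2001:2010, each = n / 10))
dat$y <- 0.5 * dat$x1 + sin(2 * pi * dat$x2) +
  0.1 * (dat$year - 2005) + rnorm(n, sd = 0.3)

## Option 1: ridge penalty on the parametric term x1 via paraPen
m1 <- gam(y ~ x1 + s(x2), data = dat,
          paraPen = list(x1 = list(diag(1))), method = "REML")

## Option 2: linear terms as low-degree smooths plus the double penalty
## (select = TRUE), so they can be shrunk to zero; year stays as an
## unpenalised factor controlling for between-year differences
m2 <- gam(y ~ s(x1, k = 3) + s(x2) + factor(year),
          data = dat, method = "REML", select = TRUE)

## Option 3: model year as a smooth instead (k must not exceed the
## number of distinct years, here 10)
m3 <- gam(y ~ s(x1, k = 3) + s(x2) + s(year, k = 5),
          data = dat, method = "REML", select = TRUE)
```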