Overfitting comes from allowing too large a class of models. This gets a bit tricky with continuously parameterized models (like splines and polynomials), but if you discretize the parameters into some number of distinct values, you'll see that increasing the number of knots/coefficients increases the number of available models exponentially. For every dataset there is a spline and a polynomial that fits it precisely, as long as you allow enough coefficients/knots. It may be that a spline with three knots overfits more than a polynomial with three coefficients, but that's hardly a fair comparison.
If you have a small number of parameters and a large dataset, you can be reasonably sure you're not overfitting. If you want to try higher numbers of parameters, you can cross-validate within your training set to find the best number, or you can use a criterion like Minimum Description Length (MDL).
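To make the cross-validation option concrete, here is a minimal sketch in R (the data frame `d`, the helper `cv_mse`, and the polynomial family are my own illustrative choices, not anything prescribed above): fit models of increasing complexity and keep the one with the lowest held-out error.

```r
## Minimal k-fold cross-validation sketch for choosing a polynomial degree.
## `d` must have columns x and y; all names here are illustrative.
cv_mse <- function(d, degree, k = 10) {
  folds <- sample(rep(1:k, length.out = nrow(d)))     # random fold assignment
  errs <- sapply(1:k, function(i) {
    fit  <- lm(y ~ poly(x, degree), data = d[folds != i, ])
    pred <- predict(fit, newdata = d[folds == i, ])
    mean((d$y[folds == i] - pred)^2)                  # held-out mean squared error
  })
  mean(errs)
}

set.seed(1)
d <- data.frame(x = runif(100))
d$y <- sin(2 * pi * d$x) + rnorm(100, sd = 0.3)
sapply(1:8, function(p) cv_mse(d, p))  # pick the degree with the lowest CV error
```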
EDIT: As requested in the comments, here is an example of how one would apply MDL. First you have to deal with the fact that your data are continuous, so they can't be represented by a finite code. For the sake of simplicity we'll segment the data space into boxes of side $\epsilon$, and instead of describing the data points exactly, we'll describe the boxes they fall into. This means we lose some accuracy, but we can make $\epsilon$ arbitrarily small, so it doesn't matter much.
Now the task is to describe the dataset as succinctly as possible with the help of some polynomial. First we describe the polynomial. For an $n$-th order polynomial we need to store its $n+1$ coefficients, again discretized. We store first the value $n$ in a prefix-free code (so we know when to stop reading) and then the $n+1$ coefficient values. With this information a receiver of our code could reconstruct the polynomial. Then we add the rest of the information required to store the dataset: for each data point we give the x-value, and then how many boxes up or down the point lies off the polynomial. We store both values in prefix-free codes, so that small values require few bits and we won't need delimiters between points. (You can shorten the code for the x-values by storing only the increments between consecutive values.)
The fundamental point here is the tradeoff. If I choose a zero-order polynomial (like $f(x) = 3.4$), the model is very cheap to store, but for the y-values I'm essentially storing each point's distance to the mean. More coefficients give me a better-fitting polynomial (and thus shorter codes for the y-values), but I have to spend more bits describing the model. The model that gives you the shortest total code for your data is the best fit by the MDL criterion.
(Note that this is known as 'crude MDL', and there are some refinements you can make to solve various technical issues).
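To make the bookkeeping above concrete, here is a crude-MDL sketch in R under some simplifying assumptions of mine: coefficients are stored at a fixed precision of 32 bits each, integers use an Elias-gamma-style prefix-free code of $2\lfloor\log_2 m\rfloor + 1$ bits, and the x-value code is omitted since it costs the same for every candidate polynomial.

```r
## Crude-MDL sketch (my own illustrative implementation, not from the answer).
## Residuals are snapped to boxes of side `eps`; integers get an
## Elias-gamma-style prefix-free code.
gamma_bits <- function(m) 2 * floor(log2(m)) + 1              # code length, m >= 1
zigzag     <- function(z) ifelse(z >= 0, 2 * z + 1, -2 * z)   # Z -> {1, 2, 3, ...}

mdl_bits <- function(x, y, degree, eps = 0.01, coef_bits = 32) {
  fit   <- lm(y ~ poly(x, degree, raw = TRUE))
  boxes <- round(residuals(fit) / eps)            # boxes up/down off the polynomial
  model_bits <- gamma_bits(degree + 1) +          # prefix-free code for n
    (degree + 1) * coef_bits                      # n+1 coefficients at fixed precision
  data_bits  <- sum(gamma_bits(zigzag(boxes)))    # y-residual codes
  model_bits + data_bits   # x-values cost the same for every degree, so omitted
}

set.seed(1)
x <- sort(runif(200))
y <- 1 + 2 * x - 3 * x^2 + rnorm(200, sd = 0.05)
sapply(1:8, function(p) mdl_bits(x, y, p))  # smallest total wins by crude MDL
```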
You are right that the imputation model needs to be as rich as or richer than the outcome model. The fact that imputation based on full maximum likelihood estimation and imputation done by `mice` assume linearity everywhere was a prime reason I wrote the `aregImpute` function in the R `Hmisc` package, which creates imputation models automatically using rich additive restricted cubic spline models. So linearity is not assumed for multiple imputation. The default approach in `aregImpute` is predictive mean matching (PMM), which I generally prefer over more parametric approaches (splines are still used; PMM is less parametric on the left-hand side of models).

Like `mice`, `aregImpute` uses chained equations. Unlike `mice`, it uses bootstrap draws instead of approximate (multivariate-normality-assuming) Bayesian posterior draws.
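For readers who want the call pattern, a minimal sketch (the data frame `d` and its columns are placeholders of mine; see the `Hmisc` documentation for the full argument list):

```r
library(Hmisc)
library(rms)

## Sketch: multiple imputation with aregImpute, then pooled model fitting.
## `d` is a data frame with outcome y and predictors x1, x2, x3 (placeholders).
set.seed(1)
a <- aregImpute(~ y + x1 + x2 + x3, data = d,
                n.impute = 5,  # number of completed datasets
                nk = 3)        # knots for the restricted cubic spline transforms
## type = 'pmm' (predictive mean matching) is the default

## Fit the outcome model on each completed dataset and pool the results.
f <- fit.mult.impute(y ~ rcs(x1, 3) + x2 + x3, ols, a, data = d)
```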
Best Answer
Your graphs do indeed look (to me) as though four or five knots entail some slight overfitting, and I personally would tend to use three.
If you want a more formal procedure, Frank Harrell in his *Regression Modeling Strategies* (section 2.4.5) suggests using Akaike's Information Criterion (AIC). Alternatively, you could cross-validate and pick the number of knots yielding the lowest error.
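As a sketch of the AIC comparison using base R (`splines::ns` gives a natural cubic spline basis, the same restricted family; Harrell's own `rms::rcs` would be the closer match, and the data here are simulated stand-ins):

```r
library(splines)

## Sketch: compare natural cubic spline fits with different numbers of knots
## by AIC; `d` with columns x and y is a placeholder of mine.
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(3 * d$x) + rnorm(200, sd = 0.2)

## df = number of basis functions; ns() places df - 1 interior knots at
## quantiles of x.
aics <- sapply(2:6, function(df) AIC(lm(y ~ ns(x, df = df), data = d)))
names(aics) <- paste0("df=", 2:6)
aics  # choose the df (knot count) with the smallest AIC
```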