Overfitting comes from allowing too large a class of models. This gets a bit tricky with models with continuous parameters (like splines and polynomials), but if you discretize the parameters into some number of distinct values, you'll see that increasing the number of knots/coefficients will increase the number of available models exponentially. For every dataset there is a spline and a polynomial that fits precisely, so long as you allow enough coefficients/knots. It may be that a spline with three knots overfits more than a polynomial with three coefficients, but that's hardly a fair comparison.
If you have a small number of parameters and a large dataset, you can be reasonably sure you're not overfitting. If you want to try higher numbers of parameters, you can cross-validate within your training set to find the best number, or you can use a criterion like Minimum Description Length (MDL).
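To make the cross-validation idea concrete, here is a minimal sketch in Python for picking a polynomial degree. The helper name `cv_best_degree` and all the settings (fold count, degree range, noise level) are purely illustrative:

```python
import numpy as np

def cv_best_degree(x, y, max_degree=8, n_folds=5, seed=0):
    """Pick a polynomial degree by k-fold cross-validation (illustrative)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, n_folds)
    cv_err = []
    for d in range(max_degree + 1):
        fold_errs = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)          # all points not in this fold
            coefs = np.polyfit(x[train], y[train], d)
            pred = np.polyval(coefs, x[fold])
            fold_errs.append(np.mean((y[fold] - pred) ** 2))
        cv_err.append(np.mean(fold_errs))
    return int(np.argmin(cv_err)), cv_err

# Noisy quadratic: degrees 0 and 1 underfit badly, so CV should pick d >= 2.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, x.size)
best, errs = cv_best_degree(x, y)
```

The same loop works for knot counts in a spline basis: replace `np.polyfit`/`np.polyval` with your spline fitting and prediction steps.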
EDIT: As requested in the comments, an example of how one would apply MDL. First you have to deal with the fact that your data is continuous, so it can't be represented in a finite code. For the sake of simplicity we'll segment the data space into boxes of side $\epsilon$ and instead of describing the data points, we'll describe the boxes that the data falls into. This means we lose some accuracy, but we can make $\epsilon$ arbitrarily small, so it doesn't matter much.
Now, the task is to describe the dataset as succinctly as possible with the help of some polynomial. First we describe the polynomial. For an $n$-th order polynomial we need to store its $n+1$ coefficients, again discretized. We store first the value $n$ in a prefix-free code (so we know when to stop reading) and then the $n+1$ coefficient values. With this information a receiver of our code can reconstruct the polynomial. Then we add the rest of the information required to store the dataset: for each data point we give the $x$-value, and then how many boxes up or down the data point lies from the polynomial. Both values are stored in prefix-free codes, so that small values require few bits and we need no delimiters between points. (You can shorten the code for the $x$-values by storing only the increments between consecutive values.)
The fundamental point here is the tradeoff. If I choose a zero-order polynomial (like $f(x) = 3.4$), the model is very cheap to store, but for the $y$-values I'm essentially storing the distance to the mean. More coefficients give a better-fitting polynomial (and thus shorter codes for the $y$-values), but I have to spend more bits describing the model. The model that gives the shortest total code for your data is the best fit by the MDL criterion.
(Note that this is known as 'crude MDL', and there are some refinements you can make to solve various technical issues).
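To make the tradeoff numeric, here is a rough sketch of the two-part code length in Python. The specific encoding choices (16 bits per coefficient, an Elias-gamma-style integer code, the box size $\epsilon$) are my own illustrative assumptions, not part of any standard:

```python
import numpy as np

def gamma_code_bits(k):
    """Bits to store a nonnegative integer in an Elias-gamma-style prefix-free code."""
    return 2 * np.floor(np.log2(k + 1)) + 1

def mdl_bits(x, y, degree, eps=0.01, coef_bits=16):
    """Crude two-part code length: bits for the polynomial + bits for residuals."""
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    boxes = np.abs(np.round(resid / eps)).astype(int)   # residual box index
    model_bits = gamma_code_bits(degree) + (degree + 1) * coef_bits
    data_bits = np.sum(gamma_code_bits(boxes) + 1)      # +1 sign bit per point
    return model_bits + data_bits

# Data generated from a cubic: the total code length should drop sharply at
# degree 3, while higher degrees pay model bits for little residual gain.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
y = 0.5 - 1.5 * x + 2.0 * x**3 + rng.normal(0.0, 0.05, x.size)
bits = [mdl_bits(x, y, d) for d in range(9)]
```

(For a fair comparison you would also count the bits for the $x$-values, but those are identical across models, so they cancel when comparing code lengths.)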
Best Answer
From my reading, the two concepts you ask us to compare are quite different beasts, so this would be an apples-and-oranges comparison. That makes many of your questions somewhat moot; ideally (assuming one can write down a wiggliness penalty for the RCS basis in the required form) you would use a penalized restricted cubic regression spline model.
Restricted Cubic Splines
A restricted cubic spline (or a natural spline) is a spline basis built from piecewise cubic polynomial functions that join smoothly at some pre-specified locations, or knots. What distinguishes a restricted cubic spline from a cubic spline is that additional constraints are imposed on the restricted version such that the spline is linear before the first knot and after the last knot. This is done to improve performance of the spline in the tails of $X$.
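To show what such a basis looks like, here is a sketch of one common construction in Python (Harrell's unscaled parameterization, as I recall it; treat the exact details as an assumption). With $k$ knots $t_1 < \dots < t_k$ you get the linear term plus $k-2$ nonlinear terms, each constrained to be linear outside the boundary knots:

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis (Harrell-style, unscaled).

    Returns columns [x, C_1(x), ..., C_{k-2}(x)]. The tail constraints make
    every column linear below the first knot and above the last knot.
    """
    x = np.asarray(x, dtype=float)
    t = np.sort(np.asarray(knots, dtype=float))
    k = len(t)
    pos3 = lambda u: np.maximum(u, 0.0) ** 3   # truncated cubic (u)_+^3
    cols = [x]
    for j in range(k - 2):
        c = (pos3(x - t[j])
             - pos3(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
             + pos3(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
        cols.append(c)
    return np.column_stack(cols)

x = np.linspace(0, 10, 501)
B = rcs_basis(x, knots=[1, 3, 5, 7, 9])   # 5 knots -> 4 basis columns
```

The weights on the last two truncated cubics are chosen exactly so that the $x^3$ and $x^2$ terms cancel beyond the final knot, which is where the linearity in the tails comes from.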
Model selection with an RCS typically involves choosing the number of knots and their locations, with the former governing how wiggly or complex the resulting spline is. Unless further steps are taken to regularize the estimated coefficients during fitting, the number of knots directly controls spline complexity.
This means that the user has some problems to overcome when estimating a model containing one or more RCS terms:

1. how many knots to use,
2. where to place those knots, and
3. how to compare models with different numbers or placements of knots without overfitting through repeated model selection.
On their own, RCS terms require user intervention to solve these problems.
Penalized splines
Penalized regression splines (sensu Hodges) on their own tackle issue 3. only, but they allow issue 1. to be circumvented. The idea is that, alongside the basis expansion of $X$ (for now, assume a cubic spline basis), you also create a wiggliness penalty matrix. Wiggliness is measured through some derivative of the estimated spline, typically the second derivative, and the penalty itself is the squared second derivative integrated over the range of $X$. This penalty can be written in quadratic form as
$$\boldsymbol{\beta}^{\mathsf{T}} \boldsymbol{S} \boldsymbol{\beta}$$
where $\boldsymbol{S}$ is the penalty matrix and $\boldsymbol{\beta}$ is the vector of model coefficients. Coefficient values are then found by maximizing the penalized log-likelihood criterion $\mathcal{L}_p$
$$\mathcal{L}_p = \mathcal{L} - \lambda \boldsymbol{\beta}^{\mathsf{T}} \boldsymbol{S} \boldsymbol{\beta}$$
where $\mathcal{L}$ is the log-likelihood of the model and $\lambda$ is the smoothness parameter, which controls how strongly to penalize the wiggliness of the spline.
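In the Gaussian case, maximizing $\mathcal{L}_p$ is equivalent to minimizing the penalized sum of squares $\lVert \boldsymbol{y} - \boldsymbol{B}\boldsymbol{\beta} \rVert^2 + \lambda \boldsymbol{\beta}^{\mathsf{T}}\boldsymbol{S}\boldsymbol{\beta}$, which has a closed-form solution. A sketch in Python, using a truncated-power cubic basis and a finite-difference approximation to the integrated squared second derivative (both choices are mine, purely for illustration):

```python
import numpy as np

def penalized_fit(B, y, S, lam):
    """Gaussian case: solve (B'B + lam*S) beta = B'y for the coefficients."""
    return np.linalg.solve(B.T @ B + lam * S, B.T @ y)

def tpower_basis(x, knots):
    """Cubic truncated-power basis: 1, x, x^2, x^3, (x - t)_+^3 for each knot."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.maximum(x - t, 0.0) ** 3 for t in knots]
    return np.column_stack(cols)

# Approximate S = integral of products of second derivatives of the basis
# functions, via finite differences on a fine grid.
xg = np.linspace(0, 1, 2001)
knots = np.linspace(0.1, 0.9, 9)
Bg = tpower_basis(xg, knots)
d2 = np.diff(Bg, n=2, axis=0) / (xg[1] - xg[0]) ** 2
S = d2.T @ d2 * (xg[1] - xg[0])

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 300)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)
B = tpower_basis(x, knots)
beta_small = penalized_fit(B, y, S, lam=1e-6)   # barely penalized: wiggly
beta_big = penalized_fit(B, y, S, lam=1e3)      # heavily penalized: near-linear
```

Larger $\lambda$ shrinks the fit toward the null space of the penalty (here, functions with zero second derivative, i.e. straight lines), so the wiggliness $\boldsymbol{\beta}^{\mathsf{T}}\boldsymbol{S}\boldsymbol{\beta}$ decreases as $\lambda$ grows.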
As the penalised log-likelihood can be evaluated in terms of the model coefficients, fitting this model effectively becomes a problem in finding an optimal value for $\lambda$ whilst updating the coefficients during the search for that optimal $\lambda$.
$\lambda$ can be chosen using cross-validation, generalized cross-validation (GCV), or marginal likelihood or restricted marginal likelihood criteria. The latter two effectively recast the spline model as a mixed effects model (the perfectly smooth parts of the basis become fixed effects, the wiggly parts of the basis become random effects, and the smoothness parameter is inversely related to the variance term for the random effects), which is what Hodges considers in his book.
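As a concrete toy example of the GCV route: the score is $n\,\mathrm{RSS}/(n - \mathrm{tr}(\boldsymbol{A}))^2$, where $\boldsymbol{A} = \boldsymbol{B}(\boldsymbol{B}^{\mathsf{T}}\boldsymbol{B} + \lambda\boldsymbol{S})^{-1}\boldsymbol{B}^{\mathsf{T}}$ is the influence (hat) matrix, and you minimize it over a grid of $\lambda$ values. The basis and penalty below are deliberately crude stand-ins:

```python
import numpy as np

def gcv_score(B, y, S, lam):
    """Generalised cross-validation score: n * RSS / (n - tr(A))^2."""
    n = len(y)
    A = B @ np.linalg.solve(B.T @ B + lam * S, B.T)   # influence matrix
    resid = y - A @ y
    return n * (resid @ resid) / (n - np.trace(A)) ** 2

# Toy setup: degree-7 polynomial basis with a second-difference penalty on
# the coefficients (a stand-in for a proper derivative-based penalty).
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)
B = np.column_stack([x**p for p in range(8)])
D = np.diff(np.eye(B.shape[1]), n=2, axis=0)
S = D.T @ D

lams = 10.0 ** np.arange(-8, 4)
scores = [gcv_score(B, y, S, lam) for lam in lams]
best_lam = lams[int(np.argmin(scores))]
```

In practice you would not form the $n \times n$ influence matrix explicitly (its trace can be computed much more cheaply), but the objective being minimized is the same.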
Why does this solve the problem of how many knots to use? Well, it only partly does. It removes the need for a knot at every unique data point (a smoothing spline), but you still have to choose how many knots or basis functions to use. However, because the penalty shrinks the coefficients, you can get away with choosing a basis dimension large enough to contain the true function, or a close approximation to it, and then let the penalty remove whatever excess wiggliness the basis could otherwise provide.
Comparison
Penalized (regression) splines and RCS are quite different concepts. There is nothing stopping you from creating an RCS basis and an associated penalty in quadratic form, and then estimating the spline coefficients using the ideas from the penalized regression spline model.
RCS is just one kind of basis you can use to build a spline, while penalized regression splines are one way to estimate a model containing one or more splines with associated wiggliness penalties.
Can we avoid issues 1., 2., and 3.?
Yes, to some extent, with a thin plate spline (TPS) basis. A TPS basis has as many basis functions as unique data values in $X$. What Wood (2003) showed is that you can create a thin plate regression spline (TPRS) basis by taking an eigendecomposition of the TPS basis functions and retaining only the $k$ largest components, say. You still have to specify $k$, the number of basis functions to use, but the choice is generally based on how wiggly you expect the fitted function to be and how much of a computational hit you are willing to take. There is no need to specify knot locations either, and because the penalty shrinks the coefficients you avoid the model selection problem: you fit one penalized model rather than many unpenalized ones with differing numbers of knots.
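A loose 1-D sketch of the rank-reduction idea in Python (this captures only the spirit of Wood's construction, not his exact algebra, and I omit the penalty handling entirely): build the full radial basis, eigendecompose it, and keep the $k$ components with the largest-magnitude eigenvalues alongside the linear null space:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)

# Full 1-D thin-plate-style radial basis: one function |x - x_i|^3 per
# unique data value, plus a null space of strictly linear functions.
E = np.abs(x[:, None] - x[None, :]) ** 3

# Rank reduction in the spirit of Wood (2003): eigendecompose E and keep
# only the k eigenvectors with largest-magnitude eigenvalues.
k = 10
w, U = np.linalg.eigh(E)
order = np.argsort(np.abs(w))[::-1][:k]
Uk = U[:, order]

# Reduced model: intercept + linear term + k rotated "wiggly" functions.
X = np.column_stack([np.ones_like(x), x, E @ Uk])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
```

The retained eigenvectors correspond to the smoothest, most important directions of the full basis, which is why a modest $k$ usually suffices for a smooth target function.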
P-splines
Just to make things more complicated, there is a type of spline basis known as a P-spline (Eilers & Marx, 1996), where the "P" is often interpreted as "penalized". P-splines combine a B-spline basis with a difference penalty applied directly to the model coefficients. In typical use the P-spline penalty penalizes the squared differences between adjacent model coefficients, which in turn penalizes wiggliness. P-splines are very easy to set up and result in a sparse penalty matrix, which makes them very amenable to estimating spline terms in MCMC-based Bayesian models (Wood, 2017).
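A P-spline sketch in Python (using SciPy's `BSpline` to evaluate the basis; the number of basis functions and the value of $\lambda$ are arbitrary choices here). The penalty matrix is simply $\boldsymbol{D}^{\mathsf{T}}\boldsymbol{D}$ for a second-order difference matrix $\boldsymbol{D}$, with no integrals required:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, n_basis, degree=3):
    """Evaluate a clamped cubic B-spline basis with equally spaced interior knots."""
    n_interior = n_basis - degree - 1
    t = np.r_[[x.min()] * (degree + 1),
              np.linspace(x.min(), x.max(), n_interior + 2)[1:-1],
              [x.max()] * (degree + 1)]
    return np.column_stack([BSpline(t, np.eye(n_basis)[j], degree)(x)
                            for j in range(n_basis)])

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)

m = 20
B = bspline_basis(x, m)
D = np.diff(np.eye(m), n=2, axis=0)   # second-order difference operator
P = D.T @ D                           # P-spline penalty on adjacent coefficients
lam = 1.0
beta = np.linalg.solve(B.T @ B + lam * P, B.T @ y)
fitted = B @ beta
```

Because each B-spline basis function is nonzero over only a few adjacent knot intervals, both $\boldsymbol{B}^{\mathsf{T}}\boldsymbol{B}$ and $\boldsymbol{P}$ are banded, which is the sparsity property mentioned above.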
References
Eilers, P. H. C., and B. D. Marx. 1996. Flexible Smoothing with B-splines and Penalties. Stat. Sci.
Wood, S. N. 2003. Thin plate regression splines. J. R. Stat. Soc. Series B Stat. Methodol. 65: 95–114. doi:10.1111/1467-9868.00374
Wood, S. N. 2017. Generalized Additive Models: An Introduction with R, Second Edition, CRC Press.