But, in my opinion, wouldn't that be overfitting?
No.
Your equation explains it all.
$$\underbrace{\sum_{i=1}^n(y_i-g(x_i))^2}_\text{residual squares}+\underbrace{\lambda\int g''(t)^2dt}_\text{roughness penalty}$$
The second part, $\lambda\int g''(t)^2dt$, is often called a roughness penalty, and $\lambda$ the roughness coefficient. The idea is that the first and second parts compete. Think of it this way: if you make your function satisfy $g(x_i)=y_i$, i.e. go through each point exactly, then $\sum_{i=1}^n(y_i-g(x_i))^2=0$, but this usually makes the function very bumpy; it goes up and down trying to pass through each observation, and the observations contain noise. The bumpiness increases the contribution of the second part, because $g''(x)$ will generally be larger, and depending on $\lambda$ the second part may become very large. Note that $g''(x)$ is an approximation of the curvature of the spline.
So you may find a curve that does not go exactly through each point, with $g(x_i)\ne y_i$ and $\sum_{i=1}^n(y_i-g(x_i))^2>0$, but that is less bumpy and more smooth, so that $g''(x)$ becomes smaller and the increase in the first part is compensated by the decrease in the second. Therefore, the roughness penalty does what shrinkage does: it cures overfitting.
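To see the competition numerically, here is a minimal sketch using a discrete analogue of the objective: $g''$ is replaced by second differences of $g$ on a grid, so the penalized problem becomes a linear solve. The function name `smooth` and the toy data are my own illustration, not part of the question.

```python
import numpy as np

def smooth(y, lam):
    """Minimize ||y - g||^2 + lam * ||D2 g||^2, where D2 takes second
    differences; a discrete analogue of the smoothing spline objective."""
    n = len(y)
    D2 = np.diff(np.eye(n), n=2, axis=0)  # (n-2) x n second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(50)  # noisy observations

for lam in (0.0, 10.0):
    g = smooth(y, lam)
    rss = np.sum((y - g) ** 2)          # first term: residual squares
    rough = np.sum(np.diff(g, 2) ** 2)  # second term: roughness
    print(f"lam={lam}: RSS={rss:.3f}, roughness={rough:.3f}")
```

With `lam=0` the fit reproduces the data exactly (zero residuals, large roughness); with `lam=10` the residuals grow a little while the roughness drops sharply, which is exactly the trade-off described above.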
Note that the equation you gave is not the only possible way to build a smoothing spline; it is probably the simplest and most intuitive one. You could replace the second part with something different, e.g. $\lambda\int g'(t)^2dt$, which would lead to the Laplacian kernel. This penalty minimizes (approximately) the length of the smooth curve.
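As a rough discrete illustration of what the alternative penalty does (a sketch under my own assumptions, not the kernel derivation): penalizing first differences, $\|y-g\|^2+\lambda\|D_1 g\|^2$, pulls the fit toward a constant (the mean of $y$) as $\lambda$ grows, whereas a second-derivative penalty pulls it toward a straight line.

```python
import numpy as np

def smooth_d1(y, lam):
    """Minimize ||y - g||^2 + lam * ||D1 g||^2 with a first-difference
    penalty; for large lam the solution approaches the constant mean(y),
    because constants are the null space of the first-difference operator."""
    n = len(y)
    D1 = np.diff(np.eye(n), n=1, axis=0)  # (n-1) x n first-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D1.T @ D1, y)

y = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
print(smooth_d1(y, 1e6))  # all entries close to mean(y) = 6.0
```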
The example actually has a simple physical representation. Let's start with an ordinary spline. Imagine that we nail rings to a board at the coordinates $(x_i,y_i)$, then pass a flat spline (a thin flexible strip) through each ring. The shape the strip takes is what you get from an ordinary (cubic) spline. Here is how it looks (the picture is from Wikipedia):
Now, instead of the rings, we nail springs at the same points and attach the spline to the springs. Since the springs can extend, the spline will no longer go through each observation! It relaxes a bit. What defines the shape of the new spline? The competition between the potential energy of the springs and the energy of tension in the flat spline: the more you bend the flat spline, the more energy goes into its tension, just as with a spring extension.
So, if you recall, the potential energy of a spring is proportional to the square of its extension, which here is given by the error (residual) $e_i=y_i-g(x_i)$; summing over the springs gives the sum of squares in the first part of your smoothing spline equation.
Now the second part of your equation gives the potential energy of the tension in the spline; in my example, $\lambda\int g'(t)^2dt$ represents an approximation of the length of the spline. So the shape of the spline will be the one that minimizes the total potential energy: residual plus roughness terms in your case, or the potential energy of the spring extensions plus the length of the spline in my example.
Best Answer
I was hoping to find a more intuitive proof, but the matrix can be shown to be invertible by directly showing that its determinant is nonzero.
First note that the knots are located at the $K$ unique values of $\mathbf{x}$. Let $\boldsymbol{\xi}$ represent these knots, in ascending order.
(Recall the natural-spline basis: $N_1(x)=1$, $N_2(x)=x$, and $N_{j+2}(x) = d_j(x) - d_{K-1}(x)$ with $d_j(x) = \frac{(x-\xi_j)_+^3 - (x-\xi_K)_+^3}{\xi_K - \xi_j}$.) Also, $d_j(\xi_i) = 0$ if $i \le j$, and so $N_{i,j} = 0$ if $j > i + 1$, which makes $\mathbf{N}$ "almost lower triangular". In fact, it is a lower triangular matrix with a column of $1$'s appended to the front.
We can take advantage of this structure to obtain a simple expression for the determinant. First we transform $\mathbf{N}$ by subtracting the first row from all others. Call this matrix $\mathbf{M}$. (These row operations do not affect the determinant, so $\text{det}(\mathbf{M}) = \text{det}(\mathbf{N})$.)
We have: $$ \begin{aligned} \mathbf{M} &= \begin{pmatrix} 1 & \xi_1 & 0 & 0 & \cdots & 0 \\ 0 & \xi_2 - \xi_1 & 0 & 0 & \cdots & 0 \\ 0 & \xi_3 - \xi_1 & N_3(\xi_3) & 0 & \cdots & 0 \\ 0 & \xi_4 - \xi_1 & N_3(\xi_4) & N_4(\xi_4) & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \xi_K - \xi_1 & N_3(\xi_K) & N_4(\xi_K) & \cdots & N_K(\xi_K) \end{pmatrix} \\ \end{aligned} $$
By expanding along the first column, we see that the determinant of $\mathbf{M}$ is equal to that of the sub-matrix formed by deleting the first column and the first row. Because this sub-matrix is lower triangular, its determinant is the product of its diagonal entries:
$$ \text{det}(\mathbf{M}) = (\xi_2 - \xi_1) \prod_{i=3}^K N_i(\xi_i) $$
We know that $\xi_2 - \xi_1 \ne 0$. Thus it suffices to show that $N_i(\xi_i) \ne 0$ for all $i \in \{3, \dots, K\}$, as follows:
$$ \begin{aligned} N_i(\xi_i) &= d_{i-2}(\xi_i) - d_{K-1}(\xi_i) \\ &= \frac{(\xi_i - \xi_{i-2})_+^3 - (\xi_i-\xi_{K})_+^3}{\xi_{K}-\xi_{i-2}} - \frac{(\xi_i-\xi_{K-1})_+^3 - (\xi_i-\xi_{K})_+^3}{\xi_{K}-\xi_{K-1}} \\ &= \frac{(\xi_i - \xi_{i-2})_+^3}{\xi_{K}-\xi_{i-2}} - \frac{(\xi_i-\xi_{K-1})_+^3}{\xi_{K}-\xi_{K-1}} \\ \end{aligned} $$
If $i < K$, then we have $$ N_i(\xi_i) = \frac{(\xi_i - \xi_{i-2})_+^3}{\xi_{K}-\xi_{i-2}} \ne 0 $$ by the fact that the $\xi_i$ are strictly increasing. If $i = K$, then we have $$ \begin{aligned} N_K(\xi_K) &= (\xi_K - \xi_{K-2})^2 - (\xi_K - \xi_{K-1})^2 \\ &> (\xi_K - \xi_{K-2})^2 - (\xi_K - \xi_{K-2})^2 = 0, \end{aligned} $$ since $\xi_{K-2} < \xi_{K-1}$.
So $N_K(\xi_K) \ne 0$ and the proof is complete.
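As a numerical sanity check of the conclusion, one can build $\mathbf{N}$ for a small set of knots from the natural-spline basis ($N_1(x)=1$, $N_2(x)=x$, $N_{k+2}=d_k-d_{K-1}$) and confirm that its determinant is nonzero. The knot values below are arbitrary examples of my own choosing.

```python
import numpy as np

xi = np.array([0.0, 0.7, 1.1, 2.3, 3.0])  # example knots, ascending
K = len(xi)

def d(k, x):
    # d_k(x) = [(x - xi_k)_+^3 - (x - xi_K)_+^3] / (xi_K - xi_k), 0-indexed
    p = lambda t: np.maximum(t, 0.0) ** 3
    return (p(x - xi[k]) - p(x - xi[K - 1])) / (xi[K - 1] - xi[k])

# Column j (0-indexed) holds basis function N_{j+1} evaluated at the knots.
N = np.ones((K, K))   # first column: N_1 = 1
N[:, 1] = xi          # second column: N_2 = x
for j in range(2, K):
    N[:, j] = d(j - 2, xi) - d(K - 2, xi)  # N_{k+2} = d_k - d_{K-1}

print(np.linalg.det(N))  # prints a nonzero value: N is invertible
```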