If the locations of the knots (breakpoints) were constants, the joint distribution of the process values at $x$ and $x+\delta$ would depend on whether a breakpoint falls in the interval, not only on $\delta$, as stationarity requires. Therefore, I assume the question intends the knots to be random as well. In that case, obtaining piecewise linear sample paths with positive probability is not possible, as shown below.
I consider the one-dimensional case and denote the value of the Gaussian process at input $x_k$ by $f(x_k)$.
If the sample paths are piecewise linear with random breakpoint locations, there must exist three distinct inputs $x_1,x_2,x_3$ such that the event 'the three points $(x_1,f(x_1)),(x_2,f(x_2)),(x_3,f(x_3))$ lie on the same line' has probability strictly between zero and one. The three points are collinear exactly when the slopes between consecutive pairs agree:
\begin{equation}
\frac{f(x_2) - f(x_1)}{x_2 - x_1} = \frac{f(x_3) - f(x_2)}{x_3 - x_2}
\end{equation}
or equivalently
\begin{equation}
(x_2 - x_3)\,f(x_1) + (x_3 - x_1)\, f(x_2) + (x_1 - x_2)\,f(x_3) = 0.
\end{equation}
That is, the probability of the points being on the same line is $\Pr(Z=0)$, where
\begin{equation}
Z := (x_2 - x_3)\,f(x_1) + (x_3 - x_1)\, f(x_2) + (x_1 - x_2)\,f(x_3).
\end{equation}
Note that $Z$ is a linear combination of $f(x_1),f(x_2),f(x_3)$. Since $f$ is a Gaussian process, the joint distribution of $(f(x_1),f(x_2),f(x_3))$ is multivariate Gaussian, and therefore the distribution of $Z$ is Gaussian. However, for a Gaussian $Z$, $\Pr(Z=0)$ is either $0$ (when $\operatorname{Var}(Z)>0$) or $1$ (when $Z$ is degenerate at $0$), contradicting the requirement that this probability be strictly between zero and one.
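If you want to see this numerically, here is a quick Monte Carlo sanity check (the squared-exponential kernel and the three inputs are arbitrary choices; any kernel with a non-degenerate $3\times 3$ Gram matrix behaves the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.5])                      # three distinct inputs
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # 3x3 Gram matrix (SE kernel)
f = rng.multivariate_normal(np.zeros(3), K, size=100_000)

# Z = (x2 - x3) f(x1) + (x3 - x1) f(x2) + (x1 - x2) f(x3)
c = np.array([x[1] - x[2], x[2] - x[0], x[0] - x[1]])
Z = f @ c
print("Var(Z) =", Z.var())                    # strictly positive: Z non-degenerate
print("fraction with Z == 0:", np.mean(Z == 0.0))  # 0.0 across all samples
```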
TL;DR: Kalman filters and smoothers can be viewed as solvers for a family of Gaussian process (GP) regression models, in particular, Markov Gaussian processes.
Say, for example, we have a GP regression model
$$
\begin{split}
U(t) &\sim \mathrm{GP}(0, C(t, t')),\\
Y_k &= U(t_k) + \xi_k,
\end{split}\tag{1}
$$
Here, $\xi_k$ stands for i.i.d. Gaussian measurement noise. Now, what is the goal of GP regression? The goal is to learn the posterior distribution $p(u_{1:T} \mid y_{1:T})$ jointly for any times $t_1,\ldots,t_T$ with data $y_{1:T}$. However, this is known to be expensive, as you need to invert a $T \times T$ matrix, which costs $O(T^3)$ in general. Also, in practice we are mostly interested in the marginal posteriors $p(u_k \mid y_{1:T})$ for $k=1,2,\ldots,T$ rather than the joint one. So, is there any efficient solver for $\{p(u_k \mid y_{1:T})\}_{k=1}^T$?
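For concreteness, here is a minimal sketch of the expensive batch computation just described (the exponential kernel and the noise level are illustrative choices); note the $T \times T$ solve:

```python
import numpy as np

def gp_posterior(t, y, noise_var, ell=1.0):
    """Batch GP posterior at the data points for model (1).

    An exponential kernel is used as an illustrative choice of C; the
    O(T^3) cost comes from solving the T x T linear systems below.
    """
    K = np.exp(-np.abs(t[:, None] - t[None, :]) / ell)  # Gram matrix C(t_i, t_j)
    A = K + noise_var * np.eye(len(t))                  # T x T system matrix
    mean = K @ np.linalg.solve(A, y)                    # posterior mean, O(T^3)
    cov = K - K @ np.linalg.solve(A, K)                 # posterior covariance
    return mean, cov
```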
Yes. Suppose that the covariance function $C$ is chosen properly, in the sense that $U$ is a Markov (Gaussian) process governed by a stochastic differential equation (SDE)
$$
\begin{split}
\mathrm{d} \overline{U}(t) &= A \, \overline{U}(t) \,\mathrm{d}t + B \, \mathrm{d}W(t), \\
U(t_0) &= U_0 \sim N(0, P_0)
\end{split}\tag{2}
$$
and that $U = H \, \overline{U}$ for some matrix $H$. Then estimating $p(\overline{u}_k \mid y_{1:T})$ from the data is called a smoothing problem in stochastic calculus. This can be solved by using Kalman filters and smoothers in computational time linear in $T$. You can also discretise the SDE above into
$$
\overline{U}_k = F \, \overline{U}_{k-1} + q_{k-1},\tag{3}
$$
for some matrix $F$ and Gaussian random variables $q_{k-1}$, if this kind of state-space form looks more familiar to you.
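For instance, in the scalar Ornstein–Uhlenbeck case $A = -\lambda$, $B = \sigma$ (an illustrative special case; the steps $\Delta t_k = t_k - t_{k-1}$ may be irregular), the discretisation is exact:
$$
F = e^{-\lambda \, \Delta t_k}, \qquad q_{k-1} \sim N\!\left(0, \; \frac{\sigma^2}{2\lambda}\left(1 - e^{-2\lambda \, \Delta t_k}\right)\right).
$$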
By running a Kalman filter and smoother on (2) or (3), you will get exactly the same results as solving the GP model (1) with the standard GP regression equations (you could check Figure 4.1 in [1]).
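Here is a minimal numerical sketch of this equivalence (assuming the scalar Ornstein–Uhlenbeck model above, with illustrative parameter values and simulated data): the marginal means from the Kalman filter plus RTS smoother coincide, to machine precision, with the GP posterior mean under the matching exponential kernel $C(t,t') = \frac{\sigma^2}{2\lambda}\,e^{-\lambda|t-t'|}$.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, sigma, R = 1.0, 1.0, 0.1          # drift, dispersion, noise variance
T = 50
t = np.sort(rng.uniform(0.0, 5.0, T))
y = np.sin(t) + np.sqrt(R) * rng.normal(size=T)
P0 = sigma**2 / (2.0 * lam)            # stationary variance of the OU process

# Forward pass: Kalman filter.
m, P = 0.0, P0
ms, Ps, mp, Pp = [], [], [], []
t_prev = t[0]
for k in range(T):
    dt = t[k] - t_prev
    F = np.exp(-lam * dt)                      # exact transition, as in (3)
    Q = P0 * (1.0 - np.exp(-2.0 * lam * dt))   # exact process noise
    m_pred, P_pred = F * m, F * P * F + Q      # predict
    K = P_pred / (P_pred + R)                  # Kalman gain
    m, P = m_pred + K * (y[k] - m_pred), (1.0 - K) * P_pred  # update
    mp.append(m_pred); Pp.append(P_pred); ms.append(m); Ps.append(P)
    t_prev = t[k]

# Backward pass: RTS smoother.
msm = ms[:]
for k in range(T - 2, -1, -1):
    F = np.exp(-lam * (t[k + 1] - t[k]))
    G = Ps[k] * F / Pp[k + 1]                  # smoother gain
    msm[k] = ms[k] + G * (msm[k + 1] - mp[k + 1])

# Direct (batch) GP regression with the matching exponential kernel.
Kmat = P0 * np.exp(-lam * np.abs(t[:, None] - t[None, :]))
gp_mean = Kmat @ np.linalg.solve(Kmat + R * np.eye(T), y)
print(np.max(np.abs(np.array(msm) - gp_mean)))  # agrees to machine precision
```

Note that the forward and backward passes touch each data point once, so the total cost is linear in $T$.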
So, eventually, what are the key differences between the conventional GPs and state-space GPs?
In conventional GPs, you specify the mean and covariance functions. In state-space GPs, you specify the SDE coefficients instead, and the mean and covariance functions are implicitly defined by these SDE coefficients (a worked example follows after the note below).
In conventional GPs, you usually solve the regression problem jointly at the data points. In state-space GPs, you solve the regression problem marginally at the data points, in computational time linear in the number of data points. This benefits from the Markov property of the state-space model.
Also, please note that not all GPs are Markov (Gaussian) processes; thus, not all GPs can be represented by SDEs!
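To make the first difference above concrete, here is a standard construction from the literature (see, e.g., Hartikainen and Särkkä, 2010): the Matérn-$3/2$ covariance function $C(t,t') = \sigma^2\,(1 + \lambda\,|t-t'|)\,e^{-\lambda\,|t-t'|}$ with $\lambda = \sqrt{3}/\ell$ is obtained from (2) by choosing
$$
A = \begin{bmatrix} 0 & 1 \\ -\lambda^2 & -2\lambda \end{bmatrix}, \qquad
B = \begin{bmatrix} 0 \\ 2\lambda^{3/2}\sigma \end{bmatrix}, \qquad
H = \begin{bmatrix} 1 & 0 \end{bmatrix},
$$
with $\overline{U} = (U, \mathrm{d}U/\mathrm{d}t)$ and stationary initial covariance $P_0 = \operatorname{diag}(\sigma^2, \lambda^2\sigma^2)$.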
For more detailed expositions, you are welcome to check my dissertation. Feel free to ask me more questions on this topic if you have any.
Some heuristic explanations of (2), as per the request from @jbuddy_13 and for those who are not familiar with SDEs:
Solutions of SDEs are stochastic processes, in particular continuous-time Markov processes (also semimartingales, but don't think Markov $\Leftrightarrow$ semimartingale). You can think of SDEs as a means to construct stochastic processes; Itô's theory was really devoted to constructing diffusion processes. Now, since GPs are genuinely stochastic processes, it makes perfect sense to construct GPs via SDEs. It turns out that the solutions of linear SDEs like (2) are indeed (Markov) GPs, and the mean and covariance functions of the GP $\overline{U}$ are governed by certain ODEs (see Eq. 4.9 in [1]).
The SDE in (2) has three main components:
$A \, \overline{U}$ is called the drift of the SDE (I will call $A$ the drift matrix), and $B$ is called the dispersion of the SDE. The drift term models the infinitesimal change of $\overline{U}$ in time, while the dispersion term adds stochastic volatility to that change.
$t\mapsto W(t)$ is a Wiener process (also interchangeably called Brownian motion). This is where the randomness of (2) comes from.
$U_0$ is the initial condition, a Gaussian random variable. This is another source of the randomness in (2).
More intuitively, if we "divide both sides of (2) by $\mathrm{d}t$", we get:
$$
\frac{\mathrm{d} \overline{U}(t)}{\mathrm{d}t} = A \, \overline{U}(t) + w(t),
$$
where $w(t)$ informally stands for $\mathrm{d}W(t)/\mathrm{d}t$. You can see that the SDE is essentially an ODE driven by a white noise process.
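If you want to see a sample path, here is a tiny Euler–Maruyama simulation of (2) in the scalar case (the coefficient values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
dt, n = 0.01, 1000
A, B = -1.0, 0.5                               # illustrative drift and dispersion
U = np.empty(n)
U[0] = rng.normal(0.0, 1.0)                    # U_0 ~ N(0, P_0) with P_0 = 1
for k in range(1, n):
    dW = rng.normal(0.0, np.sqrt(dt))          # Wiener increment, dW ~ N(0, dt)
    U[k] = U[k - 1] + A * U[k - 1] * dt + B * dW   # drift step + dispersion step
```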
Best Answer
Here are a few remarks. Perhaps someone else can fill in the details.
1) Basis representations are always a good idea. It's hard to avoid them if you want to actually do something computational with your covariance function. The basis expansion can give you an approximation to the kernel and something to work with. The hope is that you can find a basis that makes sense for the problem you are trying to solve.
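For instance, one popular computational instance of this idea is a random-feature expansion; the sketch below (feature count and lengthscale are illustrative choices) approximates the squared-exponential kernel by an inner product of cosine features:

```python
import numpy as np

rng = np.random.default_rng(3)
m, ell = 500, 1.0
w = rng.normal(0.0, 1.0 / ell, size=m)     # spectral frequencies of the SE kernel
b = rng.uniform(0.0, 2 * np.pi, size=m)    # random phases

def phi(x):
    """Random Fourier feature map: k(x, x') is approximately phi(x) . phi(x')."""
    return np.sqrt(2.0 / m) * np.cos(w * x + b)

x1, x2 = 0.3, 1.1
print(phi(x1) @ phi(x2))                       # approximate kernel value
print(np.exp(-0.5 * (x1 - x2) ** 2 / ell**2))  # exact kernel value
```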
2) I'm not quite sure what you mean by a configuration in this question. At least one of the basis functions will need to be a function of $\theta$ for the kernel to depend on $\theta$. So yes, the eigenfunctions will vary with the parameter. They would also vary with different parametrizations.
Typically, the number of basis functions will be (countably) infinite, so the number won't vary with the parameter, unless some values caused the kernel to become degenerate.
I also don't understand what you mean in point 2 about the process having the Bayesian prior $w \sim \mathcal{N}(0,\operatorname{diag}[\lambda_1^2, \ldots])$, since $w$ has not been mentioned up to this point. Also, $\operatorname{diag}[\lambda_1^2, \ldots]$ seems to be an infinite-dimensional matrix, and I have a problem with that.
3) Which sets of basis functions form valid kernels? If you're thinking about an eigenbasis, then the functions need to be orthogonal with respect to some measure. There are two requirements. 1) The resulting kernel has to be positive definite, which is fine if the $\lambda_i$ are positive. And 2) the expansion has to converge. This will depend on the $\lambda_i$, which need to decay fast enough to ensure convergence. Convergence will also depend on the domain of the $x$'s.
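As a small sketch of this construction (an orthonormal cosine basis on $[0,1]$ with $\lambda_i = 1/i^2$ is an illustrative choice; the $\lambda_i$ are positive and summable, so both requirements hold):

```python
import numpy as np

n_basis = 50
lam = 1.0 / np.arange(1, n_basis + 1) ** 2   # positive, fast-decaying eigenvalues

def k(x, xp):
    """Truncated eigen-expansion k(x, x') = sum_i lam_i phi_i(x) phi_i(xp),
    with phi_i(x) = sqrt(2) cos(i pi x), orthonormal on [0, 1]."""
    i = np.arange(1, n_basis + 1)
    return np.sum(lam * 2.0 * np.cos(np.pi * i * x) * np.cos(np.pi * i * xp))

x = np.linspace(0.0, 1.0, 30)
K = np.array([[k(a, b) for b in x] for a in x])
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)   # Gram matrix is PSD
```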
If the basis functions are not orthogonal then it will be more difficult to show that a covariance defined from them is positive definite. Obviously, in that case you are not dealing with an eigen-expansion, but with some other way of approximating the function of interest.
However, I don't think people typically start from a bunch of functions and then try to build a covariance kernel from them.
RE: Differentiability of the kernel and differentiability of the basis functions. I don't actually know the answer to this question, but I would offer the following observation.
Functional analysis proceeds by approximating functions (from an infinite-dimensional space) by finite sums of simpler functions. To make this work, everything depends on the type of convergence involved. Typically, if you are working on a compact set with strong convergence properties (uniform convergence or absolute summability) for the functions of interest, then you get the kind of intuitive result you are looking for: the properties of the simple functions pass over to the limit function -- e.g. if the kernel is a differentiable function of a parameter, then the expansion functions must be differentiable functions of the same parameter, and vice versa. Under weaker convergence properties or non-compact domains, this does not happen. In my experience, there's a counter-example to every "reasonable" idea one comes up with.
Note: To forestall possible confusion from readers of this question, note that the Gaussian expansion of point 1 is not an example of the eigen-expansion of point 2.