Here are a few remarks. Perhaps someone else can fill in the details.
1) Basis representations are always a good idea. It's hard to avoid them if you want to actually do something computational with your covariance function. The basis expansion can give you an approximation to the kernel and something to work with. The hope is that you can find a basis that makes sense for the problem you are trying to solve.
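As one concrete illustration of a basis you can actually compute with (my own example, not something from the question or the rest of this answer), here is a minimal NumPy sketch using random Fourier features (Rahimi & Recht) to approximate the squared-exponential kernel; the function name `rff_features` and all parameter values are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_features(X, n_features=500, lengthscale=1.0):
    """Random Fourier features for the squared-exponential kernel; X is (n, d)."""
    n, d = X.shape
    # Frequencies drawn from the SE kernel's spectral density: W ~ N(0, 1/lengthscale^2)
    W = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = rng.uniform(-3, 3, size=(5, 1))
Phi = rff_features(X)
K_approx = Phi @ Phi.T                     # finite-basis approximation to the kernel
K_exact = np.exp(-0.5 * (X - X.T) ** 2)    # exact SE kernel, lengthscale 1
print(np.max(np.abs(K_approx - K_exact)))  # error shrinks as n_features grows
```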
2) I'm not quite sure what you mean by a configuration in this question. At least one of the basis functions will need to be a function of $\theta$ for the kernel to depend on $\theta$. So yes, the eigenfunctions will vary with the parameter. They would also vary with different parametrizations.
Typically, the number of basis functions will be (countably) infinite, so the number won't vary with the parameter, unless some parameter values cause the kernel to become degenerate.
I also don't understand what you mean in point 2 about the process having the Bayesian prior $w \sim \mathcal{N}(0,\operatorname{diag}[\lambda_1^2, \ldots])$, since $w$ has not been mentioned up to this point. Also, $\operatorname{diag}[\lambda_1^2, \ldots]$ seems to be an infinite-dimensional matrix, and I have a problem with that.
3) Which sets of basis functions form valid kernels? If you're thinking about an eigenbasis, then the functions need to be orthogonal with respect to some measure. There are two problems. 1) The resulting kernel has to be positive definite ... and that's OK if the $\lambda_i$ are positive. And 2) the expansion has to converge. This will depend on the $\lambda_i$, which need to decay fast enough to ensure convergence of the expansion. Convergence will also depend on the domain of the $x$'s.
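To make those two conditions concrete, here is a small sketch (my own illustration, not an expansion taken from the question): it builds a kernel from an orthonormal cosine basis on $[0,1]$ with positive, rapidly decaying weights $\lambda_i$, and checks that the resulting Gram matrix is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical truncated expansion: k(x, x') = sum_i lam_i * phi_i(x) * phi_i(x'),
# with a cosine basis on [0, 1] and positive weights decaying like 1/i^2.
M = 50
lam = 1.0 / np.arange(1, M + 1) ** 2            # positive and summable -> convergence

def phi(x):
    """Evaluate the M basis functions at the points x (shape (n,))."""
    i = np.arange(1, M + 1)
    return np.sqrt(2.0) * np.cos(np.pi * np.outer(x, i))   # shape (n, M)

def k(x, xp):
    return phi(x) @ (lam[:, None] * phi(xp).T)

x = rng.uniform(0, 1, size=20)
K = k(x, x)
print(np.linalg.eigvalsh(K).min())   # >= 0 up to round-off: the Gram matrix is PSD
```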
If the basis functions are not orthogonal then it will be more difficult to show that a covariance defined from them is positive definite. Obviously, in that case you are not dealing with an eigen-expansion, but with some other way of approximating the function of interest.
However, I don't think people typically start from a bunch of functions and then try to build a covariance kernel from them.
RE: Differentiability of the kernel and differentiability of the basis functions. I don't actually know the answer to this question, but I would offer the following observation.
Functional analysis proceeds by approximating functions (from an infinite dimensional space) by finite sums of simpler functions. To make this work, everything depends on the type of convergence involved. Typically, if you are working on a compact set with strong convergence properties (uniform convergence or absolute summability) for the functions of interest, then you get the kind of intuitive result that you are looking for: the properties of the simple functions pass over to the limit function -- e.g. if the kernel is a differentiable function of a parameter, then the expansion functions must be differentiable functions of the same parameter, and vice-versa. Under weaker convergence properties or on non-compact domains, this need not happen. In my experience, there's a counter-example to every "reasonable" idea one comes up with.
Note: to forestall possible confusion, the Gaussian expansion of point 1 is not an example of the eigen-expansion of point 2.
A very useful project to grow intuition around Gaussian processes!
[I'll try to keep this intuitive rather than mathematical.]
The two problems are very much related. Firstly, the kernel you are using does not include any i.i.d. Gaussian noise term (as you noted), yet the data itself does have noise. That is not a good idea: you are telling your function both to be very smooth with a long length scale and to pass exactly through each point, and this will cause trouble.
The covariance matrix will have non-zero eigenvalues, but some of them will be very close to zero, so the finite precision of your computer starts to have an effect. This is known as numerical instability. There are a number of ways to get around it:
1) Add noise to the observations; that is to say, add the $I\sigma_n^2$ term to the covariance. The data clearly has noise, so this is actually a good thing!
2) Perform a low-rank approximation to your GP; that is to say, do an eigenvalue/eigenvector decomposition and clip all negligible eigenvalues. The covariance matrix is now low rank and you can easily invert the non-zero eigenvalues, giving you a pseudo-inverse of the covariance matrix. Be warned, though, that your uncertainty will be essentially zero, as you will have only a few degrees of freedom and clearly many, many points.
I believe the GPML solution you mentioned is another reasonable approach, but hopefully you get the idea. The addition of $I\epsilon$ is very popular and essentially adds a small bit of noise so that the matrix becomes well conditioned. This is how the GPs in GPy are implemented, for example.
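Here is a minimal NumPy sketch of both fixes (my own illustration with made-up numbers, not code taken from GPy or GPML): it builds an ill-conditioned squared-exponential covariance, adds $\sigma_n^2 I$ jitter, and alternatively forms a low-rank pseudo-inverse by clipping negligible eigenvalues.

```python
import numpy as np

# A nearly singular SE covariance: long lengthscale, closely spaced inputs.
x = np.linspace(0, 1, 50)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 2.0 ** 2)
print(np.linalg.cond(K))                      # huge -> numerically unstable

# Fix 1: add observation noise sigma_n^2 * I (jitter).
sigma_n2 = 1e-2
K_noisy = K + sigma_n2 * np.eye(len(x))
print(np.linalg.cond(K_noisy))                # far smaller, safe to invert

# Fix 2: low-rank pseudo-inverse by clipping negligible eigenvalues.
vals, vecs = np.linalg.eigh(K)
keep = vals > 1e-10 * vals.max()
K_pinv = (vecs[:, keep] / vals[keep]) @ vecs[:, keep].T
```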
Best Answer
If the locations of the knots (breakpoints) were constants, the joint distribution of the process values at $x$ and $x+\delta$ would depend on whether there is a breakpoint in the interval, not only on $\delta$ as stationarity requires. Therefore, I assume the question intends the knots to be random as well. In that case, obtaining piecewise linear sample paths with positive probability is not possible, as shown below.
I consider the one-dimensional case and denote the value of the Gaussian process at input $x_k$ by $f(x_k)$.
If the sample paths are piecewise linear with random breakpoint locations, we must have three distinct inputs $x_1,x_2,x_3$ such that the event 'the three points $(x_1,f(x_1)),(x_2,f(x_2)),(x_3,f(x_3))$ are on the same line' has a probability that is nonzero and below one. The three points are on the same line precisely when
\begin{equation} \frac{f(x_2) - f(x_1)}{x_2 - x_1} = \frac{f(x_3) - f(x_2)}{x_3 - x_2} \end{equation}
or equivalently
\begin{equation} (x_2 - x_3)\,f(x_1) + (x_3 - x_1)\, f(x_2) + (x_1 - x_2)\,f(x_3) = 0. \end{equation}
That is, the probability of the points being on the same line is $\Pr(Z=0)$, where
\begin{equation} Z := (x_2 - x_3)\,f(x_1) + (x_3 - x_1)\, f(x_2) + (x_1 - x_2)\,f(x_3). \end{equation}
Note that $Z$ is a linear combination of $f(x_1),f(x_2),f(x_3)$. Since $f$ is a Gaussian process, the joint distribution of $(f(x_1),f(x_2),f(x_3))$ is multivariate Gaussian, and therefore the distribution of $Z$ is Gaussian. However, for a Gaussian $Z$, $\Pr(Z=0)$ is either $0$ or $1$, a contradiction.
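To see the argument numerically, here is a tiny sketch (my own illustration; the squared-exponential kernel, the particular inputs, and the coefficient vector $a = (x_2 - x_3,\, x_3 - x_1,\, x_1 - x_2)$ are choices made for the example). Since $Z = a^\top (f(x_1), f(x_2), f(x_3))^\top$ for a zero-mean GP, $\mathrm{Var}(Z) = a^\top K a$; here it comes out strictly positive, so $\Pr(Z=0) = 0$.

```python
import numpy as np

# Z = a^T f with a = (x2 - x3, x3 - x1, x1 - x2), so Z ~ N(0, a^T K a) for a zero-mean GP.
x = np.array([0.0, 1.0, 2.5])
a = np.array([x[1] - x[2], x[2] - x[0], x[0] - x[1]])
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # SE kernel covariance at the three inputs
var_Z = a @ K @ a
print(var_Z)   # strictly positive here, hence Pr(Z = 0) = 0: no exact collinearity
```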