Notation / setting
We are considering a GP regression model:
\begin{equation}
y_i = f(x_i) + \epsilon_i
\end{equation}
where $y_i\in \mathbb{R}$, $x_i \in \mathbb{R}^d$, and $f$ is a Gaussian process (whose realizations are functions $f:\mathbb{R}^d\rightarrow \mathbb{R}$),
\begin{equation}
f \sim \mathrm{GP}(m(x_i), \kappa(x_i,x_j)).
\end{equation}
$n$ datapoints $(y_1,x_1), (y_2,x_2), (y_3,x_3),\ldots, (y_n,x_n)$ are given. (I use $\kappa$ to distinguish the covariance function from the matrices $K(\cdot,\cdot)$ that contain values of $\kappa$ evaluated at certain points; the question denotes both by $K$.)
How to handle $d$-dimensional inputs
The question covers computing the posterior predictive distribution for a test point (or $p$ test points) in the case $d=1$ and asks how to extend this to the general case $d=2,3,\ldots$.
Answer: nothing changes; the formulas from the one-dimensional case work in exactly the same way here. Note that $m$ is then a function from $\mathbb{R}^d$ to $\mathbb{R}$ and $\kappa$ a function from $\mathbb{R}^d \times \mathbb{R}^d$ to $\mathbb{R}$.
So, for example, the matrix denoted by $K(x,x)$ in the question is an $n\times n$ matrix with $K(x,x)_{i,j} = \kappa(x_i, x_j)$ ($x_i$ and $x_j$ are $d$-dimensional, but since $\kappa$ maps two $d$-dimensional vectors to a scalar, $\kappa(x_i, x_j)$ is a scalar). Similarly for $K(x,x')$ and $K(x',x')$, where $x'$ are the test points.
Thus, the dimensions of the matrices in the predictive covariance equation are $(p\times p) - (p \times n)\,(n \times n)\,(n \times p)$, independent of whether the elements of the matrices are obtained by evaluating a function $\kappa(\cdot,\cdot)$ whose arguments are $1$-dimensional or a function $\kappa(\cdot, \cdot)$ whose arguments are $d$-dimensional. In fact, the inputs could even be in some space other than $\mathbb{R}^d$ (such as if we have a categorical predictor), as long as a positive-definite covariance function can be defined.
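To make the "nothing changes" point concrete, here is a minimal NumPy sketch (mine, not code from the question), assuming a zero prior mean and an SE kernel; the same two functions handle any input dimension $d$, and only the shape of the input arrays changes.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0, sigma2=1.0):
    """Squared-exponential kernel matrix between the rows of A (m x d) and B (p x d)."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return sigma2 * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X, y, X_star, lengthscale=1.0, sigma2=1.0, noise=1e-2):
    """Posterior predictive mean and covariance at test inputs X_star (p x d)."""
    K    = se_kernel(X, X, lengthscale, sigma2)            # n x n
    K_s  = se_kernel(X, X_star, lengthscale, sigma2)       # n x p
    K_ss = se_kernel(X_star, X_star, lengthscale, sigma2)  # p x p
    A = np.linalg.solve(K + noise * np.eye(len(X)), K_s)   # (K + sigma_n^2 I)^{-1} K_s
    mean = A.T @ y                                         # zero prior mean assumed
    cov  = K_ss - K_s.T @ A
    return mean, cov

# Works identically for d = 1, 2, 3, ...; only the shape of the inputs changes.
rng = np.random.default_rng(0)
X      = rng.uniform(size=(50, 3))                         # n = 50 points, d = 3
y      = np.sin(X.sum(axis=1)) + 0.1 * rng.normal(size=50)
X_star = rng.uniform(size=(5, 3))                          # p = 5 test points
mu, Sigma = gp_posterior(X, y, X_star)
print(mu.shape, Sigma.shape)                               # (5,) (5, 5)
```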
An extra remark about the SE kernel appearing in the question
The question mentions the SE kernel
\begin{equation}
k_{f}(x_{i},x_{k}) = \sigma^{2}\exp\!\Big(-\frac{1}{2\ell^{2}}\sum_{j=1}^{q}(x_{i,j}-x_{k,j})^{2}\Big)
\end{equation}
Note that this is already a function from $\mathbb{R}^q \times \mathbb{R}^q$ to $\mathbb{R}$ (with scalar inputs there would be no "$x_{i,j}$ and $x_{k,j}$" for different values of $j$), and $q$ should be $d$ if $d$ is the dimension of the inputs.
Optionally, the length scale $\ell$ could be made different for each input dimension, as $\ell_{j}$, in which case the factor $\frac{1}{2\ell_{j}^{2}}$ is instead placed inside the summation (the so-called ARD form of the kernel).
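As a small illustration (again a sketch, not code from the question), the per-dimension length-scale version just rescales each input dimension by its own $\ell_j$ before computing squared distances:

```python
import numpy as np

def se_kernel_ard(A, B, lengthscales, sigma2=1.0):
    """SE kernel with one length scale per input dimension (the ARD form).

    A: (m x d) array, B: (p x d) array, lengthscales: length-d array.
    """
    ell = np.asarray(lengthscales, dtype=float)   # shape (d,)
    As, Bs = A / ell, B / ell                     # rescale each dimension by its own ell_j
    sq_dists = (np.sum(As**2, axis=1)[:, None]
                + np.sum(Bs**2, axis=1)[None, :]
                - 2.0 * As @ Bs.T)
    return sigma2 * np.exp(-0.5 * sq_dists)
```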
A very useful project to grow intuition around Gaussian processes!
[I'll try to keep this intuitive rather than mathematical.]
The two problems are very much related. Firstly, the kernel you are using does not include any i.i.d. Gaussian noise term (as you noted), yet the data itself does have noise. That is not a good idea: you are telling your function both to be very smooth, with a long length scale, and to pass exactly through each point, and this will cause an issue.
The covariance matrix will have non-zero eigenvalues, but some of them will be very, very close to zero, and the finite computational precision of your computer starts to have an effect. This is known as numerical instability. There are a number of ways to get around it:
1) Add noise to the observations; that is to say, add $\sigma_n^2 I$ to the covariance matrix. The data clearly has noise, so this is actually a good thing!
2) Perform a low-rank approximation to your GP; that is to say, do an eigenvalue/eigenvector decomposition and clip all negligible eigenvalues. The covariance matrix is now low rank, and you can easily invert the non-zero eigenvalues, giving you a pseudo-inverse of the covariance matrix (see the sketch just below this list). Be warned, though, that your uncertainty will be essentially zero, as you will have only a few degrees of freedom and clearly many, many points.
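A minimal sketch of option 2 in plain NumPy (the clipping threshold is illustrative, not taken from any particular library):

```python
import numpy as np

def clipped_pseudo_inverse(K, rel_tol=1e-10):
    """Pseudo-inverse of a symmetric PSD kernel matrix K by clipping tiny eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(K)          # K = V diag(eigvals) V^T
    keep = eigvals > rel_tol * eigvals.max()      # drop numerically negligible eigenvalues
    inv_vals = np.zeros_like(eigvals)
    inv_vals[keep] = 1.0 / eigvals[keep]
    return eigvecs @ np.diag(inv_vals) @ eigvecs.T  # low-rank pseudo-inverse of K
```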
I believe the GPML solution you mentioned is another reasonable approach, but hopefully you get the idea. The addition of $I\epsilon$ is very popular and essentially adds a small bit of noise until the matrix becomes well conditioned. This is how the GPs in GPy are implemented, for example.
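A rough sketch of that "add $I\epsilon$ until the matrix is well conditioned" idea (the constants and loop structure here are illustrative, not GPy's actual code):

```python
import numpy as np

def cholesky_with_jitter(K, max_tries=6):
    """Try a Cholesky factorization, adding increasing diagonal jitter on failure."""
    jitter = 1e-8 * np.mean(np.diag(K))
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K + jitter * np.eye(K.shape[0])), jitter
        except np.linalg.LinAlgError:
            jitter *= 10.0   # escalate the jitter and try again
    raise np.linalg.LinAlgError("matrix not positive definite, even with jitter")
```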
Best Answer
Found the answer after much searching. Ultimately, it comes down to the Loewner ordering of $K_{**}$ and $K_{*f}(K_{ff} + \sigma^2I)^{-1}K_{f*}$, the latter of which (the product of all those kernel matrices) I will denote $K_{*fff*}$. Essentially, the question is: is the matrix resulting from the subtraction $K_{**} - K_{*fff*}$ PSD? This will only be true when $K_{**} \succeq K_{*fff*}$.
We know that $K_{**}$ is PSD if it came from a valid kernel, which for our purposes we say it has. $K_{*fff*}$ is also PSD: since $K_{f*} = K_{*f}^{\mathsf T}$, it has the form $A B A^{\mathsf T}$ with $B = (K_{ff} + \sigma^2I)^{-1}$ symmetric PSD, and any matrix of that form is PSD (as long as the underlying matrices came from valid kernels, i.e. kernels that satisfy Mercer's theorem). Therefore, we have two PSD matrices undergoing a subtraction operation; when will this result be PSD?
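A one-line check of that claim (standard linear algebra, not from the original post): for any vector $x$, writing $z = K_{f*}x$,
\begin{equation}
x^{\mathsf T} K_{*f}\,(K_{ff}+\sigma^2 I)^{-1}\,K_{f*}\,x = z^{\mathsf T}(K_{ff}+\sigma^2 I)^{-1} z \ge 0,
\end{equation}
since the inverse of a symmetric positive-definite matrix is itself positive definite.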
There are three outcomes that can occur from this subtraction: the difference can be PSD, negative semi-definite, or indefinite (negative for some vectors $x$ and positive for others).
If a matrix $A$ is PSD, then $x^{\mathsf T}Ax \ge 0$ for all real vectors $x$. We can use variables to represent the vector $x$, so for the 2D case we have something like
$\begin{bmatrix} x & y \end{bmatrix} \begin{bmatrix} a & b\\ b & c \end{bmatrix} \begin{bmatrix} x\\ y \end{bmatrix} $,
which we can expand to the quadratic $ax^2 + bxy + bxy + cy^2 = ax^2 + 2bxy + cy^2$. The squared terms are non-negative (provided $a, c \ge 0$), but the middle $2bxy$ term can be negative; what determines whether matrix $A$ is PSD is whether the squared terms always outweigh whatever negative contribution the middle term makes (for a symmetric $2\times 2$ matrix this amounts to $a \ge 0$, $c \ge 0$, and $ac - b^2 \ge 0$).
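As a concrete example (mine, not from the original post), take $a = c = 1$ and $b = 2$ and evaluate the quadratic at $x = 1$, $y = -1$:
\begin{equation}
\begin{bmatrix} 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 2\\ 2 & 1 \end{bmatrix}
\begin{bmatrix} 1\\ -1 \end{bmatrix}
= 1 - 2 - 2 + 1 = -2 < 0,
\end{equation}
so this matrix is not PSD even though both squared terms have positive coefficients (indeed $ac - b^2 = 1 - 4 < 0$).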
Using these ideas, we can determine whether $K_{**} - K_{*fff*}$ will be PSD by using these quadratics: $x^{\mathsf T}K_{**}x - x^{\mathsf T}K_{*fff*}x \ge 0$, or equivalently $x^{\mathsf T}K_{**}x \ge x^{\mathsf T}K_{*fff*}x$, for all real vectors $x$. That means $K_{**}$'s quadratic in the 2D case must always be greater than or equal to the $K_{*fff*}$ quadratic. We can visualize this in 2D with this image.
Note that in this plot, the tan bowl-shaped surface in the center (i.e. the smallest one in terms of diameter) is the $K_{**}$ quadratic. The outer red one is the $K_{*fff*}$ quadratic, and the teal bowl is the difference between them (the covariance quadratic). Note that this difference corresponds to a valid covariance matrix because it is PSD. If the $K_{**}$ quadratic is not everywhere at least as large as the $K_{*fff*}$ quadratic, then the next image shows what happens after the subtraction.
Here $K_{**}$ is not everywhere larger, and we see that the resulting covariance quadratic dips below zero in those areas where $K_{**}$ is not greater than $K_{*fff*}$. This occurred when the length scale used to compute $K_{**}$ was not the same as the length scales used to compute the resulting $K_{*fff*}$ matrix. We can see in this image that the eigenvalues of the $K_{**}$ matrix are not greater than the eigenvalues of the $K_{*fff*}$ matrix in each of the eigenvector directions.
This begs the question: can we use different length scales (or potentially different kernels, though I haven't tried that) for $K_{**}$ and $K_{*fff*}$ and still get a PSD matrix? The answer is yes, as long as $K_{**}$'s quadratic is everywhere greater than or equal to $K_{*fff*}$'s quadratic. Using a variation of the RBF equation
$K(x,x') = \sigma_f e^{-l \lVert x-x'\rVert}$
Note that the usual $\frac{1}{l^2}$ is just being absorbed into the $l$ term; the two parameterizations are equivalent. Using this RBF kernel function, I can give the $K_{ff}$ and $K_{*f}$ kernels the same length scale of 2.338620173077578 and the $K_{**}$ kernel a length scale of 3.3949857315049945, and the resulting covariance is a valid PSD matrix, shown in the image below. This is an example of a valid covariance matrix that came from RBF kernels with differing length scales. The reason this can be useful is the potential need to describe the similarity between points differently depending on whether you are looking at the test set or the training set. I am still unaware of how to choose the length scales purposefully; so far I have only been able to find valid PSD matrices by a random search over length scales. Future work will be devoted to finding a way of choosing differing length scales while knowing the resulting covariance matrix will be valid.
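For anyone who wants to reproduce this kind of experiment, here is a small sketch of the check I mean (my code, not the original poster's; the data is random, and whether the difference comes out PSD depends on the particular points, noise level, and length scales):

```python
import numpy as np

def exp_kernel(A, B, l, sigma_f=1.0):
    """K(x, x') = sigma_f * exp(-l * ||x - x'||), the kernel variant used above."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return sigma_f * np.exp(-l * dists)

rng = np.random.default_rng(0)
X  = rng.uniform(size=(30, 1))        # training inputs
Xs = rng.uniform(size=(10, 1))        # test inputs
sigma2_n = 1e-6                       # observation noise

l_train, l_test = 2.338620173077578, 3.3949857315049945
K_ff = exp_kernel(X, X, l_train)
K_sf = exp_kernel(Xs, X, l_train)
K_ss = exp_kernel(Xs, Xs, l_test)     # different length scale for the test block

K_sffs = K_sf @ np.linalg.solve(K_ff + sigma2_n * np.eye(len(X)), K_sf.T)
diff = K_ss - K_sffs
print("smallest eigenvalue:", np.linalg.eigvalsh(diff).min())  # >= 0 (up to round-off) <=> PSD
```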