It would be a good idea to get the optimisation code to print out the hyper-parameters each time it performs a function evaluation. Usually you can work out what is going wrong once you know what the model decides about the hyper-parameter values.
I suspect what is happening is that the model has decided that an essentially linear classifier would be best and has set $\sigma_n$ to zero, as the noise component is not seen as necessary (which is a shame, as it adds a ridge to the covariance matrix that helps ensure it is positive definite). In that case only the SEiso part is used, so $\sigma_f$ will probably be much larger than $\sigma_n$; however, to make a linear classifier it will try to make the length-scale $l$ as large as possible, which seems to end up causing numerical problems when evaluating the term inside the exponential. I'm a pretty heavy user of GPML and have seen this a fair bit.

One solution is to limit the magnitudes of the logarithms of the hyper-parameters during the search (which is equivalent to placing a hyper-prior on the hyper-parameters); this tends to prevent the problem. If you print out the values of the hyper-parameters, the last ones printed before it goes "bang" will give you a good idea of where to place the limits. Doing so tends not to affect performance very much: the generalisation error in such regions of hyper-parameter space tends to be fairly flat, which causes gradient descent methods to take large steps that put you far enough from the origin to run into numerical accuracy problems.
In short: whenever you run into numerical issues in model selection, print out the hyper-parameter values at each step.
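Since GPML is a MATLAB toolbox, here is a minimal NumPy/SciPy sketch of the same idea (not the original code): print the hyper-parameters at every function evaluation and add a soft penalty on the log hyper-parameters, which acts as a crude hyper-prior. A GP *regression* marginal likelihood with an SE kernel is used purely for illustration, and the data, limits and penalty weight are all made up.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 1-D regression data standing in for the original problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)

def nlml(log_hyp):
    """Negative log marginal likelihood of a GP with an SE kernel.
    log_hyp = [log l, log sigma_f, log sigma_n]."""
    l, sf, sn = np.exp(log_hyp)
    d = X[:, None] - X[None, :]
    K = sf**2 * np.exp(-0.5 * (d / l)**2) + sn**2 * np.eye(len(X))
    K += 1e-8 * np.eye(len(X))          # small ridge for numerical stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

def objective(log_hyp, limit=5.0, weight=10.0):
    # Print the hyper-parameters at every evaluation so the values just before a
    # crash are visible, and add a soft quadratic penalty (a crude hyper-prior)
    # discouraging |log hyper-parameter| > limit.
    print("log hyper-parameters:", np.round(log_hyp, 3))
    penalty = weight * np.sum(np.maximum(np.abs(log_hyp) - limit, 0.0) ** 2)
    return nlml(log_hyp) + penalty

result = minimize(objective, x0=np.zeros(3), method="L-BFGS-B")
print("optimised log hyper-parameters:", result.x)
```

The printed values play the same role as the suggested print-outs in GPML: they show where in hyper-parameter space the optimiser was heading when things blow up, which tells you where to put the limits.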
Recall the expression for the posterior variance of a Gaussian process: $\operatorname{var}_{n}\left(f\left(\mathbf{x}_{*}\right)\right)=k\left(\mathbf{x}_{*}, \mathbf{x}_{*}\right)-\mathbf{k}_{*}^{\top}\left(K_n+\sigma^{2} I_n\right)^{-1} \mathbf{k}_{*}$. We want to show that $\operatorname{var}_{n-1}(f(x_*)) \geq \operatorname{var}_n(f(x_*))$. The first term is the same for both $\operatorname{var}_n$ and $\operatorname{var}_{n-1}$, so we just need to break down the second term in a way that lets us compare it between $\operatorname{var}_n$ and $\operatorname{var}_{n-1}$.
Following the hint to look at section A.3 in the book, we see that for the partitioned matrix $A$ and its inverse $A^{-1}$,
\begin{align}
A = \begin{pmatrix} P & Q \\ R & S \end{pmatrix}, \qquad A^{-1} = \begin{pmatrix} \tilde P & \tilde Q \\ \tilde R & \tilde S \end{pmatrix},
\end{align}
\begin{align}
\tilde P & = P^{-1} + P^{-1}QMRP^{-1},\\
\tilde Q & = -P^{-1}QM,\\
\tilde R & = -MRP^{-1},\\
\tilde S & = M,\\
\text{where}\ M & = (S - RP^{-1}Q)^{-1}.
\end{align}
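As a quick sanity check (not part of the original answer), these partitioned-inverse identities can be verified numerically on a random positive-definite matrix:

```python
import numpy as np

# Numerical sanity check of the partitioned-inverse identities above,
# using a random symmetric positive-definite matrix (illustration only).
rng = np.random.default_rng(1)
n, m = 4, 2
B = rng.standard_normal((n + m, n + m))
A = B @ B.T + (n + m) * np.eye(n + m)        # well-conditioned s.p.d. matrix

P, Q = A[:n, :n], A[:n, n:]
R, S = A[n:, :n], A[n:, n:]

Pinv = np.linalg.inv(P)
M = np.linalg.inv(S - R @ Pinv @ Q)          # M = (S - R P^{-1} Q)^{-1}

Ainv = np.linalg.inv(A)
assert np.allclose(Ainv[:n, :n], Pinv + Pinv @ Q @ M @ R @ Pinv)   # tilde P
assert np.allclose(Ainv[:n, n:], -Pinv @ Q @ M)                    # tilde Q
assert np.allclose(Ainv[n:, :n], -M @ R @ Pinv)                    # tilde R
assert np.allclose(Ainv[n:, n:], M)                                # tilde S
print("partitioned-inverse identities verified")
```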
We can decompose the Gram matrix with added noise $K_n + \sigma^2I_n$ as follows:
\begin{align}
K_n + \sigma^2 I_n =
\begin{pmatrix}
K_{n-1} + \sigma^2 I_{n-1} & k_{n-1}(x')\\k_{n-1}(x')^\top & k(x',x') + \sigma^2
\end{pmatrix}
\end{align}
where $x'$ is the $n$th point sampled.
Writing out the decomposed inverse, we get
\begin{align}
(K_n + \sigma^2 I_n)^{-1} = \begin{pmatrix}\kappa + \kappa k_{n-1}(x')Mk_{n-1}(x')^\top\kappa & -\kappa k_{n-1}(x')M \\ -Mk_{n-1}(x')^\top \kappa & M \end{pmatrix},
\end{align}
with
\begin{align}
M & = (k(x', x') + \sigma^2 - k_{n-1}(x')^\top(K_{n-1} + \sigma^2I_{n-1})^{-1}k_{n-1}(x'))^{-1}, \\
\kappa & = (K_{n-1} + \sigma^2I_{n-1})^{-1}
\end{align}
Finally, we can compute the term $k_*^\top(K_n + \sigma^2I_n)^{-1}k_*$ by multiplying through:
\begin{align}
k_*^\top(K_n + \sigma^2I_n)^{-1}k_* & = \begin{pmatrix}k_{n-1}(x^*)\\k'(x^*)\end{pmatrix}^\top\begin{pmatrix}\kappa + \kappa k_{n-1}(x')Mk_{n-1}(x')^\top\kappa & -\kappa k_{n-1}(x')M \\ -Mk_{n-1}(x')^\top\kappa & M \end{pmatrix}\begin{pmatrix}k_{n-1}(x^*)\\k'(x^*)\end{pmatrix}\\
& = \begin{pmatrix} k_{n-1}^\top(x^*)\kappa + k_{n-1}^\top(x^*)\kappa k_{n-1}(x')Mk_{n-1}(x')^\top\kappa - k'(x^*)Mk_{n-1}(x')^\top \kappa \\ -k_{n-1}^\top(x^*)\kappa k_{n-1}(x')M + k'(x^*)M \end{pmatrix} ^\top
\begin{pmatrix}k_{n-1}(x^*)\\k'(x^*)\end{pmatrix}
\end{align}
where $\kappa = (K_{n-1} + \sigma^2I_{n-1})^{-1}$ and $k'(x^*) = k(x', x^*)$.
We end up with
\begin{align}
k_*^\top(K_n + \sigma^2I_n)^{-1}k_* & = k_{n-1}^\top(x^*)\kappa k_{n-1}(x^*) + k_{n-1}^\top(x^*)\kappa k_{n-1}(x')Mk_{n-1}(x')^\top\kappa k_{n-1}(x^*)\\
& - k'(x^*)Mk_{n-1}(x')^\top\kappa k_{n-1}(x^*) - k_{n-1}(x^*)^\top\kappa k_{n-1}(x')Mk'(x^*) + k'(x^*)Mk'(x^*).
\end{align}
Note that the first term is simply the corresponding term from $\text{var}_{n-1}$, so we just need to show that the sum of the remaining terms is non-negative in order to show that this term is larger for $n$ than for $n-1$ (and hence the variance is smaller). Note that $M$ is the reciprocal of the posterior variance at $x'$ given the first $n-1$ points plus $\sigma^2$, so it is positive and we can factor it out of the remaining terms.
We are left with
\begin{align}
\alpha^2 - 2k'(x^*)\alpha + k'(x^*)^2,
\end{align}
where $\alpha = k_{n-1}^\top(x^*)\kappa k_{n-1}(x')$ is a scalar. Restoring the factor of $M$ and completing the square, we have
\begin{align}
k_*^\top(K_n + \sigma^2I_n)^{-1}k_* & = k_{n-1}^\top(x^*)\kappa k_{n-1}(x^*) + M(\alpha - k'(x^*))^2\\
& = k_{n-1}(x^*)^\top(K_{n-1} + \sigma^2I_{n-1})^{-1}k_{n-1}(x^*) + M(\alpha - k'(x^*))^2\\
& \geq k_{n-1}(x^*)^\top(K_{n-1} + \sigma^2I_{n-1})^{-1}k_{n-1}(x^*).
\end{align}
So the variance after $n$ points is no larger than the variance after $n-1$ points, with equality achieved exactly when $\alpha - k'(x^*) = k_{n-1}^\top(x^*)(K_{n-1} + \sigma^2I_{n-1})^{-1} k_{n-1}(x') - k(x', x^*)$ is zero (since $M > 0$).
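For a concrete check of the result, here is a small numerical sketch (the SE kernel, noise level and data are arbitrary choices for illustration) confirming that the posterior variance at a fixed test point never increases as observations are added:

```python
import numpy as np

# Numerical check that the posterior variance at a fixed test point x_*
# never increases as observations are added (SE kernel, sigma chosen arbitrarily).
rng = np.random.default_rng(0)
sigma = 0.2

def se(A, B):
    """Squared-exponential kernel with unit length-scale and signal variance."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)

def posterior_var(X, x_star):
    K = se(X, X) + sigma ** 2 * np.eye(len(X))
    k_star = se(X, np.array([x_star]))[:, 0]
    return se(np.array([x_star]), np.array([x_star]))[0, 0] - k_star @ np.linalg.solve(K, k_star)

X = rng.uniform(-3, 3, size=10)
x_star = 0.5
for n in range(2, len(X) + 1):
    # var_n <= var_{n-1} up to floating-point tolerance
    assert posterior_var(X[:n], x_star) <= posterior_var(X[:n - 1], x_star) + 1e-12
print("posterior variance is non-increasing at x_* =", x_star)
```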
Best Answer
The noise parameter $\sigma^2$ is the parameter of the likelihood function, a.k.a. the noise model.
The one with $+\sigma^2$ is the variance of $y$ (the observation). The one without is the variance of $f$ (the latent variable, i.e. the observation minus the noise). So they differ by $\sigma^2$, which is the same for all values of the input variable $x$.
The formulas look right to me. As you can see, the variance of $f$ (the noiseless latent variable) also depends on the noise parameter. That makes sense too: your estimate of the noise affects the uncertainty estimate (i.e. the variance) of the noiseless latent variable.
To avoid confusion, I would refer to them as $\mathrm{var}(y)$ and $\mathrm{var}(f)$.
One more thing: the two expressions you denoted by $\Sigma$ are scalars, not matrices. The covariance matrix is $K$, not $\Sigma$. Here $\Sigma$ is a variance, not a covariance matrix, since it describes a single one-dimensional variable (either $y$ or $f$).
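To make the $\mathrm{var}(y)$ versus $\mathrm{var}(f)$ distinction concrete, here is a tiny numerical sketch (toy data and an SE kernel with unit hyper-parameters, all values made up) showing the two predictive variances differ by exactly $\sigma^2$:

```python
import numpy as np

# Illustration that var(y_*) and var(f_*) differ by exactly sigma^2.
rng = np.random.default_rng(0)
sigma = 0.3
X = rng.uniform(-3, 3, size=15)

def se(A, B):
    # SE kernel with unit length-scale and signal variance
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)

x_star = np.array([0.7])
K = se(X, X) + sigma ** 2 * np.eye(len(X))
k_star = se(X, x_star)[:, 0]

var_f = se(x_star, x_star)[0, 0] - k_star @ np.linalg.solve(K, k_star)  # latent f
var_y = var_f + sigma ** 2                                              # observation y
print(var_f, var_y, var_y - var_f)   # the last value equals sigma**2
```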