What is understood by variance in several dimensions ("total variance") is simply the sum of the variances in each dimension. Mathematically, it is the trace of the covariance matrix, i.e. the sum of its diagonal elements. This definition has various nice properties, e.g. the trace is invariant under orthogonal linear transformations, which means that if you rotate your coordinate axes, the total variance stays the same.
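Here is a minimal numerical sketch of that invariance (using numpy; the data and the rotation are just illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.diag([3.0, 2.0, 0.5])  # some sample data in 3 dimensions

Sigma = np.cov(X, rowvar=False)                 # covariance matrix of the data
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # a random orthogonal matrix (rotation/reflection)
Sigma_rot = np.cov(X @ Q, rowvar=False)         # covariance after rotating the coordinate axes

print(np.trace(Sigma), np.trace(Sigma_rot))     # the two traces agree up to floating-point error
```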
What is proved in Bishop's book (section 12.1.1) is that the leading eigenvector of the covariance matrix gives the direction of maximal variance. The second eigenvector gives the direction of maximal variance under the additional constraint that it be orthogonal to the first eigenvector, etc. (I believe this constitutes Exercise 12.1.) If the goal is to maximize the total variance in a 2D subspace, then this procedure is a greedy maximization: first choose one axis that maximizes the variance, then another one.
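As a quick sanity check of the first step (a sketch, not part of Bishop's proof; the data are again made up), the variance along the leading eigenvector equals the top eigenvalue, and no other unit direction does better:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 4))  # correlated sample data
Sigma = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
w1 = eigvecs[:, -1]                             # leading eigenvector
var_along = lambda w: float(w @ Sigma @ w)      # variance of the projection onto unit vector w

random_dirs = rng.normal(size=(1000, 4))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(var_along(w) for w in random_dirs)

print(var_along(w1), eigvals[-1])               # equal: variance along w1 is the top eigenvalue
print(best_random <= var_along(w1) + 1e-12)     # no random direction beats the leading eigenvector
```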
Your question is: why does this greedy procedure obtain a global maximum?
Here is a nice argument that @whuber suggested in the comments. Let us first align the coordinate system with the PCA axes. The covariance matrix becomes diagonal: $\boldsymbol{\Sigma} = \mathrm{diag}(\lambda_i)$. For simplicity we will consider the same 2D case, i.e. what is the plane with maximal total variance? We want to prove that it is the plane given by the first two basis vectors (with total variance $\lambda_1+\lambda_2$).
Consider a plane spanned by two orthogonal unit vectors $\mathbf{u}$ and $\mathbf{v}$. The total variance in this plane is $$\mathbf{u}^\top\boldsymbol{\Sigma}\mathbf{u} + \mathbf{v}^\top\boldsymbol{\Sigma}\mathbf{v} = \sum \lambda_i u_i^2 + \sum \lambda_i v_i^2 = \sum \lambda_i (u_i^2+v_i^2).$$ So it is a linear combination of the eigenvalues $\lambda_i$ with coefficients that are all positive, do not exceed $1$ (see below), and sum to $2$. If so, the maximum is $\lambda_1 + \lambda_2$: with coefficients $c_i = u_i^2+v_i^2$ lying in $[0,1]$ and summing to $2$, the sum $\sum \lambda_i c_i$ is largest when all the weight is placed on the two largest eigenvalues, i.e. $c_1=c_2=1$ and all other $c_i=0$.
It only remains to show that the coefficients cannot exceed $1$. Notice that $u_k^2+v_k^2 = (\mathbf{u}\cdot\mathbf{k})^2+(\mathbf{v}\cdot\mathbf{k})^2$, where $\mathbf{k}$ is the $k$-th basis vector. This quantity is the squared length of the projection of $\mathbf k$ onto the plane spanned by $\mathbf u$ and $\mathbf v$. Therefore it cannot exceed the squared length of $\mathbf k$ itself, which equals $|\mathbf{k}|^2=1$, QED.
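The whole argument is easy to verify numerically (a sketch under assumed eigenvalues; the orthonormal pairs are drawn at random): in PCA-aligned coordinates, any orthonormal pair $(\mathbf u, \mathbf v)$ gives coefficients in $[0,1]$ summing to $2$, and total variance at most $\lambda_1+\lambda_2$.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = np.array([5.0, 3.0, 1.0, 0.5])              # eigenvalues in decreasing order (assumed)
Sigma = np.diag(lam)                              # covariance in PCA-aligned coordinates

for _ in range(1000):
    Q, _ = np.linalg.qr(rng.normal(size=(4, 2)))  # two random orthonormal vectors u, v
    u, v = Q[:, 0], Q[:, 1]
    coeffs = u**2 + v**2                          # coefficients of the eigenvalues
    assert np.all(coeffs <= 1 + 1e-12) and abs(coeffs.sum() - 2) < 1e-12
    assert u @ Sigma @ u + v @ Sigma @ v <= lam[0] + lam[1] + 1e-12

print("all checks passed: the maximum is attained by the first two basis vectors")
```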
See also @cardinal's answer to What is the objective function of PCA? (it follows the same logic).
For what you describe, I highly recommend "Foundations of Machine Learning" by Mohri et al. It is an undergraduate text, but it is for really good undergraduates. It is readable, and it is the only place I have found what I would call a mathematical definition of machine learning (PAC and weak PAC). It is worth reading for that reason alone. I also have a math PhD. I'm familiar with, and like, many of the books mentioned above. I'm particularly fond of ESL for its broad spectrum of techniques and ideas, but it's a statistics book with lots of mathematics.
Best Answer
The book by Prince, recommended by @seanv507, is indeed an excellent book on the topic (+1). And while it is not really compact, it has a very logical structure, a generous refresher chapter on probability, and a strong focus on machine learning within the computer vision context.
However, I'd like to recommend another excellent book on the topic (also freely downloadable), which, while focusing more on computer vision per se, IMHO contains enough machine learning material to qualify as an answer. The book I'm talking about is "Computer Vision: Algorithms and Applications" by Richard Szeliski (Microsoft Research). One of the advantages of this book versus the one by Prince is... narrower margins, which allow for a larger font size and, thus, better readability. Also, the book by Szeliski is very practical. Since both books share significant content but have somewhat different focus, in my opinion they complement each other very well. All this, among other advantages, makes it very easy for me to highly recommend Szeliski's book.