One can only guess what one particular author might mean by "shared variance." We might hope to circumscribe the possibilities by considering what properties this concept ought (intuitively) to have. We know that "variances add": the variance of a sum $X+\varepsilon$ is the sum of the variances of $X$ and $\varepsilon$ when $X$ and $\varepsilon$ have zero covariance. It is natural to define the "shared variance" of $X$ with the sum to be the fraction of the variance of the sum accounted for by the variance of $X$. This is enough to imply that the shared variance of any two random variables $X$ and $Y$ must be the square of their correlation coefficient.
This result gives meaning to the interpretation of a squared correlation coefficient as a "shared variance": in a suitable sense, it really is a fraction of a total variance that can be assigned to one variable in the sum.
The details follow.
Principles and their implications
Of course if $Y=X$, their "shared variance" (let's call it "SV" from now on) ought to be 100%. But what if $Y$ and $X$ are just scaled or shifted versions of one another? For instance, what if $Y$ represents the temperature of a city in degrees F and $X$ represents the temperature in degrees C? I would like to suggest that in such cases $X$ and $Y$ should still have 100% SV, so that this concept will remain meaningful regardless of how $X$ and $Y$ might be measured:
$$\operatorname{SV}(\alpha + \beta X, \gamma + \delta Y) = \operatorname{SV}(X,Y)\tag{1}$$
for any numbers $\alpha, \gamma$ and nonzero numbers $\beta, \delta$.
Another principle might be that when $\varepsilon$ is a random variable independent of $X$, the variance of $X+\varepsilon$ decomposes uniquely into two non-negative parts,
$$\operatorname{Var}(X+\varepsilon) = \operatorname{Var}(X) + \operatorname{Var}(\varepsilon),$$
suggesting we attempt to define SV in this special case as
$$\operatorname{SV}(X, X+\varepsilon) = \frac{\operatorname{Var}(X)}{\operatorname{Var}(X) + \operatorname{Var}(\varepsilon)}.\tag{2}$$
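A quick simulation illustrates $(2)$. This is only a sketch; the normal distributions and the particular variances of $X$ and $\varepsilon$ below are arbitrary choices, not anything required by the argument:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Independent (hence uncorrelated) X and eps; distributions and scales are arbitrary.
X = rng.normal(0.0, 1.0, n)        # Var(X)   = 1
eps = rng.normal(0.0, 2.0, n)      # Var(eps) = 4

sv = np.var(X) / (np.var(X) + np.var(eps))      # definition (2): about 1/5
rho2 = np.corrcoef(X, X + eps)[0, 1] ** 2       # squared correlation of X with X + eps

print(sv, rho2)   # both close to 0.2
```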
Since all these criteria are only up to second order--they only involve the first and second moments of the variables in the forms of expectations and variances--let's relax the requirement that $X$ and $\varepsilon$ be independent and only demand that they be uncorrelated. This will make the analysis much more general than it otherwise might be.
The results
These principles--if you accept them--lead to a unique, familiar, interpretable concept. The trick will be to reduce the general case to the special case of a sum, where we can apply definition $(2)$.
Given $(X,Y)$, we simply attempt to decompose $Y$ into a scaled, shifted version of $X$ plus a variable that is uncorrelated with $X$: that is, let's find (if it's possible) constants $\alpha$ and $\beta$ and a random variable $\varepsilon$ for which
$$Y = \alpha + \beta X + \varepsilon\tag{3}$$
with $\operatorname{Cov}(X, \varepsilon)=0$. For the decomposition to have any chance of being unique, we should demand
$$\mathbb{E}[\varepsilon]=0$$
so that once $\beta $ is found, $\alpha$ is determined by
$$\alpha = \mathbb{E}[Y] - \beta\, \mathbb{E}[X].$$
This looks an awful lot like linear regression, and indeed it is. The first principle says we may rescale $X$ and $Y$ to have unit variance (assuming each has nonzero variance). Once that is done, the standard least-squares result gives the value of $\beta$ in $(3)$ as the correlation of $X$ and $Y$:
$$\beta = \rho(X,Y)\tag{4}.$$
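To see where $(4)$ comes from, take the covariance of both sides of $(3)$ with $X$; because $\operatorname{Cov}(X,\varepsilon)=0$ and both variables have unit variance,

$$\operatorname{Cov}(X,Y) = \beta\operatorname{Var}(X) + \operatorname{Cov}(X,\varepsilon) = \beta, \qquad \rho(X,Y) = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} = \operatorname{Cov}(X,Y) = \beta.$$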
Moreover, taking the variance of both sides of $(3)$ gives
$$1 = \operatorname{Var}(Y) = \beta^2 \operatorname{Var}(X) + \operatorname{Var}(\varepsilon) = \beta^2 + \operatorname{Var}(\varepsilon),$$
implying
$$\operatorname{Var}(\varepsilon) = 1-\beta^2 = 1-\rho^2.\tag{5}$$
Consequently
$$\eqalign{
\operatorname{SV}(X,Y) &= \operatorname{SV}(X, \alpha+\beta X + \varepsilon) &\text{(Model 3)}\\
&= \operatorname{SV}(\beta X, \beta X + \varepsilon) &\text{(Property 1)}\\
&= \frac{\operatorname{Var}(\beta X)}{\operatorname{Var}(\beta X) + \operatorname{Var}(\varepsilon)} & \text{(Definition 2)}\\
&= \frac{\beta^2}{\beta^2 + (1-\beta^2)} = \beta^2 &\text{(Result 5)}\\
& = \rho^2 &\text{(Relation 4)}.
}$$
Note that because the coefficient obtained by regressing $X$ on $Y$ (again with both variables standardized to unit variance) is $\rho(Y,X)=\rho(X,Y)$, the "shared variance" itself is symmetric, justifying a terminology that suggests the order of $X$ and $Y$ does not matter:
$$\operatorname{SV}(X,Y) = \rho(X,Y)^2 = \rho(Y,X)^2 = \operatorname{SV}(Y,X).$$
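The whole chain of reasoning is easy to check numerically. The sketch below uses an arbitrary joint distribution for $(X,Y)$, carries out the decomposition $(3)$ by least squares, and compares Definition $(2)$ with $\rho^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# An arbitrary correlated pair (X, Y); any nondegenerate joint distribution would do.
X = rng.normal(size=n)
Y = 3.0 - 1.5 * X + rng.normal(scale=2.0, size=n)

# Least-squares decomposition Y = alpha + beta*X + eps with Cov(X, eps) = 0.
cov = np.cov(X, Y)
beta = cov[0, 1] / cov[0, 0]
alpha = Y.mean() - beta * X.mean()
eps = Y - (alpha + beta * X)

# Definition (2) applied to the decomposition (3) ...
sv = np.var(beta * X) / (np.var(beta * X) + np.var(eps))
# ... agrees with the squared correlation coefficient.
rho2 = np.corrcoef(X, Y)[0, 1] ** 2

print(sv, rho2)   # equal up to floating-point error
```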
Best Answer
The reason the SVD of the original matrix $X$ is preferred to the eigen-decomposition of the covariance matrix $C$ when doing PCA is that the eigenvalue problem posed by the covariance matrix $C$ (where $C = \frac{1}{N-1}X_0^T X_0$, with $X_0$ the zero-centred version of the original matrix $X$) has a higher condition number than the corresponding problem posed by the original data matrix $X$. In short, the condition number of a matrix quantifies how sensitive the solution of a system of linear equations defined by that matrix is to errors in the original data. The condition number strongly suggests (but does not fully determine) the quality of that solution.
In particular, because the covariance matrix $C$ is calculated as the cross-product of $X_0$ with itself, the ratio of the largest to the smallest singular value of $X_0$ gets squared. That ratio is the condition number; values close to unity, or generally below a few hundred, suggest a rather stable system. This is easy to see as follows:
Assume that $X_0 = USV^T$, where $U$ holds the left singular vectors, $V$ the right singular vectors, and $S$ is the diagonal matrix holding the singular values of $X_0$. Since $C = \frac{1}{N-1}X_0^TX_0$, we can write: $C = \frac{1}{N-1} VS^TU^T USV^T = \frac{1}{N-1} V S^T S V^T = \frac{1}{N-1} V \Sigma V^T$. (Remember that $U$ has orthonormal columns, so $U^TU = I$.) That is, the singular values of $X_0^TX_0$, collected in $\Sigma = S^TS$, are the squares of the singular values of $X_0$ collected in $S$.
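This identity is easy to verify numerically. The sketch below (with arbitrary dimensions) compares the eigenvalues of $C$ with the squared singular values of $X_0$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 5

X0 = rng.normal(size=(N, p))
X0 -= X0.mean(axis=0)                          # zero-centre the columns

s = np.linalg.svd(X0, compute_uv=False)        # singular values of X0, descending

C = X0.T @ X0 / (N - 1)                        # covariance matrix
eig = np.sort(np.linalg.eigvalsh(C))[::-1]     # its eigenvalues, also descending

print(np.allclose(eig, s**2 / (N - 1)))        # True: eigenvalues of C are (scaled) squared singular values of X0
```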
As you can see, the seemingly innocuous cross-product $X_0^TX_0$ squares the condition number of the system you are trying to solve, and thus makes the resulting system of equations more prone to numerical instability.
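A small numerical illustration of the squaring (again only a sketch; the dimensions and random data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 1_000, 50

X = rng.normal(size=(N, p))
X0 = X - X.mean(axis=0)            # zero-centred data matrix
C = X0.T @ X0 / (N - 1)            # covariance matrix

print(np.linalg.cond(X0) ** 2)     # squared condition number of the data matrix
print(np.linalg.cond(C))           # condition number of C: essentially the same number
```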
Some additional clarification specific to the paper linked: the estimate of the covariance matrix $C$ is immediately rank-degenerate in cases where $N < p$, which are the main focus of that paper; that is why the authors initially draw attention to the Marcenko–Pastur law (describing the distribution of the singular values) and to regularisation and banding techniques. Without such techniques, working with $C$ or the inverse of $C$ (in the form of the Cholesky factor of the inverse of $C$, as the authors do) is numerically unstable. The rationale for why these covariance matrices are degenerate is exactly the same as above in the case of very large matrices: the condition number is squared. This is even more prominent in the $N < p$ case: an $N\times p$ matrix $X$ has at most $N$ non-zero singular values, so its cross-product with itself can also have at most $N$ non-zero singular values, leading to rank degeneracy (and therefore an "infinite" condition number). The paper presents a way to band the estimated covariance matrix, under some particular conditions (the estimated $C$ has a Toeplitz structure, the oracle banding parameter $k$ can be properly estimated, etc.), so that it is numerically stable.
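The $N < p$ rank-degeneracy is just as easy to reproduce (arbitrary dimensions and random data, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 100                      # far fewer observations than variables

X = rng.normal(size=(N, p))
X0 = X - X.mean(axis=0)
C = X0.T @ X0 / (N - 1)             # p x p covariance estimate

print(np.linalg.matrix_rank(C))     # at most N - 1 (19 here), far below p = 100
print(np.linalg.cond(C))            # astronomically large: C is effectively singular
```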