One can only guess what one particular author might mean by "shared variance." We might hope to circumscribe the possibilities by considering what properties this concept ought (intuitively) to have. We know that "variances add": the variance of a sum $X+\varepsilon$ is the sum of the variances of $X$ and $\varepsilon$ when $X$ and $\varepsilon$ have zero covariance. It is natural to define the "shared variance" of $X$ with the sum to be the fraction of the variance of the sum represented by the variance of $X$. This is enough to imply the shared variance of any two random variables $X$ and $Y$ must be the square of their correlation coefficient.
This result gives meaning to the interpretation of a squared correlation coefficient as a "shared variance": in a suitable sense, it really is a fraction of a total variance that can be assigned to one variable in the sum.
The details follow.
Principles and their implications
Of course if $Y=X$, their "shared variance" (let's call it "SV" from now on) ought to be 100%. But what if $Y$ and $X$ are just scaled or shifted versions of one another? For instance, what if $Y$ represents the temperature of a city in degrees F and $X$ represents the temperature in degrees C? I would like to suggest that in such cases $X$ and $Y$ should still have 100% SV, so that this concept will remain meaningful regardless of how $X$ and $Y$ might be measured:
$$\operatorname{SV}(\alpha + \beta X, \gamma + \delta Y) = \operatorname{SV}(X,Y)\tag{1}$$
for any numbers $\alpha, \gamma$ and nonzero numbers $\beta, \delta$.
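A quick plain-Python check of property $(1)$ may help: the squared correlation coefficient, which will turn out to be the SV, is unchanged by exactly these affine rescalings. The data and transformations below are illustrative choices, not part of the argument.

```python
# Sketch: verify that the squared correlation is invariant under
# x -> alpha + beta*x, y -> gamma + delta*y with beta, delta nonzero,
# as property (1) demands.  Simulated data; all constants are arbitrary.
import random

random.seed(1)
x = [random.gauss(0, 1) for _ in range(1000)]
y = [0.6 * xi + random.gauss(0, 1) for xi in x]

def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return suv / (su * sv)

# A Celsius-to-Fahrenheit style rescaling of x and a sign-flipping
# rescaling of y:
x2 = [32 + 1.8 * xi for xi in x]
y2 = [10 - 2.0 * yi for yi in y]

r, r2 = corr(x, y), corr(x2, y2)
print(round(r * r, 6), round(r2 * r2, 6))  # the squared correlations agree
```

Note that the correlation itself may flip sign (here $\delta < 0$), which is why it is the *square* that behaves well under $(1)$.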
Another principle might be that when $\varepsilon$ is a random variable independent of $X$, then the variance of $X+\varepsilon$ can be uniquely decomposed into two non-negative parts,
$$\operatorname{Var}(X+\varepsilon) = \operatorname{Var}(X) + \operatorname{Var}(\varepsilon),$$
suggesting we attempt to define SV in this special case as
$$\operatorname{SV}(X, X+\varepsilon) = \frac{\operatorname{Var}(X)}{\operatorname{Var}(X) + \operatorname{Var}(\varepsilon)}.\tag{2}$$
Since all these criteria are only up to second order--they involve only the first and second moments of the variables, in the form of expectations, variances, and covariances--let's relax the requirement that $X$ and $\varepsilon$ be independent and demand only that they be uncorrelated. This makes the analysis much more general than it otherwise would be.
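A small simulation makes the additivity concrete. The particular variances (4 and 1) are arbitrary; with independent (hence uncorrelated) variables, definition $(2)$ gives a fraction near $4/(4+1) = 0.8$.

```python
# Sketch: for uncorrelated X and noise e, Var(X + e) = Var(X) + Var(e),
# so definition (2) assigns X the fraction Var(X) / (Var(X) + Var(e)).
# Simulated data with illustrative variances 4 and 1.
import random

random.seed(2)
n = 200_000
x = [random.gauss(0, 2) for _ in range(n)]   # Var(X) = 4
e = [random.gauss(0, 1) for _ in range(n)]   # Var(e) = 1

def var(u):
    m = sum(u) / len(u)
    return sum((a - m) ** 2 for a in u) / len(u)

s = [a + b for a, b in zip(x, e)]
sv = var(x) / (var(x) + var(e))              # definition (2): about 0.8
print(round(var(s), 2), round(var(x) + var(e), 2), round(sv, 2))
```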
The results
These principles--if you accept them--lead to a unique, familiar, interpretable concept. The trick will be to reduce the general case to the special case of a sum, where we can apply definition $(2)$.
Given $(X,Y)$, we simply attempt to decompose $Y$ into a scaled, shifted version of $X$ plus a variable that is uncorrelated with $X$: that is, let's find (if it's possible) constants $\alpha$ and $\beta$ and a random variable $\varepsilon$ for which
$$Y = \alpha + \beta X + \varepsilon\tag{3}$$
with $\operatorname{Cov}(X, \varepsilon)=0$. For the decomposition to have any chance of being unique, we should demand
$$\mathbb{E}[\varepsilon]=0$$
so that once $\beta$ is found, $\alpha$ is determined by
$$\alpha = \mathbb{E}[Y] - \beta\, \mathbb{E}[X].$$
This looks an awful lot like linear regression, and indeed it is. The first principle says we may rescale $X$ and $Y$ to have unit variance (assuming they each have nonzero variance); once that is done, standard regression results assert that the value of $\beta$ in $(3)$ is the correlation of $X$ and $Y$:
$$\beta = \rho(X,Y)\tag{4}.$$
Moreover, taking the variance of both sides of $(3)$ gives
$$1 = \operatorname{Var}(Y) = \beta^2 \operatorname{Var}(X) + \operatorname{Var}(\varepsilon) = \beta^2 + \operatorname{Var}(\varepsilon),$$
implying
$$\operatorname{Var}(\varepsilon) = 1-\beta^2 = 1-\rho^2.\tag{5}$$
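Results $(4)$ and $(5)$ can be checked numerically. In this plain-Python sketch (simulated data, arbitrary coefficients), after standardizing both variables the least-squares slope is the sample correlation, and the sample residual variance equals $1-\rho^2$ exactly, not just approximately.

```python
# Sketch: standardize X and Y to unit variance; then the least-squares
# slope beta equals the correlation rho (result 4), and the residual
# variance is 1 - rho^2 (result 5).  Simulated data.
import random

random.seed(3)
n = 100_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

def standardize(u):
    m = sum(u) / len(u)
    s = (sum((a - m) ** 2 for a in u) / len(u)) ** 0.5
    return [(a - m) / s for a in u]

xs, ys = standardize(x), standardize(y)
rho = sum(a * b for a, b in zip(xs, ys)) / n        # sample correlation
beta = rho                                          # OLS slope on standardized data
resid = [b - beta * a for a, b in zip(xs, ys)]
var_resid = sum(r * r for r in resid) / n
print(round(var_resid, 4), round(1 - rho ** 2, 4))  # these agree
```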
Consequently
$$\eqalign{
\operatorname{SV}(X,Y) &= \operatorname{SV}(X, \alpha+\beta X + \varepsilon) &\text{(Model 3)}\\
&= \operatorname{SV}(\beta X, \beta X + \varepsilon) &\text{(Property 1)}\\
&= \frac{\operatorname{Var}(\beta X)}{\operatorname{Var}(\beta X) + \operatorname{Var}(\varepsilon)} & \text{(Definition 2)}\\
&= \frac{\beta^2}{\beta^2 + (1-\beta^2)} = \beta^2 &\text{(Result 5)}\\
& = \rho^2 &\text{(Relation 4)}.
}$$
Note that because the coefficient in the regression of $X$ on $Y$ (with both variables standardized to unit variance) is $\rho(Y,X)=\rho(X,Y)$, the "shared variance" itself is symmetric, justifying a terminology that suggests the order of $X$ and $Y$ does not matter:
$$\operatorname{SV}(X,Y) = \rho(X,Y)^2 = \rho(Y,X)^2 = \operatorname{SV}(Y,X).$$
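The symmetry can be seen numerically by carrying out decomposition $(3)$ in both directions. In this sketch (simulated data, arbitrary coefficients), the fraction of variance captured by the fitted term, computed regressing $Y$ on $X$ and then $X$ on $Y$, equals $\rho^2$ either way.

```python
# Sketch: regress Y on X and X on Y; the fraction of variance explained
# by the fitted term, Var(beta*X)/Var(Y) or Var(beta'*Y)/Var(X), is
# rho^2 in both directions.  Simulated data.
import random

random.seed(4)
n = 50_000
x = [random.gauss(0, 3) for _ in range(n)]
y = [2 - 0.7 * xi + random.gauss(0, 2) for xi in x]

def moments(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cuv = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    vu = sum((a - mu) ** 2 for a in u) / n
    vv = sum((b - mv) ** 2 for b in v) / n
    return cuv, vu, vv

cuv, vx, vy = moments(x, y)
rho2 = cuv ** 2 / (vx * vy)                 # squared correlation
sv_y_on_x = (cuv / vx) ** 2 * vx / vy       # Var(beta*X) / Var(Y)
sv_x_on_y = (cuv / vy) ** 2 * vy / vx       # Var(beta'*Y) / Var(X)
print(round(rho2, 6), round(sv_y_on_x, 6), round(sv_x_on_y, 6))
```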
Best Answer
Sometimes we can "augment knowledge" with an unusual or different approach. I would like this reply to be accessible to kindergartners and also have some fun, so everybody get out your crayons!
Given paired $(x,y)$ data, draw their scatterplot. (The younger students may need a teacher to produce this for them. :-) Each pair of points $(x_i,y_i)$, $(x_j,y_j)$ in that plot determines a rectangle: the smallest rectangle with sides parallel to the axes that contains both points. The points therefore sit either at the lower left and upper right corners (a "positive" relationship) or at the upper left and lower right corners (a "negative" relationship).
Draw all possible such rectangles. Color them transparently, making the positive rectangles red (say) and the negative rectangles "anti-red" (blue). In this fashion, wherever rectangles overlap, their colors are either enhanced when they are the same (blue and blue or red and red) or cancel out when they are different.
(In this illustration of a positive (red) and negative (blue) rectangle, the overlap ought to be white; unfortunately, this software does not have a true "anti-red" color. The overlap is gray, so it will darken the plot, but on the whole the net amount of red is correct.)
Now we're ready for the explanation of covariance.
The covariance is the net amount of red in the plot (treating blue as negative values).
Here are some examples with 32 binormal points drawn from distributions with the given covariances, ordered from most negative (bluest) to most positive (reddest).
They are drawn on common axes to make them comparable. The rectangles are lightly outlined to help you see them. This is an updated (2019) version of the original: it uses software that properly cancels the red and cyan colors in overlapping rectangles.
Let's deduce some properties of covariance. Understanding of these properties will be accessible to anyone who has actually drawn a few of the rectangles. :-)
Bilinearity. Because the amount of red depends on the sizes of the rectangles, covariance is directly proportional to the scale used on the x-axis and to the scale used on the y-axis.
Correlation. Covariance increases as the points approximate an upward sloping line and decreases as the points approximate a downward sloping line. This is because in the former case most of the rectangles are positive and in the latter case, most are negative.
Relationship to linear associations. Because non-linear associations can create mixtures of positive and negative rectangles, they lead to unpredictable (and not very useful) covariances. Linear associations can be fully interpreted by means of the preceding two characterizations.
Sensitivity to outliers. A geometric outlier (one point standing away from the mass) will create many large rectangles in association with all the other points. It alone can create a net positive or negative amount of red in the overall picture.
Incidentally, this definition of covariance, once the net amount of red is averaged over all pairs of points, differs from the usual one only by a universal constant of proportionality (independent of the data set size). The mathematically inclined will have no trouble performing the algebraic demonstration that the formula given here is always twice the usual covariance.
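For the less algebraically inclined, the identity is easy to check by brute force. In this sketch the "net red" is the sum of signed rectangle areas, $(x_i-x_j)(y_i-y_j)$ for each pair (positive for an upper-right/lower-left pair, negative otherwise), averaged over all pairs; it comes out exactly twice the usual sample covariance (the one with the $n-1$ denominator). The data are simulated.

```python
# Sketch: the average signed rectangle area over all pairs of points
# equals exactly twice the usual (n-1 denominator) sample covariance.
import random
from itertools import combinations

random.seed(5)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.4 * xi + random.gauss(0, 1) for xi in x]

# Signed area of the rectangle spanned by points i and j: positive for
# a lower-left/upper-right ("red") pair, negative for a "blue" pair.
net_red = sum((x[i] - x[j]) * (y[i] - y[j])
              for i, j in combinations(range(n), 2))
net_red /= n * (n - 1) / 2                    # average over all pairs

mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
print(round(net_red, 10), round(2 * cov, 10))  # identical
```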