Why would we care about "how much of the variance in the data is explained by the given regression model?"
To answer this, it is useful to think about exactly what it means for a certain percentage of the variance to be explained by the regression model.
Let $Y_{1}, \ldots, Y_{n}$ be the observed values of the outcome variable. The usual sample variance of the dependent variable in a regression model is $$ \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \overline{Y})^2 $$ Now let $\widehat{Y}_i \equiv \widehat{f}({\boldsymbol X}_i)$ be the prediction of $Y_i$ based on a least squares linear regression model with predictor values ${\boldsymbol X}_i$. As proven here, the variance above can be partitioned as:
$$ \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \overline{Y})^2 =
\underbrace{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2}_{{\rm residual \ variance}} + \underbrace{\frac{1}{n-1} \sum_{i=1}^{n} (\widehat{Y}_i - \overline{Y})^2}_{{\rm explained \ variance}}
$$
In least squares regression the average of the predicted values is $\overline{Y}$, so the total variance equals the average squared difference between the
observed and the predicted values (the residual variance) plus the sample variance of the predictions themselves (the explained variance), which are a function of the ${\boldsymbol X}$s only. The "explained" variance may therefore be thought of as the variance in $Y_i$ that is attributable to variation in ${\boldsymbol X}_i$. The proportion of the variance in $Y_i$ that is "explained" (i.e. the proportion of the variation in $Y_i$ attributable to variation in ${\boldsymbol X}_i$) is sometimes referred to as $R^2$.
Two extreme examples make it clear why this variance decomposition is important:
(1) The predictors have nothing to do with the responses. In that case, the best unbiased predictor (in the least squares sense) for $Y_i$ is $\widehat{Y}_i = \overline{Y}$. Therefore the total variance in $Y_i$ is just equal to the residual variance and is unrelated to the variance in the predictors ${\boldsymbol X}_i$.
(2) The predictors are perfectly linearly related to the responses. In that case, the predictions are exactly correct and $\widehat{Y}_i = Y_i$. Therefore there is no residual variance and all of the variance in the outcome is the variance in the predictions themselves, which are only a function of the predictors. Therefore all of the variance in the outcome is simply due to variance in the predictors ${\boldsymbol X}_i$.
Situations with real data will often lie between the two extremes, as will the proportion of variance that can be attributed to these two sources. The more "explained variance" there is - i.e. the more of the variation in $Y_i$ that is due to variation in ${\boldsymbol X}_i$ - the better the predictions $\widehat{Y}_{i}$ are performing (i.e. the smaller the "residual variance" is), which is another way of saying that the least squares model fits well.
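The decomposition is easy to verify numerically. Below is a minimal sketch in Python (the simulated data and variable names are my own, purely for illustration): it fits a least squares model and checks that the total variance equals the residual variance plus the explained variance, and computes $R^2$.

```python
# Minimal numerical check of the variance decomposition above (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=2.0, size=n)

# Least squares fit with an intercept
X1 = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.lstsq(X1, y, rcond=None)[0]
y_hat = X1 @ beta_hat

total_var     = np.sum((y - y.mean())**2)     / (n - 1)
residual_var  = np.sum((y - y_hat)**2)        / (n - 1)
explained_var = np.sum((y_hat - y.mean())**2) / (n - 1)

print(total_var, residual_var + explained_var)   # equal up to rounding
print("R^2 =", explained_var / total_var)
```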
Connection between James–Stein estimator and ridge regression
Let $\mathbf y$ be a length-$m$ vector of observations of $\boldsymbol \theta$, with ${\mathbf y} \sim N({\boldsymbol \theta}, \sigma^2 I)$. The James–Stein estimator is
$$\widehat{\boldsymbol \theta}_{JS} =
\left( 1 - \frac{(m-2) \sigma^2}{\|{\mathbf y}\|^2} \right) {\mathbf y}.$$
In terms of ridge regression, we can estimate $\boldsymbol \theta$ via $\min_{\boldsymbol{\theta}} \|\mathbf{y}-\boldsymbol{\theta}\|^2 + \lambda\|\boldsymbol{\theta}\|^2 ,$
where the solution is $$\widehat{\boldsymbol \theta}_{\mathrm{ridge}} = \frac{1}{1+\lambda}\mathbf y.$$
It is easy to see that the two estimators have the same form, but we need to estimate $\sigma^2$ in the James–Stein estimator, whereas $\lambda$ in ridge regression is typically chosen via cross-validation.
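As a quick illustration, here is a small Python sketch comparing the two shrinkage rules (the simulated data, the chosen $\lambda$, and the assumption that $\sigma^2$ is known are my own, for illustration only): both estimators are $\mathbf y$ multiplied by a scalar shrinkage factor smaller than one.

```python
# Compare the James-Stein and ridge shrinkage rules on the same observations.
import numpy as np

rng = np.random.default_rng(1)
m, sigma2 = 50, 1.0
theta = rng.normal(scale=2.0, size=m)
y = theta + rng.normal(scale=np.sqrt(sigma2), size=m)

# James-Stein: shrink y toward 0 by a data-dependent factor (sigma2 assumed known here)
js_factor = 1 - (m - 2) * sigma2 / np.sum(y**2)
theta_js = js_factor * y

# Ridge: shrink y by the fixed factor 1/(1 + lambda); lambda would normally come from CV
lam = 0.3
theta_ridge = y / (1 + lam)

print(js_factor, 1 / (1 + lam))   # both estimators scale y by a scalar < 1
```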
Connection between James–Stein estimator and random effects models
Let us discuss the mixed/random effects models in genetics first. The model is $$\mathbf {y}=\mathbf {X}\boldsymbol{\beta} + \boldsymbol{Z\theta}+\mathbf {e},
\boldsymbol{\theta}\sim N(\mathbf{0},\sigma^2_{\theta} I),
\textbf{e}\sim N(\mathbf{0},\sigma^2 I).$$
If there are no fixed effects and $\mathbf {Z}=I$, the model becomes
$$\mathbf {y}=\boldsymbol{\theta}+\mathbf {e},
\boldsymbol{\theta}\sim N(\mathbf{0},\sigma^2_{\theta} I),
\textbf{e}\sim N(\mathbf{0},\sigma^2 I),$$
which is equivalent to the setting of the James–Stein estimator once we take a Bayesian (empirical Bayes) view of $\boldsymbol\theta$.
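One way to make the Bayesian idea concrete (a sketch of the standard empirical Bayes argument): under this model the posterior mean of $\boldsymbol\theta$ given $\mathbf y$ is
$$\mathbb E[\boldsymbol\theta \mid \mathbf y] = \frac{\sigma^2_{\theta}}{\sigma^2_{\theta}+\sigma^2}\,\mathbf y = \left(1 - \frac{\sigma^2}{\sigma^2_{\theta}+\sigma^2}\right)\mathbf y.$$
Marginally $\mathbf y \sim N(\mathbf 0, (\sigma^2_{\theta}+\sigma^2) I)$, so $(m-2)/\|\mathbf y\|^2$ is an unbiased estimator of $1/(\sigma^2_{\theta}+\sigma^2)$; plugging it in gives exactly the James–Stein factor $1 - (m-2)\sigma^2/\|\mathbf y\|^2$.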
Connection between random effects models and ridge regression
If we focus on the random effects model above,
$$\mathbf {y}=\mathbf {Z\theta}+\mathbf {e},
\boldsymbol{\theta}\sim N(\mathbf{0},\sigma^2_{\theta} I),
\textbf{e}\sim N(\mathbf{0},\sigma^2 I).$$
estimating $\boldsymbol{\theta}$ is equivalent to solving the problem
$$\min_{\boldsymbol{\theta}} \|\mathbf{y}-\mathbf {Z\theta}\|^2 + \lambda\|\boldsymbol{\theta}\|^2$$
with $\lambda=\sigma^2/\sigma_{\theta}^2$. A proof can be found in Chapter 3 of *Pattern Recognition and Machine Learning*.
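The equivalence is easy to check numerically. Here is a small sketch (my own, with simulated data): the posterior mean / BLUP of $\boldsymbol\theta$ computed from the mixed-model formula matches the ridge solution with $\lambda = \sigma^2/\sigma_{\theta}^2$.

```python
# Check that the BLUP / posterior mean of theta equals the ridge solution
# with lambda = sigma^2 / sigma_theta^2.
import numpy as np

rng = np.random.default_rng(2)
m, p = 30, 10
sigma2, sigma2_theta = 1.0, 4.0
Z = rng.normal(size=(m, p))
theta = rng.normal(scale=np.sqrt(sigma2_theta), size=p)
y = Z @ theta + rng.normal(scale=np.sqrt(sigma2), size=m)

# Posterior mean / BLUP: sigma2_theta * Z^T (sigma2_theta Z Z^T + sigma2 I)^{-1} y
blup = sigma2_theta * Z.T @ np.linalg.solve(sigma2_theta * Z @ Z.T + sigma2 * np.eye(m), y)

# Ridge solution: (Z^T Z + lambda I)^{-1} Z^T y with lambda = sigma2 / sigma2_theta
lam = sigma2 / sigma2_theta
ridge = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

print(np.allclose(blup, ridge))   # True
```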
Connection between (multilevel) random effects models and that in genetics
In the random effects model above, $\mathbf y$ is $m\times 1$ and $\mathbf Z$ is $m \times p$. If we vectorize $\mathbf Z$ into an $(mp)\times 1$ vector and repeat $\mathbf y$ correspondingly, we obtain a hierarchical/clustered structure: $p$ clusters, each with $m$ units. If we then regress $\mathrm{vec}(\mathbf Z)$ on the repeated $\mathbf y$, we can obtain the random effect of $Z$ on $y$ for each cluster, though it is somewhat like a reverse regression.
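Here is a rough sketch of that reshaping (my own illustration of what I understand the construction to be): $\mathbf Z$ is stacked column by column, $\mathbf y$ is repeated once per cluster, and a separate slope is fit within each cluster.

```python
# Rough sketch of the "vectorize Z and repeat y" reshaping described above.
import numpy as np

m, p = 5, 3
rng = np.random.default_rng(3)
Z = rng.normal(size=(m, p))
y = rng.normal(size=m)

# Long format: one row per (unit, cluster) pair -> p clusters of m units each
z_long  = Z.flatten(order="F")          # vec(Z), stacking columns
y_long  = np.tile(y, p)                 # y repeated once per cluster
cluster = np.repeat(np.arange(p), m)    # cluster label for each row

# Per-cluster regression of vec(Z) on the repeated y (the "reverse regression")
slopes = [np.polyfit(y_long[cluster == j], z_long[cluster == j], 1)[0] for j in range(p)]
print(slopes)
```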
Acknowledgement: the first three points are largely based on these two Chinese articles: 1, 2.
Best Answer
In ordinary least squares regression, the variance of the regressors ends up in the denominator of the expression for the error of the parameter estimates:
$$\text{Cov} (\hat\beta) = \hat{\sigma}^2 (X^TX)^{-1} $$
If the columns of the regressor matrix $X$ are orthogonal (and they are if you use principal components), then the standard error can be expressed as
$$s.e.(\beta_i) \approx \sqrt{\frac{\sigma^2} {n \text{Var}(X_i)} }$$
So larger variance in the regressor $X_i$ means smaller variance/error in the estimate for the coefficient/slope/gradient.
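A small simulation makes this concrete (a sketch with my own simulated data): doubling the spread of the regressor roughly halves the standard error of the estimated slope.

```python
# Illustrate that larger Var(x) gives a smaller standard error for the slope.
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 500, 1.0

def slope_se(x_scale):
    x = rng.normal(scale=x_scale, size=n)
    y = 2.0 * x + rng.normal(scale=sigma, size=n)
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (n - 2)                 # residual variance estimate
    cov = s2 * np.linalg.inv(X.T @ X)            # Cov(beta_hat)
    return np.sqrt(cov[1, 1])                    # s.e. of the slope

print(slope_se(1.0), slope_se(2.0))   # the second is roughly half the first
```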
See also the following image. It shows the same correlated data but with a different variance for the $x$ variable. This changes the slope. If the scale of the $x$ variable is smaller, then the slope becomes larger (and so does the error of the slope).