The answer to both 1 and 2 is no, but care is needed in interpreting the existence theorem.
Variance of Ridge Estimator
Let $\hat{\beta^*}$ be the ridge estimate under penalty $k$, and let $\beta$ be the true parameter for the model $Y = X \beta + \epsilon$. Let $\lambda_1, \dotsc, \lambda_p$ be the eigenvalues of $X^T X$.
From Hoerl & Kennard equations 4.2-4.5, the risk (the expected squared $L^2$ norm of the estimation error) is
$$
\begin{align*}
E \left( \left[ \hat{\beta^*} - \beta \right]^T \left[ \hat{\beta^*} - \beta \right] \right)& = \sigma^2 \sum_{j=1}^p \lambda_j/ \left( \lambda_j +k \right)^2 + k^2 \beta^T \left( X^T X + k \mathbf{I}_p \right)^{-2} \beta \\
& = \gamma_1 (k) + \gamma_2(k) \\
& = R(k)
\end{align*}
$$
where, as far as I can tell, $\left( X^T X + k \mathbf{I}_p \right)^{-2} = \left( X^T X + k \mathbf{I}_p \right)^{-1} \left( X^T X + k \mathbf{I}_p \right)^{-1}.$ They remark that $\gamma_1$ is the total variance of the estimator, $E\left[(\hat{\beta^*} - E\hat{\beta^*})^T(\hat{\beta^*} - E\hat{\beta^*})\right]$, while $\gamma_2$ is the squared length of the bias, $(E\hat{\beta^*} - \beta)^T(E\hat{\beta^*} - \beta)$.
Supposing $X^T X = \mathbf{I}_p$, we have
$$R(k) = \frac{p \sigma^2 + k^2 \beta^T \beta}{(1+k)^2}.$$
The derivative of the risk with respect to $k$ is
$$R^\prime (k) = 2\,\frac{k(1+k)\beta^T \beta - (p\sigma^2 + k^2 \beta^T \beta)}{(1+k)^3}.$$
Since $\lim_{k \rightarrow 0^+} R^\prime (k) = -2p \sigma^2 < 0$, we conclude that there is some $k^*>0$ such that $R(k^*)<R(0)$.
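In fact, simplifying the numerator makes the existence claim explicit:
$$
R^\prime(k) = \frac{2\left(k\,\beta^T\beta - p\sigma^2\right)}{(1+k)^3},
$$
which is negative for $k < p\sigma^2/\beta^T\beta$ and positive beyond it, so in this orthogonal case the risk is minimized at $k^* = p\sigma^2/\beta^T\beta$, with $R(k^*) = p\sigma^2/(1+k^*) < p\sigma^2 = R(0)$.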
The authors remark that orthogonality is the best that you can hope for in terms of the risk at $k=0$, and that as the condition number of $X^T X$ increases, $\lim_{k \rightarrow 0^+} R^\prime (k)$ approaches $- \infty$.
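As a quick numerical illustration of both points (a sketch only; the coefficients, noise variance, and eigenvalues below are made up, and $X^T X$ is taken to be diagonal so that the risk formula reduces to sums over the eigenvalues):

```r
# Risk R(k) = gamma1(k) + gamma2(k) for a diagonal X'X with eigenvalues `lambda`,
# in which case beta' (X'X + kI)^{-2} beta = sum(beta^2 / (lambda + k)^2).
ridge_risk <- function(k, lambda, beta, sigma2) {
  gamma1 <- sigma2 * sum(lambda / (lambda + k)^2)  # total variance
  gamma2 <- k^2 * sum(beta^2 / (lambda + k)^2)     # squared bias
  gamma1 + gamma2
}

beta   <- c(2, -1, 0.5)        # hypothetical true coefficients
sigma2 <- 1
ks     <- seq(0, 2, by = 0.001)

# Orthogonal design (X'X = I): the risk dips below R(0) = p * sigma^2 = 3
r_orth <- sapply(ks, ridge_risk, lambda = rep(1, 3), beta = beta, sigma2 = sigma2)
ks[which.min(r_orth)]          # about p * sigma^2 / (beta' beta) = 3 / 5.25
min(r_orth) < r_orth[1]        # TRUE: some k > 0 beats k = 0

# Ill-conditioned design: one tiny eigenvalue makes R(0) huge and the initial
# slope steeply negative, so the gain from a small k > 0 is much larger
r_ill <- sapply(ks, ridge_risk, lambda = c(1, 1, 1e-3), beta = beta, sigma2 = sigma2)
c(r_ill[1], min(r_ill))
```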
Comment
There appears to be a paradox here, in that if $p=1$ and $X$ is a constant column, then we are just estimating the mean of a sequence of Normal$(\beta, \sigma^2)$ variables, and we know that the vanilla unbiased estimate is admissible in this case. This is resolved by noticing that the above reasoning merely shows that a risk-minimizing value of $k$ exists for fixed $\beta^T \beta$. But for any fixed $k$, we can make the risk explode by making $\beta^T \beta$ large, so this argument alone does not show that a fixed-$k$ ridge estimate dominates least squares, and hence it does not contradict the admissibility of the latter.
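To make the last point concrete in the orthogonal setting: for any fixed $k > 0$,
$$
R(k) = \frac{p\sigma^2 + k^2\,\beta^T\beta}{(1+k)^2} \;\ge\; \frac{k^2}{(1+k)^2}\,\beta^T\beta \;\longrightarrow\; \infty
\quad\text{as } \beta^T\beta \to \infty,
$$
while the OLS risk stays at $p\sigma^2$, so no single choice of $k$ improves on OLS uniformly in $\beta$.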
Why is ridge regression usually recommended only in the case of correlated predictors?
H&K's risk derivation shows that if we think that $\beta ^T \beta$ is small, and if the design $X^T X$ is nearly singular, then we can achieve large reductions in the risk of the estimate. I think ridge regression isn't used ubiquitously because the OLS estimate is a safe default and its invariance and unbiasedness properties are attractive. When it fails, it fails honestly--your covariance matrix explodes. There is also perhaps a philosophical/inferential point: if your design is nearly singular and you have observational data, then the interpretation of $\beta$ as giving changes in $E Y$ for unit changes in $X$ is suspect--the large covariance matrix is a symptom of that.
But if your goal is solely prediction, the inferential concerns no longer hold, and you have a strong argument for using some sort of shrinkage estimator.
The inverse of a block (or partitioned) matrix is given by
$$
\left[ \begin{array}{cc} M_{11} & M_{12} \\ M_{21} & M_{22} \end{array} \right] ^{-1} = \left[ \begin{array}{cc} K_1^{-1} & -M_{11}^{-1} M_{12}K_2^{-1} \\ -K_2^{-1} M_{21} M_{11}^{-1} & K_2^{-1} \end{array} \right],
$$
where $K_1 = M_{11} - M_{12} M_{22}^{-1} M_{21}$ and $K_2 = M_{22} - M_{21} M_{11}^{-1} M_{12}$.
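As a quick numerical check of this identity (a sketch; the block sizes are arbitrary and the matrix is just a random symmetric positive definite one):

```r
set.seed(1)
n1 <- 3; n2 <- 2
M  <- crossprod(matrix(rnorm((n1 + n2)^2), n1 + n2)) + diag(n1 + n2)  # random SPD matrix
i1 <- 1:n1; i2 <- n1 + 1:n2
M11 <- M[i1, i1]; M12 <- M[i1, i2]; M21 <- M[i2, i1]; M22 <- M[i2, i2]

K1 <- M11 - M12 %*% solve(M22) %*% M21   # Schur complement of M22
K2 <- M22 - M21 %*% solve(M11) %*% M12   # Schur complement of M11
Minv <- rbind(cbind(solve(K1),                          -solve(M11) %*% M12 %*% solve(K2)),
              cbind(-solve(K2) %*% M21 %*% solve(M11),   solve(K2)))
max(abs(Minv - solve(M)))                # ~0, agreeing with direct inversion
```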
When the matrix is block diagonal, this reduces to
$$
\left[ \begin{array}{cc} M_{11} & 0 \\ 0 & M_{22} \end{array} \right] ^{-1} = \left[ \begin{array}{cc} M_{11}^{-1} & 0 \\ 0 & M_{22}^{-1} \end{array} \right].
$$
These identities are in The Matrix Cookbook. The fact that the inverse of a block diagonal matrix has a simple, diagonal form will help you a lot. I don't know of a way to exploit the fact that the matrices are symmetric and positive definite.
To invert your matrix, let $M_{11} = \left[ \begin{array}{ccc} A & 0 & 0 \\ 0 & B & 0 \\ 0 & 0 & C \end{array} \right]$, $M_{12} = M_{21}' = \left[ \begin{array}{c} E \\ F \\ G \end{array} \right]$, and $M_{22} = D$.
Recursively applying the block diagonal inverse formula gives
$$
M_{11}^{-1} = \left[ \begin{array}{ccc} A & 0 & 0 \\ 0 & B & 0 \\ 0 & 0 & C \end{array} \right]^{-1} = \left[ \begin{array}{ccc} A^{-1} & 0 & 0 \\ 0 & B^{-1} & 0 \\ 0 & 0 & C^{-1} \end{array} \right].
$$
Now you can compute $K_1^{-1}$, $M_{11}^{-1}$, and $K_2^{-1}$, and plug them into the first identity for the inverse of a partitioned matrix.
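Here is a sketch of that recipe end to end, with made-up block sizes (the names A through G follow your blocks; the off-diagonal couplings are kept small so the whole matrix stays positive definite):

```r
set.seed(2)
rnd_spd <- function(n) crossprod(matrix(rnorm(n * n), n)) + n * diag(n)  # random SPD block

A <- rnd_spd(2); B <- rnd_spd(2); C <- rnd_spd(3); D <- rnd_spd(1)
E <- matrix(rnorm(2), 2, 1) * 0.1
F <- matrix(rnorm(2), 2, 1) * 0.1
G <- matrix(rnorm(3), 3, 1) * 0.1

# Small helper: assemble a block diagonal matrix from square blocks
block_diag <- function(...) {
  blocks <- list(...)
  n   <- sum(vapply(blocks, nrow, integer(1)))
  out <- matrix(0, n, n)
  off <- 0
  for (blk in blocks) {
    idx <- off + seq_len(nrow(blk))
    out[idx, idx] <- blk
    off <- off + nrow(blk)
  }
  out
}

M11    <- block_diag(A, B, C)
M11inv <- block_diag(solve(A), solve(B), solve(C))   # invert block by block
M12    <- rbind(E, F, G); M21 <- t(M12); M22 <- D

K1 <- M11 - M12 %*% solve(M22) %*% M21               # Schur complement of M22
K2 <- M22 - M21 %*% M11inv %*% M12                   # Schur complement of M11
Minv <- rbind(cbind(solve(K1),                    -M11inv %*% M12 %*% solve(K2)),
              cbind(-solve(K2) %*% M21 %*% M11inv,  solve(K2)))

M <- rbind(cbind(M11, M12), cbind(M21, M22))
max(abs(Minv - solve(M)))                            # ~0
```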
Best Answer
There are several good ways to do this using R. One classical method is to compute the Choleski factor of the covariance matrix:
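A sketch of this approach, assuming the response, design matrix, and covariance matrix are stored in `y`, `X`, and `C`:

```r
R   <- chol(C)                             # Choleski factor: C = R'R, R upper triangular
y_c <- backsolve(R, y, transpose = TRUE)   # solve R' y_c = y by forward substitution
X_c <- backsolve(R, X, transpose = TRUE)   # solve R' X_c = X
fit <- lm.fit(X_c, y_c)                    # ordinary least squares on the whitened data
RSS <- sum(fit$effects[-(1:fit$rank)]^2)   # sum of squared effects beyond the rank
```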
This code requires C to be a strictly positive definite matrix. C has to be positive definite anyway in order to guarantee that the RSS is finite.
If you want to make the computation even more explicit, you could replace the last two lines with this:
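For example, a sketch using the QR decomposition directly (`qr.qty` applies $Q^T$ to a vector without forming $Q$):

```r
qx  <- qr(X_c)                        # QR decomposition of the whitened design
e   <- qr.qty(qx, y_c)[-(1:qx$rank)]  # Q'y_c with the first p entries dropped
RSS <- sum(e^2)                       # residual sum of squares
```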
The above code is performing the following mathematical steps. First, we factorize $$C = R^TR$$ where $R$ is an upper triangular matrix. Then we solve the linear systems $$R^Ty_c=y$$ and $$R^TX_c=X$$ for $y_c$ and $X_c$ using an efficient forward substitution algorithm. Note that the above two steps are far more efficient than inverting $C$.
From this point we can view this as an unweighted regression problem with $y_c$ and $X_c$. Amongst other things, the `lm.fit` function uses the QR decomposition of $X_c$ to find an $n\times(n-p)$ matrix $Q$ such that $Q^TQ=I$ and $Q^TX_c=0$. Here, $p$ is the column rank of $X_c$ (which equals that of $X$). The orthogonal residuals (or effects) can then be computed as $$e=Q^Ty_c$$ and finally the RSS is $e^Te$. Actually the function computed $Q^Ty_c$, where $Q$ was $n\times n$, and stored this vector in `fit$effects`; we then threw away the first $p$ values to get $e$.

You might have been hoping for a simpler mathematical formula, but efficient computation requires that one avoid evaluating mathematical entities such as inverse matrices or ordinary residuals.