So $S_h = L_h^T\setminus(L_h\setminus y_h)$ is true. However, from the question it looks like you are actually wondering whether $S_h = (L^T\setminus(L\setminus y))_h$, and this is not true.
Super simple example:
$K = \left( \begin{matrix} 1 & 0.5 \\ 0.5 & 1 \end{matrix}\right)$, $y = \left(\begin{matrix} 3 \\ 4 \end{matrix}\right)$
$L =\left( \begin{matrix} 1 & 0 \\ 0.5 & 0.866 \end{matrix}\right)$
$L^T \setminus (L \setminus y) = \left( \begin{matrix} 1.33 \\ 3.33 \end{matrix}\right)$
So $(L^T \setminus (L \setminus y))_h = 1.33$, where $h=1$.
However, with $K_h = 1$ and $y_h = 3$ we get $L_h = 1$ and $L_h^T \setminus (L_h \setminus y_h) = 3$.
In short, you have to compute the Cholesky factor of the submatrix and perform the triangular solves again; you cannot simply subset the full solution.
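If you want to check this numerically, here is a minimal numpy sketch of the example above (the names `S` and `S_h` are mine, standing for the full and subsetted solutions):

```python
import numpy as np

K = np.array([[1.0, 0.5],
              [0.5, 1.0]])
y = np.array([3.0, 4.0])

# Full solve S = K^{-1} y via the Cholesky factor L (K = L L').
L = np.linalg.cholesky(K)
S = np.linalg.solve(L.T, np.linalg.solve(L, y))
print(S)                       # [1.333... 3.333...]

# Subsetted solve with h = 1 (index 0 here): Cholesky of K_h.
h = [0]
L_h = np.linalg.cholesky(K[np.ix_(h, h)])
S_h = np.linalg.solve(L_h.T, np.linalg.solve(L_h, y[h]))
print(S_h)                     # [3.0], not 1.333...
```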
The question appears to ask for a demonstration that Ridge Regression shrinks coefficient estimates towards zero, using a spectral decomposition. The spectral decomposition can be understood as an easy consequence of the Singular Value Decomposition (SVD). Therefore, this post starts with SVD. It explains it in simple terms and then illustrates it with important applications. Then it provides the requested (algebraic) demonstration. (The algebra, of course, is identical to the geometric demonstration; it is merely couched in a different language.)
The original source of this answer can be found in my regression course notes. This version corrects some minor errors.
What the SVD is
Any $n\times p$ matrix $X$, with $p \le n$, can be written $$X = UDV^\prime$$ where
$U$ is an $n\times p$ matrix.
- The columns of $U$ have length $1$.
- The columns of $U$ are mutually orthogonal.
- They are called the principal components of $X$.
$V$ is a $p \times p$ matrix.
- The columns of $V$ have length $1$.
- The columns of $V$ are mutually orthogonal.
- This makes $V$ a rotation of $\mathbb{R}^p$.
$D$ is a diagonal $p \times p$ matrix.
- The diagonal elements $d_{11}, d_{22}, \ldots, d_{pp}$ are not negative. These are the singular values of $X$.
- If we wish, we may order them from largest to smallest.
The first two conditions in each list assert that the columns of $U$ and the columns of $V$ are orthonormal. They can be neatly summarized by the conditions
$$U^\prime U = 1_p,\ V^\prime V = 1_p.$$
Because $V$ is square, $V^\prime V = 1_p$ implies $VV^\prime = 1_p$ as well (this is what makes $V$ a rotation). This identity will be used in the Ridge Regression derivation below.
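These conditions are easy to verify numerically. A minimal sketch with numpy's thin SVD (a random $X$ stands in for a design matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.standard_normal((n, p))

# Thin SVD: U is n x p, d holds the singular values, Vt is V'.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(U.T @ U, np.eye(p)))      # U'U = 1_p
print(np.allclose(Vt @ Vt.T, np.eye(p)))    # V'V = 1_p
print(np.allclose(Vt.T @ Vt, np.eye(p)))    # VV' = 1_p, since V is square
print(np.allclose(U @ np.diag(d) @ Vt, X))  # X = UDV'
```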
What it does for us
It can simplify formulas. This works both algebraically and conceptually. Here are some examples.
The Normal Equations
Consider the regression $y = X\beta + \varepsilon$ where, as usual, the $\varepsilon$ are independent and identically distributed according to a law that has zero expectation and finite variance $\sigma^2$. The least squares solution via the Normal Equations is $$\hat\beta = (X^\prime X)^{-1} X^\prime y.$$ Applying the SVD and simplifying the resulting algebraic mess (which is easy) provides a nice insight:
$$(X^\prime X)^{-1} X^\prime = ((UDV^\prime)^\prime (UDV^\prime))^{-1} (UDV^\prime)^\prime \\= (VDU^\prime U D V^\prime)^{-1} (VDU^\prime) = VD^{-2}V^\prime VDU^\prime = VD^{-1}U^\prime.$$
The only difference between this and $X^\prime = VDU^\prime$ is that the reciprocals of the elements of $D$ are used! In other words, the "equation" $y=X\beta$ is solved by "inverting" $X$: this pseudo-inversion undoes the rotations $U$ and $V^\prime$ (merely by transposing them) and undoes the multiplication (represented by $D$) separately in each principal direction.
For future reference, notice that "rotated" estimates $V^\prime \hat\beta $ are linear combinations of "rotated" responses $U^\prime y$. The coefficients are inverses of the (positive) diagonal elements of $D$, equal to $d_{ii}^{-1}$.
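As a sketch of this observation, the following numpy snippet solves a least-squares problem by "inverting" the SVD and confirms that it matches the Normal Equations solution (the random data and coefficients are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Pseudo-inversion: undo the rotations, take reciprocals in D.
beta_svd = Vt.T @ ((U.T @ y) / d)

# Normal Equations solution for comparison.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_svd, beta_ne))       # True
```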
Covariance of the coefficient estimates
Recall that the covariance of the estimates is $$\text{Cov}(\hat\beta) = \sigma^2(X^\prime X)^{-1}.$$ Using the SVD, this becomes $$\sigma^2(V D^2 V^\prime)^{-1} = \sigma^2 V D^{-2} V^\prime.$$ In other words, the covariance acts like that of $p$ orthogonal variables, each with variance $\sigma^2/d^2_{ii}$, that have been rotated in $\mathbb{R}^p$.
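Again, this is easy to check numerically; a small sketch (taking $\sigma^2 = 1$ for the check):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4))
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# (X'X)^{-1} = V D^{-2} V'
print(np.allclose(np.linalg.inv(X.T @ X), Vt.T @ np.diag(d**-2.0) @ Vt))
```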
The Hat matrix
The hat matrix is $$H = X(X^\prime X)^{-1} X^\prime.$$ By means of the preceding result we may rewrite it as $$H = (UDV^\prime)(VD^{-1}U^\prime) = UU^\prime.$$ Simple!
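A short, self-contained numpy check of this identity (random data for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 3))
U, _, _ = np.linalg.svd(X, full_matrices=False)

H = X @ np.linalg.solve(X.T @ X, X.T)   # X (X'X)^{-1} X'
print(np.allclose(H, U @ U.T))          # True: H = UU'
```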
Eigenanalysis (spectral decomposition)
Since $$X^\prime X = VDU^\prime U D V^\prime = VD^2V^\prime$$ and $$XX^\prime = UDV^\prime VDU^\prime = UD^2U^\prime,$$ it is immediate that
- The eigenvalues of $X^\prime X$ and $XX^\prime$ are the squares of the singular values.
- The columns of $V$ are the eigenvectors of $X^\prime X$.
- The columns of $U$ are some of the eigenvectors of $X X^\prime$. (Other eigenvectors exist but correspond to zero eigenvalues.)
SVD can diagnose and solve collinearity problems.
Approximating the regressors
When you replace the smallest singular values with zeros, you will change the product $UDV^\prime$ only slightly. Now, however, the zeros eliminate the corresponding columns of $U$, effectively reducing the number of variables. Provided those eliminated columns have little correlation with $y$, this can work effectively as a variable-reduction technique.
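Here is a small numpy sketch of this truncation idea; the rank $k$ and the nearly low-rank matrix are my own arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 100, 5, 3                     # keep the k largest singular values
# A nearly rank-k design matrix: rank-k signal plus small noise.
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, p)) \
    + 0.01 * rng.standard_normal((n, p))

U, d, Vt = np.linalg.svd(X, full_matrices=False)
d_k = np.concatenate([d[:k], np.zeros(p - k)])  # zero the smallest values
X_k = U @ np.diag(d_k) @ Vt             # rank-k approximation of X

print(np.linalg.norm(X - X_k) / np.linalg.norm(X))  # small: only tiny
                                                    # singular values dropped
```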
Ridge Regression
Let the columns of $X$ be standardized, as well as $y$ itself. (This means we no longer need a constant column in $X$.) For $\lambda \gt 0$ the ridge estimator is $$\begin{aligned}\hat\beta_R &= (X^\prime X + \lambda\,1_p)^{-1}X^\prime y \\
&= (VD^2V^\prime + \lambda\,1_p)^{-1}VDU^\prime y \\
&= (VD^2V^\prime + \lambda V V^\prime)^{-1}VDU^\prime y \\
&= (V(D^2 + \lambda)V^\prime)^{-1} VDU^\prime y \\
&= V(D^2+\lambda)^{-1}V^\prime V DU^\prime y \\
&= V(D^2 + \lambda)^{-1} D U^\prime y.\end{aligned}$$
The difference between this and $\hat\beta$ is the replacement of $D^{-1} = D^{-2}D$ by $(D^2+\lambda)^{-1}D$. In effect, this multiplies the original by the fraction $D^2/(D^2+\lambda)$. Because (when $\lambda \gt 0$) the denominator is obviously greater than the numerator, the parameter estimates "shrink towards zero."
This result has to be understood in the somewhat subtle sense alluded to previously: the rotated estimates $V^\prime\hat\beta_R$ are still linear combinations of the vectors $U^\prime y$, but each coefficient--which used to be $d_{ii}^{-1}$--has been multiplied by a factor of $d_{ii}^2/(d_{ii}^2 + \lambda)$. As such, the rotated coefficients must shrink, but it is possible, when $\lambda$ is sufficiently small, for some of the $\hat\beta_R$ themselves actually to increase in size.
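The following numpy sketch puts the derivation to work: it computes the ridge estimator via the shrunk pseudo-inverse $V(D^2+\lambda)^{-1}DU^\prime y$, checks it against the direct solve, and prints the shrinkage factors $d_{ii}^2/(d_{ii}^2+\lambda)$ (the data and the value of $\lambda$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 100, 4, 2.0
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardized columns
y = rng.standard_normal(n)
y = (y - y.mean()) / y.std()                    # standardized response

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Ridge via the shrunk pseudo-inverse V (D^2 + lambda)^{-1} D U' y.
beta_R = Vt.T @ (d / (d**2 + lam) * (U.T @ y))

# Direct ridge solve for comparison.
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(beta_R, beta_direct))         # True

print(d**2 / (d**2 + lam))   # the shrinkage factors, all strictly < 1
```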
To avoid distractions, the case of one or more zero singular values was excluded from this discussion. In such circumstances, if we conventionally take "$d_{ii}^{-1}$" to be zero, then everything still works. This is what is going on when generalized inverses are used to solve the Normal Equations.
Best Answer
First, we note that ridge regression can be transformed into an OLS problem via the data augmentation trick: append the $p$ rows $\sqrt{\lambda}\,\mathbf{I}_p$ to the design matrix $\mathbf{X}$ (and $p$ zeros to $\mathbf{y}$), so the new matrix $\mathbf{X}_\lambda$ has dimensions $(n + p) \times p$.
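Here is a minimal numpy sketch of the augmentation trick (the variable names are mine, and this is not necessarily how scipy implements ridge internally):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 50, 3, 1.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Stack sqrt(lambda) * I_p below X, and p zeros below y.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

# OLS on the augmented data reproduces the ridge solution.
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(beta_aug, beta_ridge))   # True
```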
We can then look up the run time for various OLS algorithms, for example in the book Least Squares Data Fitting with Applications or in Golub's own book on Matrix Computations. (Golub was one of the inventors of the Golub-Reinsch SVD algorithm, which is still the state of the art.) Since linking to Google Books is not very reliable, these costs can also be found, for example, here.
We see that relevant costs for the non-ridge case, assuming $\mathbf{X}$ is $n \times p$, are:
$$ C_{LU} = 2 n p^2 - \frac{2}{3} p^3 \tag{1} $$
$$ C_{SVD} = 4 n p^2 + 8 p^3 \tag{2} $$
To obtain equations for ridge regression, we need to substitute $n \rightarrow n + p $ to correctly account for data augmentation:
$$ C_{LU} = 2 n p^2 + \frac{4}{3} p^3 \tag{3} $$
$$ C_{SVD} = 4 n p^2 + 12 p^3 \tag{4} $$
Note that, unlike your example, $n > p$ for most typical regression problems. If we say that $n \approx p$, we can obtain simplified costs in only one variable:
$$ C_{LU} = \frac{10}{3} p^3 \tag{5} $$ $$ C_{SVD} = 16 p^3 \tag{6} $$
This is probably where those rough estimates you reference came from, but as you yourself discovered, this simplification hides important detail and makes inappropriate assumptions that don't fit your case. Note, for example, that the constant for SVD is much worse than for Cholesky, an important fact hidden by the big $\mathord{O}$ notation.
What you are doing is exactly the opposite - you have $n \ll p$. That means the second term is much more important! By comparing the coefficients of the $p^3$ terms of equations (3) and (4), we can argue that the SVD approach will require roughly nine times as many floating point operations as the Cholesky approach when $n \ll p$, since $12 \div \frac{4}{3} = 9$.
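A quick sanity check of that ratio, plugging equations (3) and (4) into Python (the helper function names are mine):

```python
# Flop estimates from equations (3) and (4), for an n << p scenario.
def c_lu(n, p):
    return 2 * n * p**2 + (4 / 3) * p**3

def c_svd(n, p):
    return 4 * n * p**2 + 12 * p**3

n, p = 100, 10_000
print(c_svd(n, p) / c_lu(n, p))   # ~8.9, approaching 12 / (4/3) = 9
```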
Beyond that, it's hard to know exactly what scipy is doing under the hood - is it calculating both $\mathbf{U}$ and $\mathbf{V}$ when doing SVD? Is it using the randomized algorithm? Is it using data augmentation for ridge regression, or some other approach? And even once those details are known, it's hard to translate these theoretical operation counts to real-world run time, because libraries like LAPACK (scipy is almost surely using LAPACK or ATLAS under the hood) are so highly vectorized (taking advantage of CPU SIMD instructions) and often multithreaded that purely theoretical arguments rarely yield accurate estimates. But it's safe to say that it's reasonable to expect Cholesky factorization to be much faster than SVD for a given problem, especially when $n \ll p$.