Iterative algorithm for computing $\Sigma^{1/2} x$

Tags: algorithms, matrices, numerical linear algebra, positive-semidefinite, symmetric matrices

Say I have a PSD matrix $\Sigma$ and a vector $x$. Is there an iterative algorithm for computing $\Sigma^{1/2} x$ that is faster than computing $\Sigma^{1/2}$ via a Cholesky decomposition?
(In this question, by $\Sigma^{1/2}$ I always just mean some matrix $L$ such that $\Sigma=LL^T$.)

For example, I might use Newton's method:
$$
L_{k+1} = \tfrac{1}{2}\left(L_k + L_k^{-1} \Sigma\right), \qquad L_0 = \Sigma,
$$

which converges to $L_{\infty} = \Sigma^{1/2}$.
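
For concreteness, here is a minimal NumPy sketch of this matrix-level Newton iteration (the function name newton_sqrtm, the tolerance, and the random test matrix are my own; in floating point the plain iteration is only reliable when $\Sigma$ is reasonably well conditioned, which is one reason stabilized variants such as Denman-Beavers exist):

import numpy as np

def newton_sqrtm(Sigma, tol=1e-12, maxiter=50):
    # Newton iteration L_{k+1} = (L_k + L_k^{-1} Sigma) / 2 with L_0 = Sigma;
    # in exact arithmetic it converges to the principal square root of Sigma.
    L = Sigma.copy()
    for _ in range(maxiter):
        L_new = 0.5 * (L + np.linalg.solve(L, Sigma))
        if np.linalg.norm(L_new - L) <= tol * np.linalg.norm(L):
            return L_new
        L = L_new
    return L

# Quick check on a small, well-conditioned PSD matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
Sigma = B @ B.T + 5 * np.eye(5)
L = newton_sqrtm(Sigma)
print(np.max(np.abs(L @ L - Sigma)))  # residual should be near machine precision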
If applied to a single vector:
\begin{align}
L_{k+1} x = \tfrac{1}{2}\left(L_k x + L_k^{-1} \Sigma x\right), \qquad L_0 = \Sigma,
\end{align}

I can compute each step in $O(n^2 \sqrt{\kappa})$ time, using the conjugate gradient method (where $\kappa$ is the condition number of $L_k$) to compute $L_k^{-1} (\Sigma x)$.
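
To make that cost claim concrete, the per-step linear solve can be done with SciPy's conjugate gradient routine; this is just an illustrative SPD solve (the matrix, sizes, and seed below are my own, not anything from the post):

import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(1)
n = 500
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)   # generic SPD matrix standing in for L_k
b = rng.standard_normal(n)    # right-hand side standing in for Sigma @ x

# Each CG iteration costs one matrix-vector product, i.e. O(n^2), and
# roughly sqrt(kappa) iterations give a fixed relative accuracy.
y, info = cg(A, b)
assert info == 0              # info == 0 means CG converged
print(np.linalg.norm(A @ y - b) / np.linalg.norm(b))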

I'm not sure how many Newton iterations I need (or Denman-Beavers iterations, or some other type of iteration), but I assume that also depends on the condition number somehow.
All in all, I can compute $\Sigma^{1/2} x$ in $O(n^2 f(\kappa))$ time for some function $f$ of the condition number.

However, running a conjugate gradient solve inside every step of an outer iterative method seems wasteful. I doubt the above method is ever competitive with just computing $\Sigma^{1/2}$ using Cholesky.
But I do wonder: Is there an iterative method to compute $\Sigma^{1/2} x$ fast?
For example, something like the conjugate gradient method (or even Jacobi iteration), but whose purpose is to compute the square root (or inverse square root) rather than the inverse?

(By the way, I should mention that the reason I want to compute $\Sigma^{1/2} x$ is to sample from a multivariate normal distribution. If there's a different method to do that in $O(n^2)$ time, I'd also be very interested.)
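
For reference, the standard way to do this with an explicit factor is to draw $z \sim N(0, I)$ and return $\mu + L z$ for any $L$ with $L L^T = \Sigma$, e.g. the Cholesky factor; a minimal sketch (the helper name and the use of NumPy's Generator are my own choices):

import numpy as np

def sample_mvn_cholesky(mu, Sigma, rng):
    # Any L with L @ L.T == Sigma works, since Cov(L z) = L I L^T = Sigma.
    L = np.linalg.cholesky(Sigma)       # O(n^3) factorization: the baseline to beat
    z = rng.standard_normal(len(mu))    # z ~ N(0, I)
    return mu + L @ z                   # sample ~ N(mu, Sigma)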

Update: According to Computing $A^\alpha$, $\log(A)$ and Related Matrix Functions by Contour Integrals (Theorem 2.1, eq. 2.9), there is at least one iterative square-root method that takes only $\approx\log(\kappa)$ iterations, where each iteration requires a linear solve.
That means we can find $\Sigma^{1/2}x$ in roughly $n^2 \sqrt{\kappa} \log(\kappa)$ time.
Though I'm still curious whether one could get down to just $n^2 \sqrt{\kappa}$ by exploiting the fact that the later linear solves are very close to the previous ones.

Best Answer

It turns out the "Newton-based" algorithm in my original question doesn't work, since it requires solving a linear system in $L_k$, but we never actually keep $L_k$ around, only $L_k x$. Luckily there is another iterative method that does work: I found the interesting article Computing $A^\alpha$, $\log(A)$ and Related Matrix Functions by Contour Integrals.

Translated from MATLAB to Python, and specialized to computing the square-root-times-vector product, it looks like this:

import numpy as np
import scipy.special

def sqrtm_vp(A, x, m=None, M=None, niter=10):
    # Approximates A^{1/2} @ x via the quadrature of eq. (2.9) in the paper.
    if m is None or M is None:
        # Extreme eigenvalues of the symmetric matrix A. (In practice you would
        # estimate these, e.g. with a few Lanczos iterations, rather than pay
        # for a full eigendecomposition.)
        eigs = np.linalg.eigvalsh(A)
        m, M = eigs[0], eigs[-1]

    # Quadrature nodes and weights built from Jacobi elliptic functions.
    Kp = scipy.special.ellipk(1 - m / M)
    t = (0.5 + np.arange(niter)) * Kp / niter
    sn, cn, dn, _ = scipy.special.ellipj(t, 1 - m / M)
    w2 = m * (sn / cn) ** 2                                      # squared pole shifts
    dzdt = (2 * Kp * np.sqrt(m) / (np.pi * niter)) * dn / cn**2  # quadrature weights

    d = A.shape[0]
    Sx = np.zeros(d)
    # Sx accumulates an approximation of A^{-1/2} @ x.
    # The shifted solves are independent, so this loop could run in parallel.
    for j in range(niter):
        Sx += dzdt[j] * np.linalg.solve(A + w2[j] * np.eye(d), x)
    return A @ Sx   # A @ (A^{-1/2} x) = A^{1/2} x
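
As a quick sanity check (not part of the original answer), the output can be compared against the dense principal square root from scipy.linalg.sqrtm; the test matrix and sizes below are my own and deliberately well conditioned:

import numpy as np
import scipy.linalg

rng = np.random.default_rng(0)
d = 100
B = rng.standard_normal((d, d))
A = B @ B.T + d * np.eye(d)        # PSD test matrix with modest condition number
x = rng.standard_normal(d)

y_iter = sqrtm_vp(A, x, niter=20)
y_ref = scipy.linalg.sqrtm(A) @ x  # dense O(d^3) reference
print(np.linalg.norm(y_iter - y_ref) / np.linalg.norm(y_ref))  # should be tiny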

The paper provides error bounds on the order of $\exp(-O(N/\log\kappa))$, meaning we need only $N=O(\log\kappa)$ iterations for a fixed target accuracy, each of which takes a single linear solve, i.e. $O(d^2 \sqrt{\kappa})$ time with the conjugate gradient method.
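
To actually realize that $O(d^2 \sqrt{\kappa})$ per solve, the dense np.linalg.solve in the loop above can be replaced by conjugate gradient on the shifted system, done matrix-free; a sketch under that assumption (the helper name and default tolerance are mine):

import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def solve_shifted_cg(A, shift, b):
    # Solve (A + shift * I) y = b without forming the shifted matrix;
    # each CG iteration is one O(d^2) matvec, and ~sqrt(kappa) iterations suffice.
    d = A.shape[0]
    op = LinearOperator((d, d), matvec=lambda v: A @ v + shift * v, dtype=A.dtype)
    y, info = cg(op, b)
    assert info == 0
    return y

Inside sqrtm_vp, the call np.linalg.solve(A + w2[j] * np.eye(d), x) would then become solve_shifted_cg(A, w2[j], x); warm-starting each solve from the previous solution is exactly the kind of reuse the update to the question speculates about.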

I still don't know whether it's possible to go from $d^2 \sqrt{\kappa}\log\kappa$ time down to just $d^2 \sqrt{\kappa}$, but at least this is pretty close.