If we were to run a kernel ridge regression (or SVM or whatever) on these features using a polynomial kernel of degree 2, it is my understanding that this would be equivalent to mapping your 2 dimensions to a feature space of all pairwise products and squares between the two dimensions, along with appropriate coefficients, and then performing linear regression in that space.
That's correct. A kernel method is equivalent to the corresponding linear method operating on the feature space defined by the kernel function. In the case of the polynomial kernel with degree 2, the features consist of squared, linear, interaction, and constant terms. See the Wikipedia page on the polynomial kernel for details. Performing kernel ridge regression would be equivalent to performing ordinary (linear) ridge regression on these terms.
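Concretely, for two-dimensional inputs and the variant of the kernel with an additive constant of 1, $k(x, x') = (1 + x \cdot x')^2$, the expansion is
$$ (1 + x_1 x_1' + x_2 x_2')^2 = 1 + x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2\,x_1 x_2\, x_1' x_2' + 2\,x_1 x_1' + 2\,x_2 x_2', $$
which is exactly the dot product $\Phi(x) \cdot \Phi(x')$ of the feature vectors
$$ \Phi(x) = \left(1,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2\right). $$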
My confusion lies in the fact that the feature mapping the literature says to use is some fixed mapping $(x_1, x_2) \mapsto \left(1,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2\right)$, so the relative weights of those terms are fixed. However, if we were to run a linear regression based on those features, our model would be free to learn whatever relative weighting is optimal.
There are multiple mappings involved, and it sounds like you might be confusing some with each other. First, let's forget about kernels. Say we want to learn a nonlinear function from an input space to an output space (e.g. the real numbers for a regression problem). One way to solve this problem would be to first map the inputs into some feature space using a fixed nonlinear function $\Phi$ (called the feature space mapping). Given an input, it outputs a vector of features. Next, we learn a linear function from feature space to output space (e.g. using linear regression). We now have a way to map inputs nonlinearly to outputs. For example, in a regression problem, this function would look like $f(x) = \beta \cdot \Phi(x) + c$ (where $\beta$ is a weight vector containing a coefficient for each feature and $c$ is a constant).
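For concreteness, here is a minimal numpy sketch of this explicit feature-space approach, using the degree-2 polynomial kernel's feature map from above; the toy data and the ridge penalty are made up purely for illustration, and the constant feature plays the role of $c$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-dimensional regression data (made up purely for illustration).
X = rng.normal(size=(50, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=50)
lam = 1.0  # ridge penalty, also made up

def phi(X):
    """Fixed feature-space mapping for the degree-2 polynomial kernel (1 + x . x')^2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([
        np.ones_like(x1),        # constant term (plays the role of c)
        x1 ** 2, x2 ** 2,        # squared terms
        np.sqrt(2) * x1 * x2,    # interaction term
        np.sqrt(2) * x1,         # linear terms
        np.sqrt(2) * x2,
    ])

# Learn the linear function from feature space to outputs, here with ridge regression:
#   beta = (Phi^T Phi + lam I)^{-1} Phi^T y,   f(x) = beta . Phi(x)
P = phi(X)
beta = np.linalg.solve(P.T @ P + lam * np.eye(P.shape[1]), P.T @ y)
y_hat = P @ beta
```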
Now, say we define a function $k$ (called the kernel function) that takes two elements of input space and returns their dot product in feature space. That is: $k(x,x') = \Phi(x) \cdot \Phi(x')$. It turns out that a linear model in feature space can be expressed entirely in terms of dot products between feature vectors. Using this fact, it's possible to implicitly compute (and learn) the same mapping from inputs to outputs using only the inputs and the kernel function. This is called the kernel trick. In this case, it's not necessary to explicitly construct the feature space representations, or to deal with the coefficients of a linear model in feature space. For example, in a regression problem, the function would look like:
$$f(x) = \sum_{i=1}^n \alpha_i k(x, x_i) + c$$
where $\{(x_i, y_i)\}$ are the training inputs/outputs used to fit the model and $\{\alpha_i\}$ are coefficients that must be learned. The reason to use the kernel trick is that it can be more computationally efficient than operating explicitly in feature space in some circumstances (for example, when the feature space is very high dimensional or infinite dimensional).
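To make this concrete, here is a small numpy sketch (again with made-up data and penalty $\lambda$) that fits kernel ridge regression in its dual form and checks that the predictions agree with the explicit feature-space version from the previous sketch. Kernel ridge regression is used here simply because its coefficients have a closed form; the bias is absorbed into the kernel's constant term, so there is no separate $c$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=50)
lam = 1.0

def kernel(A, B):
    """Degree-2 polynomial kernel k(x, x') = (1 + x . x')^2, computed pairwise."""
    return (1.0 + A @ B.T) ** 2

def phi(X):
    """Explicit feature map corresponding to the same kernel."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1 ** 2, x2 ** 2,
                            np.sqrt(2) * x1 * x2, np.sqrt(2) * x1, np.sqrt(2) * x2])

# Dual / kernel-trick view: alpha = (K + lam I)^{-1} y,  f(x) = sum_i alpha_i k(x, x_i).
K = kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
y_hat_dual = kernel(X, X) @ alpha

# Primal view: explicit features plus ordinary ridge regression.
P = phi(X)
beta = np.linalg.solve(P.T @ P + lam * np.eye(P.shape[1]), P.T @ y)
y_hat_primal = P @ beta

print(np.allclose(y_hat_dual, y_hat_primal))  # True: the two views give the same predictions
```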
It sounds like you might have confused the kernel function (which is fixed) with the linear function from feature space to outputs (which is learned implicitly when using the kernel trick).
First, we note that ridge regression can be transformed into an OLS problem via the data augmentation trick. This adds $p$ rows to the design matrix $\mathbf{X}$, so the new matrix $\mathbf{X}_\lambda$ has dimensions $(n + p) \times p$.
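Concretely, with ridge penalty $\lambda$ the augmented problem is ordinary least squares on
$$ \mathbf{X}_\lambda = \begin{pmatrix} \mathbf{X} \\ \sqrt{\lambda}\,\mathbf{I}_p \end{pmatrix}, \qquad \mathbf{y}_\lambda = \begin{pmatrix} \mathbf{y} \\ \mathbf{0}_p \end{pmatrix}, $$
since $\lVert \mathbf{y}_\lambda - \mathbf{X}_\lambda \beta \rVert^2 = \lVert \mathbf{y} - \mathbf{X}\beta \rVert^2 + \lambda \lVert \beta \rVert^2$.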
We can then look up the run time for various OLS algorithms, for example in the book Least Squares Data Fitting with Applications or in Golub's own book, Matrix Computations. (Golub was one of the inventors of the Golub-Reinsch SVD algorithm, which is still the state of the art.) Since linking to Google Books is not very reliable, these costs can also be found, for example, here.
We see that the relevant costs for the non-ridge case, assuming $\mathbf{X}$ is $n \times p$, are:
$$ C_{LU} = 2 n p^2 - \frac{2}{3} p^3 \tag{1} $$
$$ C_{SVD} = 4 n p^2 + 8 p^3 \tag{2} $$
To obtain equations for ridge regression, we need to substitute $n \rightarrow n + p $ to correctly account for data augmentation:
$$ C_{LU} = 2 n p^2 + \frac{4}{3} p^3 \tag{3} $$
$$ C_{SVD} = 4 n p^2 + 12 p^3 \tag{4} $$
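(To spell out the arithmetic: $2(n+p)p^2 - \tfrac{2}{3}p^3 = 2np^2 + 2p^3 - \tfrac{2}{3}p^3 = 2np^2 + \tfrac{4}{3}p^3$ for (3), and $4(n+p)p^2 + 8p^3 = 4np^2 + 12p^3$ for (4).)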
Note that, unlike in your example, $n > p$ for most typical regression problems. If we say that $n \approx p$, we can obtain simplified costs in only one variable:
$$ C_{LU} = \frac{10}{3} p^3 \tag{5} $$
$$ C_{SVD} = 16 p^3 \tag{6} $$
This is probably where those rough estimates you reference came from, but as you yourself discovered, this simplification hides important detail and makes inappropriate assumptions that don't fit your case. Note, for example, that the constant for SVD is much worse than for Cholesky, an important fact hidden by the big $\mathord{O}$ notation.
What you are doing is exactly the opposite: you have $n \ll p$. That means the second term is much more important! Comparing the coefficients of the $p^3$ terms in equations (3) and (4) ($12$ versus $\tfrac{4}{3}$), we can argue that the SVD approach will require roughly 9 times as many floating point operations as the Cholesky approach when $n \ll p$.
Beyond that, it's hard to know exactly what scipy is doing under the hood: is it calculating both $\mathbf{U}$ and $\mathbf{V}$ when doing the SVD? Is it using a randomized algorithm? Is it using data augmentation for ridge regression, or some other approach? And even once those details are known, it's hard to translate these theoretical operation counts into real-world run time, because libraries like LAPACK (scipy is almost surely using LAPACK or ATLAS under the hood) are so heavily vectorized (taking advantage of CPU SIMD instructions) and often multithreaded that it's hard to estimate accurately from purely theoretical arguments. But it's safe to say that it's reasonable to expect Cholesky factorization to be much faster than the SVD for a given problem, especially when $n \ll p$.
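As a rough illustration (and not necessarily what scipy does internally; the dimensions and penalty below are made up), here is a small numpy/scipy sketch that solves the same ridge problem once via the normal equations with a Cholesky factorization and once via the SVD of the augmented matrix:

```python
import time
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n, p, lam = 100, 2000, 1.0            # n << p, as in the question
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Ridge via the normal equations and a Cholesky factorization:
#   (X^T X + lam I) beta = X^T y
t0 = time.perf_counter()
c, low = cho_factor(X.T @ X + lam * np.eye(p))
beta_chol = cho_solve((c, low), X.T @ y)
t_chol = time.perf_counter() - t0

# Ridge via the SVD of the augmented matrix [X; sqrt(lam) I]:
t0 = time.perf_counter()
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
U, s, Vt = np.linalg.svd(X_aug, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y_aug) / s)
t_svd = time.perf_counter() - t0

print(f"Cholesky: {t_chol:.3f} s    SVD: {t_svd:.3f} s")
print(np.allclose(beta_chol, beta_svd))   # both give the same ridge solution
```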
Best Answer
Cholesky decomposition also has a complexity of $O(n^3)$ operations, so it has the same computational scaling properties as Gaussian process regression. If you want to reduce the time complexity, you could investigate a sparse approximation of the least-squares support vector machine (LSSVM), which is equivalent to dual ridge regression and which I have found useful; see e.g.
Gavin C. Cawley and Nicola L. C. Talbot, "Improved sparse least-squares support vector machines", Neurocomputing, volume 48, number 1, pages 1025-1031, 2002
and
Gavin C. Cawley and Nicola L. C. Talbot, "A greedy training algorithm for sparse least-squares support vector machines", International Conference on Artificial Neural Networks (ICANN 2002), pages 681-686, 2002
If greedy selection of representation vectors is too slow, simply picking a subset of the data at random tends to work quite well (as long as all of the training data appear in the loss function).
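For example, here is a minimal numpy sketch of that random-subset idea in the style of a subset-of-regressors approximation to dual (kernel) ridge regression; the RBF kernel, the basis size, and the data are made up for illustration, and this is not necessarily the exact algorithm of the papers cited above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 2000, 100, 1e-2          # n training points, m << n representation vectors
X = rng.normal(size=(n, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel, computed pairwise between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Pick a random subset of the training inputs as representation vectors.
basis = X[rng.choice(n, size=m, replace=False)]

# All n training points still enter the loss; only the expansion is sparse:
#   f(x) = sum_j alpha_j k(x, basis_j)
#   minimize ||y - K_nm alpha||^2 + lam * alpha^T K_mm alpha
K_nm = rbf_kernel(X, basis)          # n x m
K_mm = rbf_kernel(basis, basis)      # m x m
alpha = np.linalg.solve(K_nm.T @ K_nm + lam * K_mm, K_nm.T @ y)

y_hat = K_nm @ alpha                 # predictions now cost O(m) per point, not O(n)
```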
I am not very keen on support vector regression as the $\epsilon$-insensitive loss function doesn't have a straightforward probabilistic interpretation, which is often important in regression applications.