Solved – Is a polynomial kernel ridge regression really equivalent to performing a linear regression on those expanded features

kernel trick, machine learning, ridge regression, svm

Say we have a dataset $X$ of size $N \times 2$, where $N$ is the number of examples and 2 is the number of dimensions ("features"). If we were to run a kernel ridge regression (or SVM or whatever) on these features using a polynomial kernel of degree 2, it is my understanding that this would be equivalent to mapping your 2 dimensions to a feature space of all pairwise products and squares between the two dimensions, along with appropriate coefficients, and then performing linear regression in that space.

My question is: is what I said above true? My confusion lies in the fact that the feature mapping that the literature says to use is some fixed mapping, e.g. $(x_1, x_2) \mapsto \left(1,\ x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2\right)$, so the relative weights of those terms are fixed. However, if we were to run a linear regression based on those features, our model is free to learn any relative weighting that's optimal. In particular, it seems like the latter approach has more degrees of freedom (I'm not 100% clear on how to describe this confusion).

Best Answer

If we were to run a kernel ridge regression (or SVM or whatever) on these features using a polynomial kernel of degree 2, it is my understanding that this would be equivalent to mapping your 2 dimensions to a feature space of all pairwise products and squares between the two dimensions, along with appropriate coefficients, and then performing linear regression in that space.

That's correct. A kernel method is equivalent to the corresponding linear method operating in the feature space defined by the kernel function. For a polynomial kernel of degree 2, the features consist of squared, linear, interaction, and constant terms; see the Wikipedia article on the polynomial kernel for details. Performing kernel ridge regression is therefore equivalent to performing ordinary (linear) ridge regression on these terms.
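For concreteness, take one common convention for the degree-2 polynomial kernel, $k(x, x') = (1 + x \cdot x')^2$ with $x = (x_1, x_2)$ (the exact constants depend on the kernel's offset and scaling). Expanding the square shows exactly which features are involved:

$$(1 + x \cdot x')^2 = 1 + 2 x_1 x_1' + 2 x_2 x_2' + x_1^2 {x_1'}^2 + x_2^2 {x_2'}^2 + 2 x_1 x_2\, x_1' x_2' = \Phi(x) \cdot \Phi(x'),$$

where

$$\Phi(x) = \left(1,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2\right).$$

The $\sqrt{2}$ factors are not learned weights; they are just the constants needed so that the dot product of feature vectors reproduces the kernel.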

My confusion lies in the fact that the feature mapping that the literature says to use is some fixed mapping, e.g. $(x_1, x_2) \mapsto \left(1,\ x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2\right)$, so the relative weights of those terms are fixed. However, if we were to run a linear regression based on those features, our model is free to learn any relative weighting that's optimal.

There are multiple mappings involved, and it sounds like you might be confusing some of them with each other. First, let's forget about kernels. Say we want to learn a nonlinear function from an input space to an output space (e.g. the real numbers for a regression problem). One way to solve this problem would be to first map the inputs into some feature space using a fixed nonlinear function $\Phi$ (called the feature space mapping): given an input, $\Phi$ outputs a vector of features. Next, we learn a linear function from feature space to output space (e.g. using linear regression). Composing the two gives a way to map inputs nonlinearly to outputs. For example, in a regression problem, this function would look like $f(x) = \beta \cdot \Phi(x) + c$ (where $\beta$ is a weight vector containing a coefficient for each feature and $c$ is a constant).
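As a concrete sketch of this explicit-features approach (the toy data, the particular feature map, and the ridge penalty below are all illustrative assumptions, not something fixed by the problem):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N examples, 2 input dimensions, nonlinear target (illustrative only).
N = 50
X = rng.normal(size=(N, 2))
y = 1.0 + X[:, 0] ** 2 - 2.0 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=N)


def phi(X):
    """Fixed degree-2 feature map Phi corresponding to the kernel (1 + x.x')^2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([
        np.ones_like(x1),        # constant term
        np.sqrt(2) * x1,         # linear terms (with the fixed sqrt(2) scaling)
        np.sqrt(2) * x2,
        x1 ** 2,                 # squared terms
        x2 ** 2,
        np.sqrt(2) * x1 * x2,    # interaction term
    ])


# Ordinary ridge regression in the explicit feature space:
# minimize ||y - Phi(X) w||^2 + lam * ||w||^2  =>  w = (Phi'Phi + lam I)^{-1} Phi'y.
lam = 0.1
Phi = phi(X)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

X_new = rng.normal(size=(5, 2))
print(phi(X_new) @ w)  # f(x) = w . Phi(x) for new inputs
```

Here the constant feature in $\Phi$ plays the role of the intercept $c$ and is regularized along with the other coefficients, which keeps the comparison with kernel ridge regression in the next sketch exact.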

Now, say we define a function $k$ (called the kernel function) that takes two elements of the input space and returns their dot product in feature space. That is, $k(x,x') = \Phi(x) \cdot \Phi(x')$. It turns out that a linear model in feature space can be expressed entirely in terms of such dot products, and, using this fact, it's possible to implicitly compute (and learn) the same mapping from inputs to outputs using only the inputs and the kernel function. This is called the kernel trick. In this case, it's not necessary to explicitly construct the feature space representations, or to deal with the coefficients of a linear model in feature space. For example, in a regression problem, the function would look like:

$$f(x) = \sum_{i=1}^n \alpha_i k(x, x_i) + c$$

where $\{(x_i, y_i)\}$ are the training inputs/outputs used to fit the model and $\{\alpha_i\}$ are coefficients that must be learned. The reason to use the kernel trick is that, in some circumstances, it can be more computationally efficient than operating explicitly in feature space (for example, when the feature space is very high dimensional or even infinite dimensional).
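Here is the same toy problem solved with the kernel trick, a minimal sketch under the same illustrative assumptions as the previous one: the dual coefficients $\alpha_i$ come from solving $(K + \lambda I)\alpha = y$, the feature map is never constructed on the kernel side, and the predictions match the explicit-feature ridge fit to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative toy setup as in the previous sketch.
N = 50
X = rng.normal(size=(N, 2))
y = 1.0 + X[:, 0] ** 2 - 2.0 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=N)
X_new = rng.normal(size=(5, 2))
lam = 0.1


def phi(X):
    """Explicit degree-2 feature map for the kernel (1 + x.x')^2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), np.sqrt(2) * x1, np.sqrt(2) * x2,
                            x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])


def k(A, B):
    """Polynomial kernel of degree 2: k(x, x') = (1 + x . x')^2."""
    return (1.0 + A @ B.T) ** 2


# Primal: ridge regression on the explicit features.
Phi = phi(X)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
pred_primal = phi(X_new) @ w

# Dual (kernel trick): solve (K + lam I) alpha = y, never forming Phi.
K = k(X, X)
alpha = np.linalg.solve(K + lam * np.eye(N), y)
pred_dual = k(X_new, X) @ alpha  # f(x) = sum_i alpha_i k(x, x_i)

print(np.allclose(pred_primal, pred_dual))  # True: the two fits agree
```

(In this sketch the constant term is absorbed into the kernel's offset, so no separate unpenalized intercept $c$ is fit on either side.)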

It sounds like you might have confused the kernel function (which is fixed) with the linear function from feature space to outputs (which is learned implicitly when using the kernel trick).