Yes, the need for feature scaling depends on the kernel, and in general scaling is a good idea. Most kernels are effectively functions of a distance or dot product, and if different features vary on different scales, this can matter. For the RBF kernel, for instance, we have
$$
K(x, x') = \exp\left(-\gamma ||x-x'||^2\right)
$$
so if one dimension takes much larger values than the others, it will dominate the kernel values and you'll lose signal from the other dimensions. The same applies to the linear kernel.
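To make that dominance concrete, here's a small NumPy sketch (the feature scales are made up for illustration):

```python
import numpy as np

def rbf(x, xp, gamma=1.0):
    # RBF kernel: exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xp)) ** 2))

# Two points that differ only in the small-scale second feature...
a, b = np.array([1000.0, 1.0]), np.array([1000.0, 2.0])
# ...and two that differ only in the large-scale first feature.
c, d = np.array([1000.0, 1.0]), np.array([1500.0, 1.0])

# The squared distance (and hence the kernel value) is driven almost
# entirely by the large-scale dimension: 1.0 vs 250000.0 here.
print(np.sum((a - b) ** 2))
print(np.sum((c - d) ** 2))
```

Any meaningful variation in the second feature is invisible next to the first unless you rescale.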
But this doesn't apply to all kernels, since some have scaling built in. For example, you could do something like the ARD kernel or Mahalanobis kernel with
$$
K(x, x') = \exp\left(-\gamma (x-x')^T\hat \Sigma^{-1}(x-x')\right)
$$
where $\hat \Sigma$ is the sample covariance matrix, or perhaps just the diagonal matrix of the individual feature variances. As a function of $x$ and $x'$ this is still positive definite, so it's a valid kernel.
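Here's a sketch of that kernel with the diagonal-variance version of $\hat\Sigma$ (the data matrix `X` is made up for illustration):

```python
import numpy as np

def ard_rbf(x, xp, inv_sigma, gamma=1.0):
    # K(x, x') = exp(-gamma * (x - x')^T Sigma^{-1} (x - x'))
    d = np.asarray(x) - np.asarray(xp)
    return np.exp(-gamma * d @ inv_sigma @ d)

# Estimate per-feature variances from the data and invert the diagonal.
X = np.array([[1000.0, 1.0], [1200.0, 2.0], [800.0, 3.0], [1100.0, 1.5]])
inv_sigma = np.diag(1.0 / X.var(axis=0))

# Dividing each squared difference by its feature's variance puts the
# dimensions on a comparable footing, so neither dominates by scale alone.
k = ard_rbf(X[0], X[1], inv_sigma)
```

This is equivalent to standardizing the features first and then applying a plain RBF kernel, which is why such kernels don't need separate scaling.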
As a general strategy for deciding whether this is an issue for any particular kernel, do what they did in the linked question: try it with data like $x=(1000,1,2,3)$, $x'=(500, .5, 3, 2)$ and see whether the first dimension necessarily dominates.
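Carrying out that check for the RBF kernel with those exact vectors:

```python
import numpy as np

x = np.array([1000.0, 1.0, 2.0, 3.0])
xp = np.array([500.0, 0.5, 3.0, 2.0])

# Per-dimension contributions to the squared distance inside the RBF kernel.
sq = (x - xp) ** 2
share = sq / sq.sum()  # fraction of ||x - x'||^2 from each dimension
print(share)
```

The first dimension accounts for essentially all of the squared distance (over 99.9% here), so for this kernel scaling clearly matters.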
Another way to assess a given kernel is to check whether it inherits scale issues from its subfunctions. For example, consider the polynomial kernel $K_{poly}(x,x') = (a+cx^Tx')^d$. We can write this as a function of the linear kernel $x^Tx'$, which we already know to be sensitive to scale, and the map $z \mapsto (a+cz)^d$ won't undo scale issues, so we can see that the polynomial kernel inherits these issues. We can do a similar analysis by writing the RBF kernel as a function of the scale-sensitive $||x-x'||^2$.
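A quick sketch of that composition argument for the polynomial kernel (the test vectors are made up for illustration):

```python
import numpy as np

def poly_kernel(x, xp, a=1.0, c=1.0, d=2):
    # K_poly(x, x') = (a + c * x^T x')^d -- a monotone map of the linear kernel
    return (a + c * np.dot(x, xp)) ** d

x = np.array([1000.0, 1.0])
xp = np.array([500.0, 0.5])

# The linear kernel x^T x' = 500000.5 is dominated by the first dimension,
# and raising it to a power only amplifies that dominance.
print(np.dot(x, xp))
print(poly_kernel(x, xp))
```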
If we were to run a kernel ridge regression (or SVM or whatever) on these features using a polynomial kernel of degree 2, it is my understanding that this would be equivalent to mapping your 2 dimensions to a feature space of all pairwise products and squares between the two dimensions, along with appropriate coefficients, and then performing linear regression in that space.
That's correct. A kernel method is equivalent to the corresponding linear method operating on the feature space defined by the kernel function. In the case of the polynomial kernel with degree 2, the features consist of squared, linear, interaction, and constant terms. See this wikipedia page for details. Performing kernel ridge regression would be equivalent to performing ordinary (linear) ridge regression on these terms.
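You can verify the equivalence numerically: the explicit degree-2 feature map (constant, linear, squared, and interaction terms, with the fixed $\sqrt{2}$ coefficients) reproduces the kernel exactly as a dot product. The example points are arbitrary:

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, x') = (1 + x^T x')^2 in two dimensions.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.5, -2.0])
xp = np.array([0.5, 3.0])

# The dot product in feature space equals the kernel evaluated on the inputs.
print(np.dot(phi(x), phi(xp)))
print((1 + np.dot(x, xp)) ** 2)
```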
My confusion lies in the fact that the feature mapping that the literature says to use is some fixed mapping x1,x2 -> 1 + x1^2 + x2^2 + sqrt(2) * x1x2, so the relative weights for each of those terms is fixed. However, if we were to run a linear regression based on those features, our model is free to learn any relative weighting that's optimal.
There are multiple mappings involved, and it sounds like you might be confusing some with each other. First, let's forget about kernels. Say we want to learn a nonlinear function from an input space to an output space (e.g. the real numbers for a regression problem). One way to solve this problem would be to first map the inputs into some feature space using a fixed nonlinear function $\Phi$ (called the feature space mapping). Given an input, it outputs a vector of features. Next, we learn a linear function from feature space to output space (e.g. using linear regression). We now have a way to map inputs nonlinearly to outputs. For example, in a regression problem, this function would look like $f(x) = \beta \cdot \Phi(x) + c$ (where $\beta$ is a weight vector containing a coefficient for each feature and $c$ is a constant).
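Here's what the explicit (non-kernel) version looks like as code: a fixed map $\Phi$, then ordinary least squares in feature space. The particular features and target function are made up for illustration:

```python
import numpy as np

def phi(x):
    # A fixed nonlinear feature space mapping (here: degree-2 terms).
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1 + 2 * X[:, 0] ** 2 - X[:, 0] * X[:, 1]  # nonlinear in the inputs

# Learn a *linear* function of the features by least squares:
# f(x) = beta . Phi(x)  (the constant c is absorbed into Phi's first entry).
Phi = np.array([phi(x) for x in X])
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
pred = Phi @ beta
```

Since the target is linear in the chosen features, the linear model in feature space fits it exactly even though it's nonlinear in the original inputs.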
Now, say we define a function $k$ (called the kernel function) that takes two elements of input space and returns their dot product in feature space. That is: $k(x,x') = \Phi(x) \cdot \Phi(x')$. It turns out that a linear model can be expressed using dot products. And, using this fact, it's possible to implicitly compute (and learn) the same mapping from inputs to outputs using only the inputs and the kernel function. This is called the kernel trick. In this case, it's not necessary to explicitly construct the feature space representations, or to deal with coefficients of a linear model in feature space. For example, in a regression problem, the function would look like:
$$f(x) = \sum_{i=1}^n \alpha_i k(x, x_i) + c$$
where $\{(x_i, y_i)\}$ are the training inputs/outputs used to fit the model and $\{\alpha_i\}$ are coefficients that must be learned. The reason to use the kernel trick is that it can be more computationally efficient than operating explicitly in feature space in some circumstances.
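A minimal sketch of learning those $\alpha_i$ via kernel ridge regression (omitting the constant $c$ for simplicity; the data and ridge penalty are made up for illustration):

```python
import numpy as np

def k(x, xp):
    # Degree-2 polynomial kernel: k(x, x') = (1 + x^T x')^2
    return (1 + x @ xp) ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = X[:, 0] ** 2 + X[:, 1]

lam = 1e-3  # ridge penalty
K = np.array([[k(a, b) for b in X] for a in X])  # n x n Gram matrix

# Kernel ridge regression in the dual: alpha = (K + lam * I)^{-1} y
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def f(x):
    # f(x) = sum_i alpha_i * k(x, x_i); no explicit feature vectors needed.
    return sum(a_i * k(x, x_i) for a_i, x_i in zip(alpha, X))
```

Note that only kernel evaluations appear; the feature space coordinates and their coefficients $\beta$ are never constructed.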
It sounds like you might have confused the kernel function (which is fixed) with the linear function from feature space to outputs (which is learned implicitly when using the kernel trick).
The kernel matrix is $m\times m$ if you have $m$ training examples, because it is the matrix of dot products in the transformed space. But the weight vector has nothing to do with $m$: when the kernel is linear, its dimension is the original feature dimension (+1 if you embed the bias term inside the weights). When the kernel is not linear, you typically don't compute the weights at all, since you may not know the explicit transformation; the kernel is used directly in the task, e.g. in the dual formulation for binary classification.
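A small sketch of the two different dimensions involved, for the linear-kernel case (sizes are made up for illustration):

```python
import numpy as np

m, d = 5, 3  # m training examples, each with d original features
X = np.random.default_rng(2).normal(size=(m, d))

# The Gram (kernel) matrix is m x m regardless of the feature dimension.
K = X @ X.T  # linear kernel: K[i, j] = x_i . x_j
print(K.shape)

# With a linear kernel the primal weight vector lives in the original
# d-dimensional space, e.g. w = X^T alpha for dual coefficients alpha.
alpha = np.ones(m)
w = X.T @ alpha
print(w.shape)
```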