Can I say something about the propensity to overfit in (A) versus (B)?
Provided that both grids cover a sufficient range, grid fineness doesn't really have anything to do with overfitting in this problem (though a coarse grid might underfit if it skips over a profitable interval). It's not as if testing too many values will somehow change what out-of-sample looks like.* In the case of these penalized regressions, we definitely want to optimize our penalized likelihood function for values $\lambda$, and it doesn't matter how many values of $\lambda$ we test, because out-of-sample performance for a fixed data set and fixed partitioning is entirely deterministic. More to the point, the out-of-sample metric is not at all altered by how many values $\lambda$ you test. A coarser grid might mean that you skip over the absolute minimum in your out-of-sample metric, but finding the absolute minimum probably isn't desirable in the first place because hyperparameters tend to be poorly estimated, and finite sample properties mean that data limitations will be a source noise in that estimate that will overwhelm slight changes in the distance between adjacent grid points: the standard error of your estimate will tend to swamp differences in grid fineness.
If you're truely concerned that out-of-sample performance metric might be overly optimistic, you could adopt the 1 standard error rule, which picks the most regularized model within 1 standard error of the minimum. That way, you're being slightly more conservative and picking a less complex model.
Can I determine the optimal grid fineness? How?
The LARS algorithm does not a priori define which values of $\lambda$ to check; rather, $\lambda$ is changed continuously and the algorithm checks for values of $\lambda$ for which a coefficient goes from 0 to a nonzero value. Those values of $\lambda$ where a new coefficient is nonzero are retained, with the observation that coefficient paths are piecewise linear in the case of the lasso, so there's no loss of information by just storing off the knots in that case. LARS only works when coefficient paths are piecewise linear, though. The ridge penalty never shrinks a coefficient to precisely zero, so all of your coefficient paths are smooth and always nonzero; likewise elastic net regressions (excluding the case of elastic net regressions which are also lasso regressions).
But most people use GLMNET because it's often faster. In terms of determining what grid of $\lambda$ to search over, I recommend reading the GLMNET article "Regularization Paths for Generalized Linear Models via Coordinate Descent" by Jerome Friedman, Trevor Hastie, and Rob Tibshirani. In it, they develop a very efficient algorithm for estimating ridge, lasso and elastic net regressions. The algorithm checks for a value of $\lambda_\text{max}$ for which $\beta$ is the zero vector, and then identifies a minimum value $\lambda_\text{min}$ relative to $\lambda_\text{max}$. Finally, they generate a sequence of values between the two uniformly on the log scale. This grid is sufficient for most purposes, though it does omit the property that you will know precisely when a coefficient is estimated at a nonzero value. Warm starts are used to provide solutions much more quickly, and it supports many common GLMs.
*You might be thinking about this from the perspective of an artificial neural network, where early stopping is sometimes used to accomplish regularization, but that's an entirely unrelated problem (namely, that the optimization algorithm is prevented from reaching an optimum, so the model is forced to be less complex).
Well, the parameters that represent higher exponentials (x3,x4) are drasticly increasing the complexity of our model. So shouldn't we penalize more for high w3,w4 values than we penalize for high w1,w2 values?
The reason we say that adding quadratic or cubic terms increases model complexity is that it leads to a model with more parameters overall. We don't expect a quadratic term to be in and of itself more complex than a linear term. The one thing that's clear is that, all other things being equal, a model with more covariates is more complex.
For the purposes of regularization, one generally rescales all the covariates to have equal mean and variance so that, a priori, they are treated as equally important. If some covariates do in fact have a stronger relationship with the dependent variable than others, then, of course, the regularization procedure won't penalize those covariates as strongly, because they'll have greater contributions to the model fit.
But what if you really do think a priori that one covariate is more important than another, and you can quantify this belief, and you want the model to reflect it? Then what you probably want to do is use a Bayesian model and adjust the priors for the coefficients to match your preexisting belief. Not coincidentally, some familiar regularization procedures can be construed as special cases of Bayesian models. In particular, ridge regression is equivalent to a normal prior on the coefficients, and lasso regression is equivalent to a Laplacian prior.
Best Answer
There is no such rule about specific order polynomials which is agnostic to your dataset. If any such rule existed, I would expect it to be a function of your data or your data generating process - without knowing something about that, it's hard to say. Without saying anything about specific order polynomials, your general statement is right - larger order polynomials are more likely to overfit. As the order of the polynomial increases, so does the variance of the estimator.
Yes, this is a common issue with higher order polynomials. It is similar in spirit to Runge's phenomenon. The common solutions are to find the best order via cross-validation (grid search), or by controlling the size of the coefficients with regularization.