There are now many penalized approaches with all kinds of different penalty functions (ridge, lasso, MCP, SCAD). The question of why a penalty takes a particular form is basically "what advantages/disadvantages does such a penalty provide?".
Properties of interest might be:
1) Near-unbiasedness (note that all penalized estimators will be biased, so the goal is to keep the bias small)
2) Sparsity (note that ridge regression does not produce sparse results, i.e. it does not shrink coefficients all the way to zero)
3) Continuity (to avoid instability in model prediction)
These are just a few of the properties one might want in a penalty function.
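To make the sparsity and bias properties concrete, here is a minimal sketch under a simplifying assumption: the design is orthonormal, so each penalized coefficient is a function of the corresponding OLS coefficient $z$ alone. The function names and the one-dimensional parametrizations in the docstrings are illustrative choices, not taken from any particular library.

```python
def ridge_shrink(z, lam):
    """Minimizer of 0.5*(z - b)**2 + 0.5*lam*b**2: proportional shrinkage."""
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    """Minimizer of 0.5*(z - b)**2 + lam*abs(b): soft thresholding."""
    magnitude = abs(z) - lam
    if magnitude <= 0.0:
        return 0.0  # small coefficients are set exactly to zero
    return magnitude if z > 0 else -magnitude


# Hypothetical OLS coefficients, for illustration only.
ols_coefs = [3.0, 0.5, -0.2]
lam = 1.0

ridge = [ridge_shrink(z, lam) for z in ols_coefs]  # [1.5, 0.25, -0.1]
lasso = [lasso_shrink(z, lam) for z in ols_coefs]  # [2.0, 0.0, 0.0]
print(ridge)
print(lasso)
```

Ridge shrinks every coefficient by the same factor and never reaches zero, while the lasso zeroes out the small coefficients but subtracts a constant from the large one, which is the bias that penalties such as MCP and SCAD are designed to reduce.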
It is a lot easier to work with a sum in derivations and theoretical work: e.g. $||\beta||_2^2=\sum |\beta_i|^2$ and $||\beta||_1 = \sum |\beta_i|$. Imagine if we had $\sqrt{\left(\sum |\beta_i|^2\right)}$ or $\left( \sum |\beta_i|\right)^2$ instead. Taking derivatives (which is necessary to show theoretical results like consistency, asymptotic normality, etc.) would be a pain with penalties like that, because the penalty no longer separates into one term per coefficient.
Can I say something about the propensity to overfit in (A) versus (B)?
Provided that both grids cover a sufficient range, grid fineness doesn't really have anything to do with overfitting in this problem (though a coarse grid might underfit if it skips over a profitable interval). It's not as if testing too many values will somehow change what out-of-sample performance looks like.* In the case of these penalized regressions, we definitely want to optimize our penalized likelihood function over $\lambda$, and it doesn't matter how many values of $\lambda$ we test, because out-of-sample performance for a fixed data set and fixed partitioning is entirely deterministic: the out-of-sample metric at a given $\lambda$ is not altered by how many other values of $\lambda$ you test. A coarser grid might mean that you skip over the absolute minimum of your out-of-sample metric, but finding the absolute minimum probably isn't desirable in the first place, because hyperparameters tend to be poorly estimated. In finite samples, data limitations are a source of noise in that estimate, and the standard error of the estimate will tend to swamp the effect of slight changes in the distance between adjacent grid points.
If you're truly concerned that the out-of-sample performance metric might be overly optimistic, you could adopt the 1 standard error rule, which picks the most regularized model within 1 standard error of the minimum. That way, you're being slightly more conservative and picking a less complex model.
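As a rough sketch of the 1 standard error rule, assume you already have, for each candidate $\lambda$, the mean cross-validated error and its standard error across folds. The function name `one_se_rule` and the numbers below are hypothetical and only illustrate the selection step; the sketch also assumes that a larger $\lambda$ means a more heavily regularized model.

```python
def one_se_rule(lambdas, cv_mean, cv_se):
    """Return the largest lambda whose mean CV error is within one
    standard error of the minimum mean CV error."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if err <= threshold]
    return max(eligible)


# Hypothetical cross-validation summary, for illustration only.
lambdas = [0.01, 0.1, 1.0, 10.0]
cv_mean = [1.05, 1.00, 1.04, 1.60]
cv_se = [0.05, 0.05, 0.05, 0.08]

chosen = one_se_rule(lambdas, cv_mean, cv_se)
print(chosen)  # 1.0, although the minimum CV error is at lambda = 0.1
```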
Can I determine the optimal grid fineness? How?
The LARS algorithm does not define a priori which values of $\lambda$ to check; rather, $\lambda$ is changed continuously and the algorithm checks for values of $\lambda$ at which a coefficient goes from 0 to a nonzero value. Those values of $\lambda$ where a new coefficient becomes nonzero are retained. Because coefficient paths are piecewise linear in the case of the lasso, there's no loss of information in storing only the knots. LARS only works when coefficient paths are piecewise linear, though. The ridge penalty never shrinks a coefficient to precisely zero, so all of your coefficient paths are smooth and always nonzero; likewise for elastic net regressions (excluding the case of elastic net regressions which are also lasso regressions).
But most people use GLMNET because it's often faster. In terms of determining what grid of $\lambda$ to search over, I recommend reading the GLMNET article "Regularization Paths for Generalized Linear Models via Coordinate Descent" by Jerome Friedman, Trevor Hastie, and Rob Tibshirani. In it, they develop a very efficient algorithm for estimating ridge, lasso and elastic net regressions. The algorithm finds a value $\lambda_\text{max}$ for which $\beta$ is the zero vector, and then sets a minimum value $\lambda_\text{min}$ relative to $\lambda_\text{max}$. Finally, it generates a sequence of values between the two, spaced uniformly on the log scale. This grid is sufficient for most purposes, though unlike LARS it does not tell you precisely where a coefficient first becomes nonzero. Warm starts are used to provide solutions much more quickly, and the algorithm supports many common GLMs.
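The grid construction described above can be sketched in a few lines for the lasso case. This is an illustration under stated assumptions, not GLMNET's actual code: the function name `lambda_grid`, its arguments, and the default values of `n_lambda` and `ratio` are hypothetical, the objective is taken to be $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, and $X$ and $y$ are assumed to be centered so that no intercept is needed.

```python
import math


def lambda_grid(X, y, n_lambda=100, ratio=1e-3):
    """Log-spaced grid from lambda_max down to ratio * lambda_max.

    X is a list of n rows (each a list of p covariates) and y is a list
    of n responses. Assumes the lasso objective
    (1/(2n))*||y - X b||^2 + lam*||b||_1 with centered data.
    """
    n, p = len(X), len(X[0])
    # lambda_max = max_j |x_j^T y| / n: at or above this value every
    # lasso coefficient is zero.
    lam_max = max(
        abs(sum(X[i][j] * y[i] for i in range(n))) for j in range(p)
    ) / n
    log_max = math.log(lam_max)
    log_min = math.log(ratio * lam_max)
    step = (log_min - log_max) / (n_lambda - 1)
    return [math.exp(log_max + k * step) for k in range(n_lambda)]


# Small hypothetical data set, for illustration only.
X = [[1.0, -1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]]
y = [2.0, -1.0, 3.0, -2.0]

grid = lambda_grid(X, y, n_lambda=5, ratio=1e-2)
print(grid)  # decreasing, with a constant ratio between neighbours
```

Fitting from the largest $\lambda$ to the smallest is what makes warm starts effective: each solution is a good starting point for the next, slightly less regularized, problem.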
*You might be thinking about this from the perspective of an artificial neural network, where early stopping is sometimes used to accomplish regularization, but that's an entirely unrelated problem (namely, that the optimization algorithm is prevented from reaching an optimum, so the model is forced to be less complex).
Best Answer
Yes, you should take into account the scale of the output $y$, and you should also take into account the scale of the covariates in $X$.
Let $X \in \mathbb{R}^{n \times p}$ be the design matrix, whose rows are vectors of covariates that together seek to explain the response $y \in \mathbb{R}^n$. Each entry of the response, $y_i = f(e_i^T X) + \epsilon_i$ (for $i = 1, \dots, n$), is additively composed of a signal that depends on the covariates and iid mean-zero noise. Choosing to model the signal $f$ as being approximately linear leads us to the LASSO estimate $$\hat \beta_\lambda = \arg\min_\beta \frac{1}{2n} \|y-X\beta\|_2^2 + \lambda \|\beta\|_1.$$ By the first-order conditions, we know that $\frac{1}{n} X^T (y - X \hat \beta_\lambda) = \lambda \hat{z}_\lambda$, where $\hat{z}_\lambda$ is the dual variable satisfying $\hat{z}_{\lambda,j} = \operatorname{sgn}(\hat{\beta}_{\lambda, j})$ if $\hat{\beta}_{\lambda, j} \neq 0$ and $\hat{z}_{\lambda, j} \in [-1,1]$ if $\hat{\beta}_{\lambda, j} = 0$.
Plugging $\hat{\beta}_\lambda = 0$ into this equation, we see that $\frac{1}{n} X^T y = \lambda \hat{z}_\lambda$, so that $$\frac{1}{n} \|X^T y \|_\infty = \lambda \|\hat{z}_{\lambda}\|_\infty.$$
If $\|\hat{z}_\lambda\|_\infty < 1$, then $\lambda$ could decrease (with $\|\hat{z}_\lambda\|_\infty$ increased to maintain equality) and the LASSO estimate would still be $\hat{\beta}_\lambda = 0$. Therefore, at $\lambda_\mathrm{max}$, the smallest value of $\lambda$ that produces $\hat{\beta}_{\lambda}=0$, we get that $$\frac{1}{n} \|X^T y\|_\infty = \lambda_\mathrm{max} \cdot 1.$$
This tells us that there's no need to consider $\lambda > \lambda_\mathrm{max}$ when tuning the LASSO, and that $\lambda_\mathrm{max}$ scales with both $y$ and $X$. In practice, most solvers standardize the columns of $X$, so the scale of the covariates does not need to be taken into account directly. (Note that it's reasonable to standardize the covariates, since the units of measurement shouldn't affect the fitted model.)
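The formula $\lambda_\mathrm{max} = \frac{1}{n}\|X^T y\|_\infty$ can be checked numerically. The following is a minimal sketch using a plain cyclic coordinate descent solver written for this illustration (the names `soft_threshold` and `lasso_cd`, the fixed number of sweeps, and the toy data are all assumptions); it uses the same objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ with no intercept.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes
            # column j's current contribution.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Small hypothetical data set, for illustration only.
X = [[1.0, 2.0], [-1.0, 1.0], [2.0, -1.0], [-2.0, -2.0]]
y = [3.0, -1.0, 2.0, -4.0]
n, p = len(X), len(X[0])

# lambda_max = ||X^T y||_inf / n
lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n))) for j in range(p)) / n

beta_at_max = lasso_cd(X, y, lam_max)        # every coefficient is zero
beta_below = lasso_cd(X, y, 0.9 * lam_max)   # a coefficient enters the model
print(lam_max, beta_at_max, beta_below)
```

Rescaling `y` by a constant rescales `lam_max` by the same constant, which is why the scale of the response matters when choosing the range of $\lambda$.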
The ridge case is discussed well here: Maximum penalty for ridge regression