I'm a little confused about the role of the $L_2$-norm in optimization. Suppose my data is $n$-dimensional: I have input pairs $(x, y)$ and a function $f(x)$ that I want to learn. For my cost function, I sum, over all training examples, the square of the difference between $y$ and $f(x)$.
I'm unsure which part of this people mean when they talk about the $L_2$-norm. Is it:
1) You square the distance between each $y$ and $f(x)$ pair, so this summation is the $L_2$-norm of the differences over all pairs.
2) To calculate the distance between each $y$ and $f(x)$ pair, you square the difference between each element of the $n$-dimensional vector, sum, and take the square root to find the distance. This distance is the $L_2$-norm.
Which one? Thanks!
Best Answer
If I understand your question correctly, then option 1 is the right one. By definition, the $L_2$-norm (or $\ell^2$ norm) of an $n$-dimensional vector $\underline{x}$ is given by
$$ \|\underline{x}\|_{\ell^2} = \left(\sum_{i=1}^n x_i^2\right)^{\frac{1}{2}} $$
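As a quick sanity check, the definition can be computed directly. This is a minimal sketch (the function name `l2_norm` is my own, not from the post):

```python
import math

def l2_norm(x):
    """L2 norm: square each component, sum, then take the square root."""
    return math.sqrt(sum(xi * xi for xi in x))

# The classic 3-4-5 right triangle: the norm of (3, 4) is 5.
print(l2_norm([3.0, 4.0]))  # → 5.0
```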
In your case, what you are minimizing is the square of the $\ell^2$ norm of the residual vector $\underline{y}-\underline{f}$, where $f_i = f(x_i)$ and the index $i$ now runs over the training examples,
$$ \|\underline{y}-\underline{f}\|_{\ell^2}^2 = \sum_{i=1}^n (y_i - f(x_i))^2 $$
So the $\ell^2$ norm comes into play when you are measuring the total mismatch between data and predicted values.
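In code, the cost from the question is exactly this squared norm of the residuals. A hypothetical sketch (names are mine, not from the original post):

```python
def squared_l2_error(y, f):
    """Squared L2 norm of the residual vector y - f:
    the sum of squared differences, i.e. the sum-of-squares cost."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f))

# Targets y and predictions f(x_i) over three training examples.
y = [1.0, 2.0, 3.0]
f = [1.5, 2.0, 2.0]
print(squared_l2_error(y, f))  # 0.25 + 0.0 + 1.0 = 1.25
```

Note there is no square root here: the cost is the *squared* $\ell^2$ norm, which has the same minimizer and is easier to differentiate.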