I'm a little confused about the role of the $L_2$-norm in optimization. Suppose my data is $n$-dimensional: I have input pairs $(x, y)$ and a function $f(x)$ that I want to learn. For my cost function, I sum, over all training examples, the square of the difference between $y$ and $f(x)$.
I'm unsure which part of this people mean when they talk about the $L_2$-norm. Is it:
1) You square the distance between each $y$ and $f(x)$ pair, so this summation is the $L_2$-norm of the differences over all pairs.
2) To calculate the distance between each $y$ and $f(x)$ pair, you square the difference between each element of the $n$-dimensional vector, sum, and take the square root to find the distance. This distance is the $L_2$-norm.
Which one? Thanks!
Best Answer
If I understand your question correctly, then option 1 is the right one. By definition, the $L_2$-norm (or $\ell^2$ norm) of an $n$-dimensional vector $\underline{x}$ is given by
$$ \|\underline{x}\|_{\ell^2} = \left(\sum_{i=1}^n x_i^2\right)^{\frac{1}{2}} $$
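As a quick sanity check, the definition can be computed directly. This is a minimal sketch (the function name `l2_norm` is my own, not from the post):

```python
import math

def l2_norm(x):
    """L2 norm: square each component, sum, then take the square root."""
    return math.sqrt(sum(xi * xi for xi in x))

# The classic 3-4-5 right triangle: the norm of (3, 4) is 5.
print(l2_norm([3.0, 4.0]))  # → 5.0
```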
In your case, what you are minimizing is the square of the $\ell^2$ norm of the residual vector $\underline{y}-\underline{f}$, where $f_i = f(x_i)$ and the index $i$ now runs over the training examples,
$$ \|\underline{y}-\underline{f}\|_{\ell^2}^2 = \sum_{i=1}^n (y_i - f(x_i))^2 $$
So the $\ell^2$ norm comes into play when you are measuring the total mismatch between data and predicted values.
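In code, the cost from the question is exactly this squared norm of the residuals. A hypothetical sketch (names are mine, not from the original post):

```python
def squared_l2_error(y, f):
    """Squared L2 norm of the residual vector y - f:
    the sum of squared differences, i.e. the sum-of-squares cost."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f))

# Targets y and predictions f(x_i) over three training examples.
y = [1.0, 2.0, 3.0]
f = [1.5, 2.0, 2.0]
print(squared_l2_error(y, f))  # 0.25 + 0.0 + 1.0 = 1.25
```

Note there is no square root here: the cost is the *squared* $\ell^2$ norm, which has the same minimizer and is easier to differentiate.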