Solved – error, cost, loss, risk, are those 4 terms the same in the context of machine learning

machine learningmathematical-statisticsneural networks

People uses these 4 terms when talking about deep learning, machine learning, data science.

error (such as training error, test error), cost (such as training cost, test cost), loss (such as empirical loss), risk (such as empirical risk)

Are those 4 terms the same in the context of machine learning?

Answers with authoritative reference would be better, such as published paper, text book, user guide or documentation.

Best Answer

Cost and loss are often used interchangeably. However, sometimes people deliberately use the term loss to refer to a single training example, and cost is used in the context of the whole training set. For instance, for the $i$th training example, you may have a loss function $\mathcal{L}(x_i, y_i)$, and you may write the cost as $\frac{1}{n} \sum_{i}^{n}\mathcal{L}(x_i, y_i)$.
Cost and error can be the same, depending what your loss function is. For example, say you have a logistic regression classifier, the classification cost and classification error are two different things. Here, you would define the cost via the negative log-likelihood loss, and the error is based on the 0/1 loss. The 0/1 loss is intuitive but not differentiable. In other cases, like OLS regression, you minimize the mean-squared error and can also report it as the training error. So in that case they are the same
The "risk" is usually the expected loss. However, since that's hard to compute in most cases, people use the term "empirical risk" which is basically again just averaging over the losses (so, basically similar as cost)

Your specific question...

I would disagree with your contention that the two follow "completely different notation." Both $X\boldsymbol{\beta} = \boldsymbol{y}$ and $A\mathbf{x} = \mathbf{b}$ use capital letters to denote matrices. They're not that different.
Machine learning is highly related to statistics, a large and mature field. Using $X$ to represent the data matrix is almost certainly the most readable, most standard convention to follow. While $A\mathbf{x} = \mathbf{b}$ is standard for solving linear systems, that's not how people doing statistics write the normal equations. You'll find your audience more confused if you try to do that. When in Rome...
In some sense, the heart of your revised question is, "What are the historical origins of statistics using the letter $x$ to represent data and the letter $\beta$ to represent the unknown variable to solve for?"
- This is a question for the statistical historians! Briefly searching, I see the influential British statistician and Cambridge academic Udny Yule used $x$ to represent data in his Introduction to the Theory of Statistics (1911). He wrote a regression equation as $x_1 = a + bx_2$, with the least squares objective as minimizing $\sum\left( x_1 - a - bx_2\right)^2$, and with solution $b_{12} = \frac{\sum x_1x_2}{\sum x_2^2}$. It at least goes back to then...
- The even more influential R.A. Fisher used $y$ for the dependent variable and $x$ for the independent variable in his 1925 book Statistical Methods for Research Workers. (Hat tip to @Nick Cox for providing link with info.)

Good notation is like good language. Avoid field specific jargon whenever possible. Write in the math equivalent of high BBC English, language that is understandable to most anyone that speaks English. One should write, whenever possible, using notation that is clear and that is broadly understood.

Solved – Why is it important to include a bias correction term for the Adam optimizer for Deep Learning

The problem of NOT correcting the bias
According to the paper

In case of sparse gradients, for a reliable estimate of the second moment one needs to average over many gradients by chosing a small value of β2; however it is exactly this case of small β2 where a lack of initialisation bias correction would lead to initial steps that are much larger.

Normally in practice $\beta_2$ is set much closer to 1 than $\beta_1$ (as suggested by the author $\beta_2=0.999$, $\beta_1=0.9$), so the update coefficients $1-\beta_2=0.001$ is much smaller than $1-\beta_1=0.1$.

In the first step of training $m_1=0.1g_t$, $v_1=0.001g_t^2$, the $m_1/(\sqrt{v_1}+\epsilon)$ term in the parameter update can be very large if we use the biased estimation directly.

On the other hand when using the bias-corrected estimation, $\hat{m_1}=g_1$ and $\hat{v_1}=g_1^2$, the $\hat{m_t}/(\sqrt{\hat{v_t}}+\epsilon)$ term becomes less sensitive to $\beta_1$ and $\beta_2$.

How the bias is corrected
The algorithm uses moving average to estimate the first and second moments. The biased estimation would be, we start at an arbitrary guess $m_0$, and update the estimation gradually by $m_t=\beta m_{t-1}+(1-\beta)g_t$. So it's obvious in the first few steps our moving average is heavily biased towards the initial $m_0$.

To correct this, we can remove the effect of the initial guess (bias) out of the moving average. For example at time 1, $m_1=\beta m_0+(1-\beta)g_t$, we take out the $\beta m_0$ term from $m_1$ and divide it by $(1-\beta)$, which yields $\hat{m_1}=(m_1- \beta m_0)/(1-\beta)$. When $m_0=0$, $\hat{m_t}=m_t/(1-\beta^t)$. The full proof is given in Section 3 of the paper.

As Mark L. Stone has well commented

It's like multiplying by 2 (oh my, the result is biased), and then dividing by 2 to "correct" it.

Somehow this is not exactly equivalent to

the gradient at initial point is used for the initial values of these things, and then the first parameter update

(of course it can be turned into the same form by changing the update rule (see the update of the answer), and I believe this line mainly aims at showing the unnecessity of introducing the bias, but perhaps it's worth noticing the difference)

For example, the corrected first moment at time 2

$$\hat{m_2}=\frac{\beta(1-\beta)g_1+(1-\beta)g_2}{1-\beta^2}=\frac{\beta g_1+g_2}{\beta+1}$$

If using $g_1$ as the initial value with the same update rule, $$m_2=\beta g_1+(1-\beta)g_2$$ which will bias towards $g_1$ instead in the first few steps.

Is bias correction really a big deal
Since it only actually affects the first few steps of training, it seems not a very big issue, in many popular frameworks (e.g. keras, caffe) only the biased estimation is implemented.

From my experience the biased estimation sometimes leads to undesirable situations where the loss won't go down (I haven't thoroughly tested that so I'm not exactly sure whether this is due to the biased estimation or something else), and a trick that I use is using a larger $\epsilon$ to moderate the initial step sizes.

Update
If you unfold the recursive update rules, essentially $\hat{m}_t$ is a weighted average of the gradients,
$$\hat{m}_t=\frac{\beta^{t-1}g_1+\beta^{t-2}g_2+...+g_t}{\beta^{t-1}+\beta^{t-2}+...+1}$$ The denominator can be computed by the geometric sum formula, so it's equivalent to following update rule (which doesn't involve a bias term)

$m_1\leftarrow g_1$
while not converge do
$\qquad m_t\leftarrow \beta m_t + g_t$ (weighted sum)
$\qquad \hat{m}_t\leftarrow \dfrac{(1-\beta)m_t}{1-\beta^t}$ (weighted average)

Therefore it can be possibly done without introducing a bias term and correcting it. I think the paper put it in the bias-correction form for the convenience of comparing with other algorithms (e.g. RmsProp).

Best Answer

Related Solutions

Solved – What are the classical notations in statistics, linear algebra and machine learning? And what are the connections between these notations

Your specific question...

Solved – Why is it important to include a bias correction term for the Adam optimizer for Deep Learning

Related Question