I'm reading the Taylor Cross Entropy Loss paper and came across the formulation of Mean Absolute Error (MAE), which is described as follows:
$$
\mathcal{L}_{MAE}(f({\bf x}), y) = \Vert e_y - f({\bf x}) \Vert_{1} = 2 - 2f_y({\bf x})
$$
As mentioned in the linked paper's section 3.1, $e_y$ is a one-hot encoded vector with the same dimension as $f({\bf x})$.
What I don't understand is where the constant 2 in the above formulation comes from. I'm referring to this formulation of MAE on Wikipedia.
Any hint/reference would be appreciated. Thanks.
Edit
As requested in the comment, here is the context of the above formulation.
- The linked paper talks about training a deep neural network for $k$-class classification. ${\bf x}$ is a feature vector (e.g. an image of a cat) and $y$ is the ground truth label.
- The neural network is represented as an unknown complex function $f$.
- $e_y$ is the one-hot encoded vector of the ground truth. For example, if $k=2$ (cat and dog), then $e_y = [0, 1]$ for cat and $e_y = [1, 0]$ for dog.
- $f({\bf x})$ is what the neural network predicts. It can be probabilities for each class; for example, a well-trained network will output $f({\bf x}) = [0.05, 0.95]$ for an image of a cat.
- The mean absolute difference between the ground truth labels and the predicted outputs over all images is what I refer to as MAE.
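For concreteness, here is a minimal NumPy sketch of the setup, using the made-up cat example above (I'm assuming $f({\bf x})$ is a softmax-style probability vector):

```python
import numpy as np

# Made-up example: k = 2 (dog, cat), ground truth is cat (y = 1)
f_x = np.array([0.05, 0.95])   # network output f(x), a probability vector
y = 1
e_y = np.zeros_like(f_x)
e_y[y] = 1.0                   # one-hot encoding of the ground truth

print(np.abs(e_y - f_x).sum())  # ||e_y - f(x)||_1 = 0.1
print(2 - 2 * f_x[y])           # the paper's closed form, also 0.1
```

Numerically the two expressions agree, but I don't see the algebraic step.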
Best Answer
Suppose both $e_y$ and $f({\bf x})$ are vectors in which a single element has value $1$ and all other elements have value $0$. Then $\Vert e_y - f({\bf x})\Vert_{1}=0$ if the position of the $1$ is the same in both vectors, and $\Vert e_y - f({\bf x})\Vert_{1}=2$ if it is not. For $e_y$, the $1$ is at position $y$, so if $(f({\bf x}))_y$ is also $1$, the error is $0$; otherwise it is $2$. In both cases the error equals $2 - 2(f({\bf x}))_y$.
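A quick numeric illustration of this hard (one-hot) prediction case, with $k=2$ as in your example:

```python
import numpy as np

e_y = np.array([0., 1.])            # ground truth: cat
f_same = np.array([0., 1.])         # prediction: 1 in the same position
f_diff = np.array([1., 0.])         # prediction: 1 in a different position

print(np.abs(e_y - f_same).sum())   # 0.0
print(np.abs(e_y - f_diff).sum())   # 2.0
```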
Interestingly, the formula also holds when the assumption on $f({\bf x})$ is relaxed to assuming only that its elements are nonnegative and sum to $1$:
\begin{align}
\Vert e_y - f({\bf x}) \Vert_{1} &= \sum_j |(e_y)_j - (f({\bf x}))_j| \\
&= |(e_y)_y - (f({\bf x}))_y| + \sum_{j:j\neq y} |(e_y)_j - (f({\bf x}))_j| \\
&= |1 - (f({\bf x}))_y| + \sum_{j:j\neq y} |0 - (f({\bf x}))_j| \\
&= 1 - (f({\bf x}))_y + \sum_{j:j\neq y} (f({\bf x}))_j \\
&= 1 - 2(f({\bf x}))_y + \sum_j (f({\bf x}))_j \\
&= 2 - 2(f({\bf x}))_y
\end{align}
(Dropping the absolute values is valid because every element of $f({\bf x})$ lies in $[0,1]$, and the last step uses $\sum_j (f({\bf x}))_j = 1$.)
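As a sanity check, here is a small NumPy sketch that verifies the identity for a random probability vector (the dimension $k=5$, the class index, and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k, y = 5, 2
f_x = rng.random(k)
f_x /= f_x.sum()                 # nonnegative entries summing to 1

e_y = np.zeros(k)
e_y[y] = 1.0                     # one-hot ground truth

lhs = np.abs(e_y - f_x).sum()    # ||e_y - f(x)||_1
rhs = 2 - 2 * f_x[y]             # closed form from the derivation
print(np.isclose(lhs, rhs))     # True
```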