Machine Learning – How Logistic Loss and Cross-Entropy are Related

information theory, machine learning, probability distributions

I have seen the Kullback-Leibler divergence, log-loss and cross-entropy described as if they were the same loss function. Is the logistic-loss function used in logistic regression equivalent to the cross-entropy function? If yes, can anybody explain how they are related?

Thanks

Best Answer

The relationship between cross-entropy, logistic loss and K-L divergence is quite natural and follows directly from the definitions.

Cross-entropy is defined as: \begin{equation} H(p, q) = \operatorname{E}_p[-\log q] = H(p) + D_{\mathrm{KL}}(p \| q)=-\sum_x p(x)\log q(x) \end{equation} where $p$ and $q$ are two distributions, $H(p)$ is the entropy of $p$, and the second equality uses the definition of K-L divergence. Now if $p$ is the Bernoulli distribution $\{y, 1-y\}$ and $q$ is the Bernoulli distribution $\{\hat{y}, 1-\hat{y}\}$, we can rewrite the cross-entropy as: \begin{equation} H(p, q) = -\sum_x p(x) \log q(x) =-y\log \hat{y}-(1-y)\log (1-\hat{y}) \end{equation} which is nothing but the logistic loss.
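To make the reduction concrete, here is a minimal numeric sketch (my own illustration, not part of the original answer; the function names are just placeholders) showing that the general cross-entropy sum over the Bernoulli distributions $\{y, 1-y\}$ and $\{\hat{y}, 1-\hat{y}\}$ gives exactly the binary logistic-loss value:

```python
# Illustrative sketch only (not from the original answer); helper names are my own.
import numpy as np

def cross_entropy(p, q):
    """General cross-entropy H(p, q) = -sum_x p(x) log q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

def logistic_loss(y, y_hat):
    """Binary logistic loss: -y log(y_hat) - (1 - y) log(1 - y_hat)."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

y, y_hat = 1.0, 0.8                    # observed label and predicted probability
p, q = [y, 1 - y], [y_hat, 1 - y_hat]  # the corresponding Bernoulli distributions
print(cross_entropy(p, q))             # ~0.2231
print(logistic_loss(y, y_hat))         # ~0.2231, identical to the cross-entropy
```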

Further, the log loss is related to the logistic loss and cross-entropy as follows. The expected log loss is defined as: \begin{equation} \operatorname{E}_p[-\log q] \end{equation} This is the loss used in logistic regression, where $q$ is the output of the sigmoid function. The excess risk of this loss, i.e. its risk minus the risk of the optimal predictor $q = p$, is: \begin{equation} \operatorname{E}_p[\log p - \log q ]=\operatorname{E}_p\left[\log\frac{p}{q}\right]=D_{\mathrm{KL}}(p\|q) \end{equation} Notice that the K-L divergence is nothing but the excess risk of the log loss, and that it differs from the cross-entropy by the additive constant $H(p)$ (see the first definition). One important thing to remember is that in logistic regression we usually minimize the empirical log loss (the average of $-\log q$ over the training samples) rather than the cross-entropy itself, which would require an expectation under the true $p$; the two are not exactly the same, but the empirical average is what we can compute, and it works well in practice.
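As a quick check of the decomposition $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$, here is another small sketch (again my own, with hypothetical helper names) that computes the three quantities for a pair of Bernoulli distributions and confirms that the K-L divergence is exactly the gap between the cross-entropy and the entropy of $p$:

```python
# Illustrative sketch only (my own helper names), checking H(p, q) = H(p) + D_KL(p || q).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = [0.7, 0.3]  # "true" Bernoulli distribution
q = [0.6, 0.4]  # model's Bernoulli distribution
print(cross_entropy(p, q))               # ~0.6325
print(entropy(p) + kl_divergence(p, q))  # ~0.6325, the same value
# H(p) does not depend on q, so minimizing H(p, q) over q and minimizing
# D_KL(p || q) over q pick out the same q.
```

Since $H(p)$ is fixed by the data-generating distribution, minimizing the cross-entropy over $q$ and minimizing the K-L divergence over $q$ lead to the same solution, which is why the two objectives are used interchangeably in practice.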