Machine Learning Boosting – Solving XGBoost Objective Derivation Problem

boosting, cross entropy, derivative, ensemble learning, machine learning

This is the loss function of XGBoost:

\begin{equation} L^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \end{equation}

This is the second-order approximation of the loss function:

\begin{equation} L^{(t)} \approx \sum_{i=1}^{n} \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \end{equation}
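Here $g_i$ and $h_i$ are, as in the standard XGBoost derivation, the first and second derivatives of the loss with respect to the previous prediction:

\begin{equation} g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^2} \end{equation}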

Note:

\begin{equation} L^{(t)} \text{: the cross-entropy loss function at iteration } t. \end{equation}

\begin{equation} y_i \text{: the label, written as } y \text{ for convenience.} \end{equation}

\begin{equation} \hat{y}_i^{(t-1)} \text{: the predicted probability from the previous iteration, written as } p \text{ for convenience.} \end{equation}

\begin{equation} f_t(x_i) \text{: the output value of the new tree.} \end{equation}


\begin{equation} g_i = \frac{\partial l(y, p)}{\partial p} = \frac{p-y}{p(1-p)} \end{equation}


\begin{equation} h_i = \frac{\partial^2 l(y, p)}{\partial p^2} = \frac{(p-y)\,p + y(1-p)}{p^2(1-p)^2} = \frac{y}{p^2} + \frac{1-y}{(1-p)^2} \end{equation}

Why does everyone instead compute $g_i$ and $h_i$ as:

\begin{equation} g_i = \frac{\partial l(y, p)}{\partial x} = p - y \end{equation}

\begin{equation} h_i = \frac{\partial^2 l(y, p)}{\partial x^2} = p(1-p) \end{equation}

NOTE:
Here $x$ is the log-odds (logit):

\begin{equation} x = \ln\!\left(\frac{p}{1-p}\right) \end{equation}

and $\hat{y}_i^{(t-1)}$ (i.e. $p$) is obtained from $x$ through the sigmoid function:

\begin{equation} p = \frac{e^x}{1+e^x} = \frac{1}{1+e^{-x}} \end{equation}

Why are the derivatives taken with respect to $x$ (or y_hat in the code) instead of $p$?

\begin{equation} g_i = \frac{\partial l(y, p)}{\partial p} \;\rightarrow\; \frac{\partial l(y, p)}{\partial x} \end{equation}

Best Answer

Since you mention probabilities, I assume you are thinking about binary classification, in which case the trees all operate in log-odds space, the $x$s in your notation. So $\hat{y}$ and $f_t$ are actually log-odds, not probabilities.
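In other words, XGBoost's $g_i$ and $h_i$ are derivatives with respect to the raw score $x$, and the chain rule through the sigmoid link turns the question's expressions into the familiar ones (a short sketch in the question's notation):

\begin{equation} \frac{\partial p}{\partial x} = p(1-p), \qquad g_i = \frac{\partial l(y,p)}{\partial x} = \frac{\partial l(y,p)}{\partial p}\,\frac{\partial p}{\partial x} = \frac{p-y}{p(1-p)}\; p(1-p) = p - y \end{equation}

\begin{equation} h_i = \frac{\partial^2 l(y,p)}{\partial x^2} = \frac{\partial (p-y)}{\partial x} = \frac{\partial p}{\partial x} = p(1-p) \end{equation}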

See also What is the "binary:logistic" objective function in XGBoost?
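As a quick sanity check (a small numpy sketch assuming the standard binary cross-entropy, not anything taken from XGBoost's source), finite differences in the log-odds $x$ reproduce $g_i = p - y$ and $h_i = p(1-p)$:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(y, x):
    # cross-entropy written as a function of the raw score (log-odds) x
    p = sigmoid(x)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y, x = 1.0, 0.3                 # arbitrary label / raw score pair
p = sigmoid(x)
eps = 1e-4

# central finite differences for the first and second derivative in x
g_num = (loss(y, x + eps) - loss(y, x - eps)) / (2.0 * eps)
h_num = (loss(y, x + eps) - 2.0 * loss(y, x) + loss(y, x - eps)) / eps**2

print(g_num, p - y)             # both approximately -0.4256
print(h_num, p * (1.0 - p))     # both approximately  0.2445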
