[Math] The derivation of the derivative of softmax regression (or multinomial logistic regression)

machine-learning, multivariable-calculus, optimization

Consider the training cost for softmax regression (I will use the term multinomial logistic regression):

$$ J( \theta ) = - \sum^m_{i=1} \sum^K_{k=1} 1 \{ y^{(i)} = k \} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$
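
For concreteness, here is a minimal NumPy sketch of this cost as I understand it, assuming $\theta$ is stored as a $K \times n$ matrix whose $k$th row is $\theta^{(k)}$ and the labels $y^{(i)}$ are integers in $\{0, \dots, K-1\}$ (the function name `cost` is just mine):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = -sum_i sum_k 1{y^(i) = k} * log p(y^(i) = k | x^(i); theta)."""
    scores = X @ theta.T                                  # (m, K); entry (i, k) is theta_k^T x^(i)
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilise the exponentials
    P = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax probabilities
    m = X.shape[0]
    # the indicator picks out, for each i, only the log-probability of the true class
    return -np.sum(np.log(P[np.arange(m), y]))
```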

According to the UFLDL tutorial, the derivative of the above function is:

$$ \bigtriangledown_{ \theta^{(k)} }J( \theta ) = -\sum^{m}_{i=1} [x^{(i)} (1 \{ y^{(i)} = k \} - p(y^{(i)} = k \mid x^{(i)} ; \theta) ) ] $$
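
For what it's worth, this is how I read that formula in code, under the same conventions as above ($\theta$ as a $K \times n$ matrix, integer labels; `grad_theta_k` is just a name I made up):

```python
import numpy as np

def grad_theta_k(theta, X, y, k):
    """-sum_i x^(i) * (1{y^(i) = k} - p(y^(i) = k | x^(i); theta)), shape (n,)."""
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)
    P = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # (m, K) class probabilities
    indicator = (y == k).astype(float)                              # 1{y^(i) = k}
    return -X.T @ (indicator - P[:, k])
```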

However, they didn't include the derivation. Does someone know what the derivation is?

I have tried taking the derivative myself, but even my initial steps seem to disagree with the final form they have.

So I first took the gradient $\bigtriangledown_{ \theta^{(k)} }J( \theta )$ as they suggested:

$$ \bigtriangledown_{ \theta^{(k)} } J( \theta ) = - \bigtriangledown_{ \theta^{(k)} } \sum^m_{i=1} \sum^K_{k=1} 1 \{ y^{(i)} = k \} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$

but since we are taking the gradient with respect to $\theta^{(k)}$, only the term that matches this specific $k$ will be non-zero when we take derivatives. Hence:

$$ \bigtriangledown_{ \theta^{(k)} } J( \theta ) = - \sum^m_{i=1} \bigtriangledown_{ \theta^{(k)} } \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$

Then, if we proceed, we get:

$$ - \sum^m_{i=1} \frac{1}{p(y^{(i)} = k \mid x^{(i)} ; \theta)} \bigtriangledown_{ \theta^{(k)} } p(y^{(i)} = k \mid x^{(i)} ; \theta) $$

However, at this point the equation looks so different from what the UFLDL tutorial has, and the indicator function has disappeared completely, that I suspect I made a mistake somewhere. On top of that, the final derivative contains a difference, but I don't see any subtraction anywhere in my derivation. I suspect the difference might come in when applying the quotient rule, but the vanishing indicator function still worries me. Any ideas?

Best Answer

I encountered the same problem and tackled it after understanding the algorithm. Here I'd like to explain a bit.

$$ J( \theta ) = - \sum^m_{i=1} \sum^K_{k=1} 1 \{ y^{(i)} = k \} \log p(y^{(i)} = k \mid x^{(i)} ; \theta) $$

Just imagine the probabilities $p(y^{(i)} = k \mid x^{(i)} ; \theta)$ as a 2D matrix, with each entry $(i, k)$ being a probability, each row being the probability vector for the $i$th sample (or observation $i$), and each column corresponding to a category $k$.

Let's expand $p$ and rewrite the equation as:

$$ J( \theta ) = - \sum^m_{i=1} \sum^K_{k=1} 1 \{y^{(i)}=k\} \log \frac{e^{\theta^T_k x^{(i)}}}{\sum^K_{l=1}e^{\theta^T_l x^{(i)}}} $$

And rearranging gives:
$$ J( \theta ) = - \sum^m_{i=1} \sum^K_{k=1} 1 \{y^{(i)}=k\}\Big[\log e^{\theta^T_kx^{(i)}}-\log \sum^K_{l=1}e^{\theta^T_lx^{(i)}}\Big] $$
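
If it helps, here is a quick numerical sanity check (a NumPy sketch with made-up shapes; none of the names come from the tutorial) that the rearranged form really equals the original ratio form:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 5, 3, 4
X = rng.normal(size=(m, n))                 # m samples x^(i) with n features
y = rng.integers(0, K, size=m)              # integer class labels in {0, ..., K-1}
theta = rng.normal(size=(K, n))             # row k is theta_k

scores = X @ theta.T                        # entry (i, k) is theta_k^T x^(i)

# original form: -sum_i log( e^{theta_y^T x} / sum_l e^{theta_l^T x} )
P = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
J_ratio = -np.sum(np.log(P[np.arange(m), y]))

# rearranged form: -sum_i [ theta_y^T x - log sum_l e^{theta_l^T x} ]
J_split = -np.sum(scores[np.arange(m), y] - np.log(np.exp(scores).sum(axis=1)))

print(np.allclose(J_ratio, J_split))        # True
```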

The gradient of $J$ with respect to $\theta_k$ is (treat $1 \{y^{(i)}=k\}$ as a constant, and split the terms with $y^{(i)} = k$ from those with $y^{(i)} \neq k$: the former depend on $\theta_k$ through both $\log e^{\theta^T_kx^{(i)}}$ and the log-sum, the latter only through the log-sum):

$$ \bigtriangledown_{ \theta_k }J( \theta ) = - \sum^m_{i=1} \Bigg( 1 \{y^{(i)}=k\}\Big[x^{(i)}-\underbrace{\frac{e^{\theta^T_kx^{(i)}}}{\sum^K_{l=1}e^{\theta^T_lx^{(i)}}}}_{p(y^{(i)} = k \mid x^{(i)} ; \theta)}x^{(i)}\Big] + 1\{y^{(i)} \neq k\}\Big[-\underbrace{\frac{e^{\theta^T_kx^{(i)}}}{\sum^K_{l=1}e^{\theta^T_lx^{(i)}}}}_{p(y^{(i)} = k \mid x^{(i)} ; \theta)}x^{(i)}\Big] \Bigg) $$

Note that when differentiating $\sum^K_{l=1}e^{\theta^T_lx^{(i)}}$ with respect to $\theta_k$, only the $l = k$ term depends on $\theta_k$, so $\bigtriangledown_{ \theta_k} \sum^K_{l=1}e^{\theta^T_lx^{(i)}} = e^{\theta^T_kx^{(i)}}x^{(i)}$.
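
That claim is easy to check numerically with finite differences (again just a NumPy sketch with arbitrary small shapes; `denom` is a throwaway name):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, k = 3, 4, 2
x = rng.normal(size=n)
theta = rng.normal(size=(K, n))             # row l is theta_l

def denom(theta):
    return np.sum(np.exp(theta @ x))        # sum_l e^{theta_l^T x}

analytic = np.exp(theta[k] @ x) * x         # claimed gradient w.r.t. theta_k

eps = 1e-6
numeric = np.zeros(n)
for j in range(n):                          # central differences on the entries of theta_k
    tp, tm = theta.copy(), theta.copy()
    tp[k, j] += eps
    tm[k, j] -= eps
    numeric[j] = (denom(tp) - denom(tm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))   # True
```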

Then we substitute $p$ back, and since $1\{y^{(i)}=k\} + 1\{y^{(i)} \neq k\} = 1$, the two cases combine into:

$$ \bigtriangledown_{ \theta_k }J( \theta ) = -\sum^{m}_{i=1} [x^{(i)} (1 \{ y^{(i)} = k \} - p(y^{(i)} = k \mid x^{(i)} ; \theta) ) ] $$
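
As a final sanity check, this formula agrees with a central finite-difference approximation of $J$ (NumPy sketch, arbitrary small shapes, all names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, K, k = 6, 3, 4, 1
X = rng.normal(size=(m, n))
y = rng.integers(0, K, size=m)
theta = rng.normal(size=(K, n))             # row k is theta_k

def J(theta):
    scores = X @ theta.T
    scores = scores - scores.max(axis=1, keepdims=True)      # stabilised log-softmax
    logp = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.sum(logp[np.arange(m), y])

# analytic gradient w.r.t. theta_k from the formula above
scores = X @ theta.T
P = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
analytic = -X.T @ ((y == k).astype(float) - P[:, k])

eps = 1e-6
numeric = np.zeros(n)
for j in range(n):                          # central differences on the entries of theta_k
    tp, tm = theta.copy(), theta.copy()
    tp[k, j] += eps
    tm[k, j] -= eps
    numeric[j] = (J(tp) - J(tm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))   # True
```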
