Derivative of log-likelihood function in softmax regression


I'm trying to find the derivative of the log-likelihood function in softmax regression. With $\Theta$ denoting the parameters, $x^{(i)}$ the $i$th training example, and $s_j$ the $j$th softmax probability $s_j = e^{\Theta_j^T x^{(i)}} \big/ \sum_{l=1}^k e^{\Theta_l^T x^{(i)}}$, I have

$$\ell(\Theta) = \sum_{i=1}^m \sum_{j=1}^k \log \left( \frac{e^{\Theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{\Theta_l^T x^{(i)}}} \right)^{I\{y^{(i)}=j\}}$$
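Since the indicator appears as an exponent, only the $j = y^{(i)}$ term of the inner sum survives, so equivalently
$$\ell(\Theta) = \sum_{i=1}^m \log s_{y^{(i)}}\left(x^{(i)}\right).$$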
I got the derivative of the softmax function itself as
$$\frac{\partial}{\partial \Theta_p} \left( \frac{e^{\Theta^T_j x^{(i)}}}{\sum_{l=1}^k e^{\Theta^T_l x^{(i)}}} \right)=s_j(\delta_{pj}-s_p)x^{(i)}$$
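As a sanity check, this Jacobian formula can be verified numerically. Here is a minimal NumPy sketch (the dimensions, random seed, and the `softmax` helper are arbitrary choices for illustration, not part of the problem):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical small problem: k classes, d features.
rng = np.random.default_rng(0)
k, d = 4, 3
Theta = rng.normal(size=(k, d))  # row j is Theta_j
x = rng.normal(size=d)

s = softmax(Theta @ x)  # s_j evaluated at x

# Analytic derivative of s_j w.r.t. Theta_p: s_j (delta_pj - s_p) x
j, p = 1, 2
analytic = s[j] * ((1.0 if p == j else 0.0) - s[p]) * x

# Central-difference approximation of the same derivative.
eps = 1e-6
numeric = np.empty(d)
for r in range(d):
    Tp = Theta.copy(); Tp[p, r] += eps
    Tm = Theta.copy(); Tm[p, r] -= eps
    numeric[r] = (softmax(Tp @ x)[j] - softmax(Tm @ x)[j]) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))  # True
```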
On using this to find the derivative of the log-likelihood, I get
$$\begin{aligned}\frac{\partial}{\partial \Theta_p}\ell(\Theta) &= \sum_{i=1}^m \sum_{j=1}^k I\{y^{(i)}=j\} \cdot \frac{s_j(\delta_{pj}-s_p)x^{(i)}}{s_j} \\ &= \sum_{i=1}^m \sum_{j=1}^k I\{y^{(i)}=j\} \cdot (\delta_{pj}-s_p)x^{(i)} \\ &= \sum_{i=1}^m \sum_{j=1}^k I\{y^{(i)}=p\}x^{(i)}-I\{y^{(i)}=j\}s_p x^{(i)} \end{aligned} $$
I'm not sure where to go from here. From what I've seen online, I shouldn't have the second summation at all, and the second term should just be $s_p$.

I'm sure I'm just missing a step or two, but I'd love it if someone could help me move forward.

Best Answer

You are correct up to the second line of the working in your last display; the error comes when you drop the requirement that $j=p$ from the Kronecker delta, which is why an extra sum over $j$ survives. Continuing from your last correct step, you should have:

$$\begin{aligned} \frac{\partial \ell}{\partial \Theta_p}(\Theta) &= \sum_{i=1}^m \sum_{j=1}^k \mathbb{I}(y^{(i)}=j) (\delta_{pj}-s_p) x^{(i)} \\[6pt] &= \sum_{i=1}^m x^{(i)} \sum_{j=1}^k \mathbb{I}(y^{(i)}=j) (\delta_{pj}-s_p) \\[6pt] &= \sum_{i=1}^m x^{(i)} \bigg[ \sum_{j=1}^k \mathbb{I}(y^{(i)}=j) \, \mathbb{I}(p=j) - s_p \sum_{j=1}^k \mathbb{I}(y^{(i)}=j) \bigg] \\[6pt] &= \sum_{i=1}^m \Big( \mathbb{I}(y^{(i)}=p) - s_p \Big) x^{(i)}. \end{aligned}$$

(The last step uses $\sum_{j=1}^k \mathbb{I}(y^{(i)}=j)\,\mathbb{I}(p=j) = \mathbb{I}(y^{(i)}=p)$ and $\sum_{j=1}^k \mathbb{I}(y^{(i)}=j) = 1$ for all $i = 1, \ldots, m$. Note also that $s_p$ depends on $i$ through $x^{(i)}$, i.e. $s_p = s_p(x^{(i)})$, so it cannot be pulled outside the sum over $i$; the expression above is the final form.)
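If you want to double-check the closed form, it is straightforward to compare it against a central-difference approximation. Below is a minimal NumPy sketch; the problem sizes, seed, and helper names (`log_likelihood`, `grad_p`) are illustrative assumptions rather than anything from the derivation above:

```python
import numpy as np

def softmax(Z):
    """Row-wise numerically stable softmax."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(Theta, X, y):
    """ell(Theta) = sum_i log s_{y_i}(x^{(i)})."""
    S = softmax(X @ Theta.T)  # m x k matrix; row i holds s_j(x^{(i)})
    return np.log(S[np.arange(len(y)), y]).sum()

def grad_p(Theta, X, y, p):
    """Closed form above: sum_i (I{y_i = p} - s_p(x^{(i)})) x^{(i)}."""
    S = softmax(X @ Theta.T)
    return ((y == p).astype(float) - S[:, p]) @ X

rng = np.random.default_rng(0)
m, k, d = 50, 4, 3
X = rng.normal(size=(m, d))
y = rng.integers(k, size=m)
Theta = rng.normal(size=(k, d))

# Central-difference check of the gradient w.r.t. Theta_p.
p, eps = 2, 1e-6
numeric = np.empty(d)
for r in range(d):
    Tp = Theta.copy(); Tp[p, r] += eps
    Tm = Theta.copy(); Tm[p, r] -= eps
    numeric[r] = (log_likelihood(Tp, X, y) - log_likelihood(Tm, X, y)) / (2 * eps)

print(np.allclose(grad_p(Theta, X, y, p), numeric, atol=1e-6))  # True
```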