Derive equation for regularized logistic regression with batch updates

bayesian-probability, machine-learning, st.statistics

I am trying to understand this paper by Chapelle and Li, "An Empirical Evaluation of Thompson Sampling" (2011). In particular, I am failing to derive the equations in Algorithm 3 (page 6). The first equation looks like the NLL of $p(x|w) \, p(w)$, where the latter is the prior shown in the second line of the algorithm, modulo a constant term that does not depend on $w$. The question is: what is the likelihood? It is obviously not the canonical cross-entropy with $y_j \in \{0, 1\}$, but it looks almost like a cross-entropy with $y_j \in \{-1, +1\}$. Is that right?
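To make my guess concrete, here is how I believe the two label conventions are related, assuming the logistic sigmoid $\sigma(z) = 1/(1 + \mathrm{e}^{-z})$ and writing $\tilde y_j \in \{0, 1\}$ for the canonical labels (both $\sigma$ and $\tilde y_j$ are my notation, not the paper's):
$$\begin{align*} -\log p(y_j | \boldsymbol{x}_j, \boldsymbol{w}) &= -\tilde y_j \log \sigma(\boldsymbol{w}^\top \boldsymbol{x}_j) - (1 - \tilde y_j) \log\left(1 - \sigma(\boldsymbol{w}^\top \boldsymbol{x}_j)\right) \\ &= \log\left(1 + \mathrm{e}^{-y_j \boldsymbol{w}^\top \boldsymbol{x}_j}\right) \qquad \text{with } y_j = 2 \tilde y_j - 1 \in \{-1, +1\}, \end{align*}$$
using $1 - \sigma(z) = \sigma(-z)$ and $-\log \sigma(z) = \log(1 + \mathrm{e}^{-z})$.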

Furthermore, I don't understand the update step for $q_j$ in the last line: I can derive something that comes close using the Laplace approximation,
$$q_j \leftarrow -\frac{\partial^2}{\partial w_j^2} \ln p(x, w),$$
and discarding correlations… but it is not the same and there are still some $y_j$ and other terms floating around.

Can someone tell me, how to derive these equations?

Thanks a lot!

Best Answer

I found the solution (with the help of a friend: kudos!). The posterior is
$$\begin{align*} -\log p(\boldsymbol{w}|\boldsymbol{x}) &= -\log p(\boldsymbol{x}|\boldsymbol{w}) - \log p(\boldsymbol{w}) + \text{const.} \\ &= \sum\limits_j \log \left( 1 + \exp(-y_j \boldsymbol{w}^\top \boldsymbol{x}_j) \right) + \sum\limits_i \frac{q_i (w_i - m_i)^2}{2} + \text{const.}' \end{align*}$$
where the constant terms do not depend on $\boldsymbol{w}$. The likelihood enters through the NLL
$$\begin{align*} -\log p(\boldsymbol{x}|\boldsymbol{w}) &= \sum_j \log \left[ \frac{1 + y_j}{2} \left( 1 + \mathrm{e}^{-\boldsymbol{w}^\top \boldsymbol{x}_j} \right) + \frac{1 - y_j}{2} \left( 1 + \mathrm{e}^{+\boldsymbol{w}^\top \boldsymbol{x}_j} \right) \right] \\ &= \sum\limits_j \log \left( 1 + \mathrm{e}^{-y_j\boldsymbol{w}^\top \boldsymbol{x}_j} \right) \end{align*}$$
for $y_j \in \{-1, +1\}$ (!), because the prefactors $\frac{1 \pm y_j}{2}$ simply select one of the two terms inside the logarithm.

Ignoring correlations (i.e. keeping only the diagonal of the Hessian), the Laplace approximation around the posterior mode then yields
$$\begin{align*} q_i &\leftarrow - \left. \frac{1}{p(\boldsymbol{\hat w} | \boldsymbol{x})} \frac{\partial^2 p(\boldsymbol{w} | \boldsymbol{x})}{\partial w_i^2} \right|_{\boldsymbol{w} = \boldsymbol{\hat w}} \\ &= \left. -\frac{\partial^2 \log p(\boldsymbol{w} | \boldsymbol{x})}{\partial w_i^2} \right|_{\boldsymbol{w} = \boldsymbol{\hat w}} \\ &= q_i + \sum\limits_j x_{ij}^2 \frac{1}{1 + \mathrm{e}^{-y_j \boldsymbol{\hat w}^\top \boldsymbol{x}_j}} \frac{1}{1 + \mathrm{e}^{+y_j \boldsymbol{\hat w}^\top \boldsymbol{x}_j}} \\ &= q_i + \sum\limits_j x_{ij}^2 \frac{1}{1 + \mathrm{e}^{-\boldsymbol{\hat w}^\top \boldsymbol{x}_j}} \frac{1}{1 + \mathrm{e}^{+\boldsymbol{\hat w}^\top \boldsymbol{x}_j}} \\ &= q_i + \sum\limits_j x_{ij}^2 \, p_j (1-p_j) \end{align*}$$
with $x_{ij} = (\boldsymbol{x}_j)_i$, $p_j = \left( 1 + \mathrm{e}^{-\boldsymbol{\hat w}^\top \boldsymbol{x}_j} \right)^{-1}$, and the MAP estimate
$$\boldsymbol{\hat w} = \underset{\boldsymbol{w}}{\operatorname{argmax}}\, p(\boldsymbol{w} | \boldsymbol{x}) \,.$$
The first and second lines agree because the first derivative of the posterior vanishes at the mode $\boldsymbol{\hat w}$, and the labels $y_j$ drop out between the third and fourth lines because the product of the two sigmoid factors is invariant under $y_j \to -y_j$.
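For completeness, here is a minimal NumPy/SciPy sketch of one such batch update as derived above (the function name `batch_update`, the choice of L-BFGS-B as optimizer, and all variable names are my own, not from the paper):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable logistic sigmoid


def batch_update(m, q, X, y):
    """One batch update for logistic regression with a diagonal Gaussian prior.

    m, q : prior means and precisions, 1-D arrays of length d
    X    : design matrix of shape (n, d), rows are the x_j
    y    : labels in {-1, +1}, shape (n,)
    Returns the posterior mode w_hat (the new means) and the updated precisions.
    """

    def objective(w):
        z = y * (X @ w)
        # sum_i q_i (w_i - m_i)^2 / 2  +  sum_j log(1 + exp(-y_j w^T x_j))
        return 0.5 * np.sum(q * (w - m) ** 2) + np.sum(np.logaddexp(0.0, -z))

    def grad(w):
        z = y * (X @ w)
        # gradient of the NLL term is -sum_j sigma(-z_j) y_j x_j
        return q * (w - m) - X.T @ (y * expit(-z))

    # MAP estimate: minimize the regularized NLL, starting at the prior mean
    w_hat = minimize(objective, x0=m, jac=grad, method="L-BFGS-B").x

    # diagonal Laplace approximation: q_i <- q_i + sum_j x_ij^2 p_j (1 - p_j)
    p = expit(X @ w_hat)                      # p_j = sigma(w_hat^T x_j)
    q_new = q + (X ** 2).T @ (p * (1.0 - p))
    return w_hat, q_new
```

Since all $q_i > 0$, the objective is strictly convex, so any starting point works; the prior mean $\boldsymbol{m}$ is just a convenient choice.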
