Logistic Regression – Result of Second-Order Beta Partial Derivative

logistic, propensity-scores

For the logistic regression model, the log-likelihood function can be written as
$$\ell(\beta)=\sum_{i=1}^N\left(y_i\beta^Tx_i-\log(1+e^{\beta^Tx_i})\right)$$
Setting its first derivative to zero gives $p+1$ equations that are nonlinear in $\beta$. To solve them, we use the Newton-Raphson algorithm, which requires the second derivative (Hessian)
$$\frac{\partial^2 \ell(\beta)}{\partial \beta \partial \beta^T}=-\sum_{i=1}^Nx_ix_i^T\,p(x_i;\beta)(1-p(x_i;\beta)) \quad (*)$$
where $p(x_i;\beta)=1-\frac{1}{1+e^{\beta^Tx_i}}=\frac{e^{\beta^Tx_i}}{1+e^{\beta^Tx_i}}$.

My question is how to get (*)?

I know that
$$\frac{\partial \ell}{\partial \beta }=\sum_i (y_ix_i-\frac{e^{\beta^Tx_i}x_i}{1+e^{\beta^Tx_i}})\in R^{(p+1)\times 1}$$

But what is $$\frac{\partial^2 \ell(\beta)}{\partial \beta \partial \beta^T}?$$

I get
$$\frac{\partial^2 \ell(\beta)}{\partial \beta \partial \beta^T}=\frac{\partial}{\partial \beta^T }(\sum_i (y_ix_i-\frac{e^{\beta^Tx_i}x_i}{1+e^{\beta^Tx_i}}))=?$$

Best Answer

A very good resource for many of the differentiation rules needed to derive these statistical methods is The Matrix Cookbook. In particular, I find it very helpful for keeping track of what does and does not get transposed when taking derivatives. In your case, we can take the ideas behind @Demetri Pananos's answer to their conclusion.

Let us instead work with $p(z) \equiv \frac{e^z}{1+e^z}$. First, note that we have $p'(z) = p(z) (1-p(z))$. Thinking about $p(z)$ as a probability gives a nice intuition for what this derivative corresponds to: it is nothing more than the variance of a Bernoulli random variable with success probability $p(z)$! This intuition is so important that I will henceforth write $v(z) = p(z)(1-p(z))$. In other words, when we plug in $z=\beta^T x_i$ and $\beta$ is the true value, this derivative is precisely the variance of a single observation conditional on $x_i$. Now, letting $f_i:\mathbb R\to \mathbb R^{p+1}$ be given by $f_i(z) = p(z) x_i$, we have $f_i'(z) = v(z) x_i$, so combining this with the chain rule, we have $\frac{\partial\, p(\beta^T x_i)x_i}{\partial \beta^T} = v(\beta^T x_i)\cdot x_i x_i^T$. Since the score is $\sum_i (y_i - p(\beta^T x_i))x_i$ and the $y_i x_i$ term does not depend on $\beta$, we conclude that $$\frac{\partial^2 \ell}{\partial \beta \partial \beta^T} = -\sum_i v(\beta^T x_i)\cdot x_i x_i^T$$ which is exactly (*).
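If it helps, here is a minimal numerical sketch of that conclusion. The simulated data and the helper names (`loglik`, `hessian`, etc.) are illustrative assumptions, not part of the original derivation; the idea is just to compare the analytic Hessian $-\sum_i v(\beta^Tx_i)\,x_ix_i^T$ against a finite-difference Hessian of $\ell$.

```python
import numpy as np

# Sketch: compare the analytic Hessian of the logistic log-likelihood with a
# finite-difference Hessian. Simulated data and names are illustrative only.

def p(z):
    """Logistic function p(z) = e^z / (1 + e^z)."""
    return 1.0 / (1.0 + np.exp(-z))

def loglik(beta, X, y):
    """l(beta) = sum_i [ y_i beta^T x_i - log(1 + e^{beta^T x_i}) ]."""
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

def hessian(beta, X):
    """Analytic Hessian: -sum_i v(beta^T x_i) x_i x_i^T, with v = p(1 - p)."""
    mu = p(X @ beta)
    v = mu * (1.0 - mu)                      # Bernoulli variances v(beta^T x_i)
    return -(X * v[:, None]).T @ X

# Simulated data: N observations, an intercept plus two covariates.
rng = np.random.default_rng(0)
N, k = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
y = rng.binomial(1, p(X @ rng.normal(size=k)))

# Central-difference Hessian of l at an arbitrary beta.
beta0, h = np.zeros(k), 1e-4
H_fd = np.zeros((k, k))
for a in range(k):
    for b in range(k):
        ea, eb = h * np.eye(k)[a], h * np.eye(k)[b]
        H_fd[a, b] = (loglik(beta0 + ea + eb, X, y) - loglik(beta0 + ea - eb, X, y)
                      - loglik(beta0 - ea + eb, X, y) + loglik(beta0 - ea - eb, X, y)) / (4 * h**2)

print(np.max(np.abs(H_fd - hessian(beta0, X))))  # expected to be tiny
```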

Rather than dwelling on the details of the computation, though, I think it is helpful to get some intuition for why the resulting Newton's method algorithm is sensible not just from a computational point of view but also from a statistical point of view. A few observations. First, let us rewrite the score function as $$\frac{\partial \ell}{\partial \beta^T} = \sum_i \underbrace{(y_i - p(\beta^T x_i))}_{\text{Residuals}}\cdot\underbrace{x_i^T}_{\text{Regressors}}$$ The first-order condition of logistic regression therefore has the same form (and stems from the same basic intuition) as the first-order condition for OLS: for an "optimal" predictor, your regressors cannot still be correlated with your residuals, since otherwise you could move $\beta$ in the direction of that correlation to get an even better predictor.
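To make that orthogonality concrete, here is a small sketch; the simulated data and the `neg_loglik` helper are assumptions for illustration. At the fitted $\hat\beta$ the residuals $y_i - p(\hat\beta^Tx_i)$ are (numerically) orthogonal to the regressors.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: at the fitted beta, residuals y_i - p(beta^T x_i) are orthogonal to
# the regressors, i.e. the score is ~0. Data and names are illustrative.

def p(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_loglik(beta, X, y):
    z = X @ beta
    return -np.sum(y * z - np.log1p(np.exp(z)))

rng = np.random.default_rng(1)
N, k = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
y = rng.binomial(1, p(X @ rng.normal(size=k)))

beta_hat = minimize(neg_loglik, np.zeros(k), args=(X, y)).x

# First-order condition: sum_i (y_i - p(beta_hat^T x_i)) x_i ~ 0
print(X.T @ (y - p(X @ beta_hat)))  # each entry close to zero
```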

The only tricky aspect of the score function is that $p$ introduces a non-linearity, but by taking a Taylor approximation, we can temporarily ignore this issue. Specifically, let $\beta_j$ be a guess of $\beta$ which is not necessarily yet optimal, in the sense that $\frac{\partial \ell}{\partial \beta}\big|_{\beta = \beta_j} \neq 0$. Let us now take a Taylor approximation to determine how different values of $\beta_{j+1}$ might improve our situation: $$\begin{aligned}\frac{\partial \ell}{\partial \beta}\Bigg |_{\beta = \beta_{j+1}} &\approx \sum_i \left[y_i - p(\beta_j^T x_i) - v(\beta_j^T x_i)\,(\beta_{j+1} - \beta_j)^Tx_i\right] x_i^T\\ & = \sum_i \underbrace{v(\beta_j^T x_i)}_{\text{"Weights"}}\cdot \underbrace{\left[\frac{y_i - p(\beta_j^T x_i)}{v(\beta_j^T x_i)} - (\beta_{j+1} - \beta_j)^Tx_i\right]}_{\text{"Approximate Residuals"}} \cdot \underbrace{x_i^T}_{\text{Regressors}}\end{aligned}$$ Thus, to a first-order approximation, our update to $\beta$, $\beta_{j+1} - \beta_j$, is given by the solution to a particular weighted least squares problem, whose well-known solution is $$\begin{aligned}\beta_{j+1}-\beta_j = \left(\sum_i \underbrace{x_ix_i^T}_{\text{"Regressor squared"}}\underbrace{v(\beta_j^T x_i)}_{\text{"Weights"}}\right)^{-1}\left(\sum_i \underbrace{x_i\frac{y_i - p(\beta_j^T x_i)}{v(\beta_j^T x_i)}}_{\text{"Outcome times regressor"}}\underbrace{v(\beta_j^T x_i)}_{\text{"Weights"}}\right)\end{aligned}$$ You can check that this weighted least squares view exactly recovers the Newton's method iteration. Moreover, note that the weights of the weighted regression are quite sensible: they are proportional to the predicted variance of $y_i$ given $x_i$, and hence correspond to the intuition that at each step we want to spend proportionately more of our effort getting it right for those observations that we think are especially hard to predict.
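For completeness, here is a sketch of that weighted least squares iteration (again with simulated data and illustrative names). Each pass solves exactly the weighted regression displayed above, and the resulting step is the Newton update.

```python
import numpy as np

# Sketch of the weighted-least-squares (IRLS) form of the Newton step.
# The simulated data and variable names are illustrative assumptions.

def p(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
N, k = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
y = rng.binomial(1, p(X @ rng.normal(size=k)))

beta = np.zeros(k)                     # beta_j, starting from zero
for _ in range(25):
    mu = p(X @ beta)                   # p(beta_j^T x_i)
    v = mu * (1.0 - mu)                # "Weights" v(beta_j^T x_i)
    r = (y - mu) / v                   # "Outcome" of the weighted regression
    # Weighted least squares solution for the update beta_{j+1} - beta_j
    XtVX = (X * v[:, None]).T @ X      # sum_i x_i x_i^T v(beta_j^T x_i)
    XtVr = (X * v[:, None]).T @ r      # sum_i x_i r_i v(beta_j^T x_i)
    step = np.linalg.solve(XtVX, XtVr)
    beta = beta + step                 # this is exactly the Newton update
    if np.max(np.abs(step)) < 1e-10:
        break

# At convergence the score sum_i (y_i - p(beta^T x_i)) x_i is ~ 0.
print(beta, np.max(np.abs(X.T @ (y - p(X @ beta)))))
```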
