[Math] Deriving the gradient vector of a Probit model

derivatives, matrices, regression, vector analysis

Consider the probit regression model where the probability mass function of $y_{i}$ is
$$f(y_{i};\mathbf{\beta}) = \mu_{i}^{y_{i}}\left ( 1 - \mu_{i} \right )^{1 - y_{i}},$$ where $y_{i} \in \left \{ 0,1 \right \}$ and $\mu_{i} = \Phi \left ( \mathbf{x_{i}' \beta} \right )$.

Note: $\Phi$ denotes the standard normal cumulative distribution function, which keeps the probability that $y_{i} = 1$ within $\left [ 0,1 \right ]$, and $\mathbf{\beta}$ is a vector of coefficients.

I want to find the log-likelihood function and derive its gradient.

Writing in the log-likelihood form is quite straightforward:
$$\log L = \sum_{i=1}^{n} \left [ y_{i}\log\Phi\left ( \mathbf{x_{i}' \beta} \right ) + \left ( 1 - y_{i} \right )\log\left ( 1 - \Phi\left ( \mathbf{x_{i}' \beta} \right ) \right )\right ] $$
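(For concreteness, here is a minimal numerical sketch of this log-likelihood; the arrays `X`, `y`, and `beta` are hypothetical examples, and `scipy.stats.norm` supplies $\Phi$.)

```python
# Minimal sketch of the probit log-likelihood above (hypothetical data, not from the post).
import numpy as np
from scipy.stats import norm

def probit_loglik(beta, X, y):
    """log L = sum_i [ y_i log Phi(x_i'beta) + (1 - y_i) log(1 - Phi(x_i'beta)) ]."""
    eta = X @ beta                                  # x_i' beta for every observation
    # log(1 - Phi(eta)) = log Phi(-eta) by symmetry of the normal CDF
    return np.sum(y * norm.logcdf(eta) + (1 - y) * norm.logcdf(-eta))
```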

Next we apply the fact that $$\frac{\partial \Phi(a)}{\partial a} = \phi (a),$$
where $\phi$ is the standard normal density function.

The gradient vector can be derived as follows
$$\frac{\partial \log L}{\partial \mathbf{\beta}} = \sum_{i=1}^{n}\left [ y_{i}\frac{\phi\left ( \mathbf{x_{i}' \beta} \right )}{\Phi\left ( \mathbf{x_{i}' \beta} \right )} - \left ( 1 - y_{i} \right )\frac{\phi\left ( \mathbf{x_{i}' \beta} \right )}{1 - \Phi\left ( \mathbf{x_{i}' \beta} \right )} \right ]\mathbf{x_{i}}$$
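(As a sanity check, the sketch below, using simulated hypothetical data, confirms that this expression matches a finite-difference approximation of the log-likelihood gradient.)

```python
# Sketch of the gradient above, written term by term, plus a finite-difference check.
import numpy as np
from scipy.stats import norm

def probit_gradient(beta, X, y):
    grad = np.zeros_like(beta)
    for xi, yi in zip(X, y):                        # one term of the sum per observation
        eta = xi @ beta                             # scalar x_i' beta
        w = yi * norm.pdf(eta) / norm.cdf(eta) \
            - (1 - yi) * norm.pdf(eta) / (1 - norm.cdf(eta))
        grad += w * xi                              # scalar weight times the vector x_i
    return grad

# Hypothetical simulated data: intercept plus two covariates.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
beta_true = np.array([0.3, -0.5, 1.0])
y = (rng.uniform(size=200) < norm.cdf(X @ beta_true)).astype(float)

loglik = lambda b: np.sum(y * norm.logcdf(X @ b) + (1 - y) * norm.logcdf(-(X @ b)))
beta = np.array([0.1, 0.2, 0.3])
eps = 1e-6
numeric = np.array([(loglik(beta + eps * e) - loglik(beta - eps * e)) / (2 * eps)
                    for e in np.eye(len(beta))])
print(np.allclose(probit_gradient(beta, X, y), numeric, atol=1e-4))    # True
```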

Now what I don't understand is how the $\mathbf{x_{i}}$ is sticking out at the end. If I differentiate $\Phi\left ( \mathbf{x_{i}' \beta} \right )$ w.r.t. $\mathbf{\beta}$, shouldn't I be getting $\mathbf{x_{i}'}$ (a row vector) from the chain rule, rather than the column vector $\mathbf{x_{i}}$?

Btw, to be clear with notation $\mathbf{x_{i}'}$ is the transpose of $\mathbf{x_{i}}$.

On a side note, I know that my question is very similar to "Log-likelihood gradient and Hessian", but it is difficult to follow what is going on there because of the different notation.

Best Answer

Note that $\mathbf{x_i}'\mathbf{\beta} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$, hence when you take the derivative with respect to a single coefficient $\beta_k$ only $x_{ik}$ comes out of the chain rule, i.e., the $(k+1)$-th entry of the gradient is $$ (\nabla \log L)_k = \sum_i y_i \frac{\phi(\cdot)}{\Phi(\cdot)}\,x_{ik} - \sum_i (1 - y_i) \frac{ \phi( \cdot ) }{ 1 - \Phi(\cdot)}\, x_{ik}, $$ where $(\cdot)$ stands for $\mathbf{x_i}'\mathbf{\beta}$. You have $p+1$ such expressions (one per coefficient, including the intercept), each weighted by the corresponding variable $x_{ik}$, $k=0,\dots,p$, so the gradient is a vector of size $p+1$; setting them all to zero gives the likelihood equations for the MLE. The $k$-th entry is a dot product
$$ \left[y_1 \frac{\phi(\cdot)}{\Phi(\cdot)} - (1 - y_1) \frac{ \phi( \cdot ) }{ 1 - \Phi(\cdot)}, \;\dots,\; y_n \frac{\phi(\cdot)}{\Phi(\cdot)} - (1 - y_n) \frac{ \phi( \cdot ) }{ 1 - \Phi(\cdot)}\right]_{1 \times n} \left[x_{1k},\dots,x_{nk} \right]^{T}_{ n \times 1}, $$ where in the design matrix $X$ the variables $\mathbf{x}_k$ are stored as columns, i.e., $X = [\mathbf{1}, \mathbf{x}_1, \dots, \mathbf{x}_p]$, which is exactly what you need: $n$ rows (observations) and $p+1$ columns. Collecting all $p+1$ entries at once, each term of the sum is multiplied by the column vector $\mathbf{x}_i = [1, x_{i1}, \dots, x_{ip}]^{T}$, which is the $\mathbf{x}_i$ sticking out at the end of your formula.
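(Equivalently, stacking the per-observation weights into a length-$n$ vector $w$ collapses all $p+1$ dot products above into a single matrix product $X^{T}w$. A minimal NumPy sketch, with the same hypothetical arrays as in the earlier sketches:)

```python
# The p+1 dot products above are one matrix product: gradient = X^T w, where
# w_i = y_i*phi(eta_i)/Phi(eta_i) - (1 - y_i)*phi(eta_i)/(1 - Phi(eta_i)).
import numpy as np
from scipy.stats import norm

def probit_gradient_vectorized(beta, X, y):
    eta = X @ beta                                   # length-n vector of x_i' beta
    w = y * norm.pdf(eta) / norm.cdf(eta) \
        - (1 - y) * norm.pdf(eta) / (1 - norm.cdf(eta))
    return X.T @ w                                   # (p+1) x n times n x 1 -> p+1 entries
```

Its $k$-th entry is exactly the row-times-column product written out above, with column $k$ of $X$ supplying $[x_{1k}, \dots, x_{nk}]^{T}$.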
