Solved – Matrix notation for logistic regression

Tags: linear model, logistic, notation, regression

In linear regression (squared loss), using matrix notation we have a very concise expression for the objective

$$\text{minimize}~~ \|Ax-b\|^2$$

where $A$ is the data matrix, $x$ is the coefficient vector, and $b$ is the response vector.

Is there a similar matrix notation for the logistic regression objective? All the notations I have seen still carry the sum over all data points (something like $\sum_{\text{data}} \text{L}_\text{logistic}(y,\beta^Tx)$).


EDIT: Thanks to joceratops and AdamO for their great answers. Their answers helped me realize that another reason linear regression has a more concise notation is the definition of the norm, which encapsulates the square and the sum, i.e., $e^\top e$. The logistic loss has no such definition, which makes the notation a bit more complicated.

Best Answer

In linear regression the Maximum Likelihood Estimation (MLE) solution for estimating $x$ has the following closed-form solution (assuming that $A$ is a matrix with full column rank):

$$\hat{x}_\text{lin}=\underset{x}{\text{argmin}} \|Ax-b\|_2^2 = (A^TA)^{-1}A^Tb$$

This is read as "find the $x$ that minimizes the objective function, $\|Ax-b\|_2^2$". The nice thing about representing the linear regression objective function in this way is that we can keep everything in matrix notation and solve for $\hat{x}_\text{lin}$ by hand. As Alex R. mentions, in practice we often don't consider $(A^TA)^{-1}$ directly because it is computationally inefficient and $A$ often does not meet the full rank criteria. Instead, we turn to the Moore-Penrose pseudoinverse. The details of computationally solving for the pseudoinverse can involve the Cholesky decomposition or the singular value decomposition (SVD).
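For concreteness, here is a minimal NumPy sketch (not part of the original answer; the data is randomly generated purely for illustration) comparing the normal-equations solution with the pseudoinverse and the library least-squares routine that are preferred in practice:

```python
# Sketch: closed-form linear regression via normal equations vs. SVD-based routines.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))        # data matrix (full column rank here)
b = rng.normal(size=100)             # response vector

# Closed-form MLE via the normal equations (fine for well-conditioned A)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Moore-Penrose pseudoinverse (computed via the SVD)
x_pinv = np.linalg.pinv(A) @ b

# Library least-squares solver (also SVD-based, numerically preferred)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_pinv), np.allclose(x_pinv, x_lstsq))
```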

In contrast, the MLE solution for estimating the coefficients in logistic regression is:

$$\hat{x}_\text{log} = \underset{x}{\text{argmin}} \sum_{i=1}^{N} y^{(i)}\log(1+e^{-x^Ta^{(i)}}) + (1-y^{(i)})\log(1+e^{x^T a^{(i)}})$$

where (assuming each sample of data is stored row-wise):

$x$ is a vector of regression coefficients

$a^{(i)}$ is a vector representing the $i^{th}$ sample/row of the data matrix $A$

$y^{(i)}$ is a scalar in $\{0, 1\}$, the label corresponding to the $i^{th}$ sample

$N$ is the number of data samples / the number of rows of the data matrix $A$.

Again, this is read as "find the $x$ that minimizes the objective function".
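To make the point about the lingering sum concrete, here is a small NumPy sketch (my own, not from the answer) of the objective above: the per-sample terms can be computed in a vectorized way, but a sum over the $N$ samples remains.

```python
# Sketch: the logistic regression negative log-likelihood, vectorized over samples.
import numpy as np

def logistic_objective(x, A, y):
    """sum_i y_i*log(1+exp(-a_i^T x)) + (1-y_i)*log(1+exp(a_i^T x))."""
    z = A @ x                                  # z_i = x^T a^{(i)} for every row a^{(i)}
    return np.sum(y * np.log1p(np.exp(-z)) + (1 - y) * np.log1p(np.exp(z)))

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))                   # samples stored row-wise, as assumed above
y = rng.integers(0, 2, size=50)                # labels in {0, 1}
x = rng.normal(size=3)                         # some coefficient vector
print(logistic_objective(x, A, y))
```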

If you wanted to, you could take it a step further and represent $\hat{x}_\text{log}$ in matrix notation as follows:

$$ \hat{x}_\text{log} = \underset{x}{\text{argmin}} \begin{bmatrix} y^{(1)} & (1-y^{(1)}) & \dots & y^{(N)} & (1-y^{(N)}) \end{bmatrix} \begin{bmatrix} \log(1+e^{-x^Ta^{(1)}}) \\ \log(1+e^{x^Ta^{(1)}}) \\ \vdots \\ \log(1+e^{-x^Ta^{(N)}}) \\ \log(1+e^{x^Ta^{(N)}}) \end{bmatrix} $$

but you don't gain anything from doing this. Logistic regression does not have a closed-form solution and does not gain the same benefits from matrix notation that linear regression does. To solve for $\hat{x}_\text{log}$, estimation techniques such as gradient descent and the Newton-Raphson method are used. With some of these techniques (e.g. Newton-Raphson), $\hat{x}_\text{log}$ is approximated iteratively, and each update is itself represented in matrix notation (see the link provided by Alex R.).
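As a rough illustration of the iterative route, here is a bare-bones Newton-Raphson sketch for logistic regression. It is a toy written under simplifying assumptions (no regularization, no step-size safeguards), not a production solver, and it is not taken from the original answer.

```python
# Sketch: Newton-Raphson updates for the logistic regression objective above.
import numpy as np

def fit_logistic_newton(A, y, n_iter=25, tol=1e-10):
    N, d = A.shape
    x = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ x))       # predicted probabilities sigma(x^T a^(i))
        g = A.T @ (p - y)                      # gradient of the objective
        W = p * (1.0 - p)                      # diagonal Hessian weights
        H = A.T @ (A * W[:, None])             # Hessian: A^T W A
        step = np.linalg.solve(H, g)
        x -= step                              # Newton step: x <- x - H^{-1} g
        if np.linalg.norm(step) < tol:
            break
    return x

rng = np.random.default_rng(2)
A = rng.normal(size=(200, 3))
true_x = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-A @ true_x))).astype(float)
print(fit_logistic_newton(A, y))               # should land near true_x
```

Each update $x \leftarrow x - (A^TWA)^{-1}A^T(p-y)$ is the matrix-notation form referred to above: the iteration itself is concise, even though the objective is not.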