Solved – Information out of the hat matrix for logistic regression


It is clear to me, and well explained on multiple sites, what information the values on the diagonal of the hat matrix give for linear regression.

The hat matrix of a logistic regression model is less clear to me. Does it give the same information as the hat matrix in linear regression? This is the definition of the hat matrix I found in another CV thread (source 1):

$$H = V^{1/2} X (X'VX)^{-1} X' V^{1/2}$$

with $X$ the design matrix of predictor values and $V$ a diagonal matrix with entries $\pi(1-\pi)$, so that the diagonal of $V^{1/2}$ holds $\sqrt{\pi(1-\pi)}$.

Is it, in other words, also true here that the diagonal hat value of an observation reflects only the position of its covariates in the covariate space, and has nothing to do with the observed outcome of that observation?

This is written in the book "Categorical Data Analysis" by Agresti:

The greater an observation’s leverage, the greater its potential influence on the fit. As in ordinary regression, the leverages fall between 0 and 1 and sum to the number of model parameters. Unlike ordinary regression, the hat values depend on the fit as well as the model matrix, and points that have extreme predictor values need not have high leverage.

So given this description, it seems we cannot use it the way we use it in ordinary linear regression?

Source 1: How to calculate the hat matrix for logistic regression in R?

Best Answer

Let me change the notation a bit and write the hat matrix as $$H = V^{\frac{1}{2}}X(X'VX)^{-1}X'V^{\frac{1}{2}}$$ where $V$ is a diagonal matrix with general elements $v_j = m_j \pi (x_j) \left[1 - \pi (x_j) \right]$, and $m_j$ denotes the number of individuals sharing the same covariate value $x = x_j$. The $j^{th}$ diagonal element ($h_j$) of the hat matrix is then $$h_j = m_j \pi (x_j) \left[1 - \pi (x_j) \right] x'_j (X'VX)^{-1}x_j$$ and the sum of the $h_j$ gives the number of parameters, as in linear regression. Now to your question:
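As a quick numerical check, here is a minimal Python sketch (an illustrative choice on my part, using numpy and statsmodels on simulated data, and treating every observation as its own covariate pattern, i.e. $m_j = 1$) that computes the diagonal $h_j$ from the formula above and verifies that the leverages sum to the number of parameters:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))   # design matrix: intercept + 2 predictors
beta = np.array([-0.5, 1.0, -2.0])             # assumed true coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))

fit = sm.Logit(y, X).fit(disp=0)
pi_hat = fit.predict(X)                        # estimated probabilities pi(x_j)
v = pi_hat * (1 - pi_hat)                      # diagonal of V, with m_j = 1

# h_j = v_j * x_j' (X'VX)^{-1} x_j  -- only the diagonal of H is needed
XtVX_inv = np.linalg.inv(X.T @ (v[:, None] * X))
h = v * np.einsum("ij,jk,ik->i", X, XtVX_inv, X)

print(h.sum())            # approx. 3.0: the number of model parameters
print(h.min(), h.max())   # all leverages lie between 0 and 1
```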

The interpretation of the leverage values in the hat matrix depends on the estimated probability $\pi$. If $0.1 < \pi < 0.9$, you can interpret the leverage values in a similar fashion as in the linear regression case, i.e. being further away from the mean gives you higher values. At the extreme ends of the probability distribution, however, these leverage values might no longer measure distance in the same sense. This is shown in the figure below, taken from Hosmer and Lemeshow (2000):

[Figure from Hosmer and Lemeshow (2000): leverage as a function of the estimated probability.]

In this case the most extreme values in the covariate space can yield the smallest leverage, contrary to the linear regression case. The reason is that leverage in linear regression is a monotonic function of the distance from the covariate mean, which is not true for the non-linear logistic regression. The above formulation of the diagonal elements does contain a monotonically increasing part representing distance from the mean, namely the $x'_j (X'VX)^{-1}x_j$ term, which you can look at if you are only interested in distance per se. The majority of diagnostic statistics for logistic regression utilize the full leverage $h_j$, however, so this separate monotonic part is rarely considered alone.
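A hypothetical one-predictor simulation can illustrate this (again Python on simulated data; the steep slope is an assumed choice made to push fitted probabilities toward 0 and 1): the distance term $x'_j (X'VX)^{-1}x_j$ peaks at the ends of the covariate range, while the full leverage $h_j$ typically peaks well inside it.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 201)
X = sm.add_constant(x)
y = rng.binomial(1, 1 / (1 + np.exp(-2.5 * x)))   # steep true curve (assumed)

fit = sm.Logit(y, X).fit(disp=0)
pi_hat = fit.predict(X)
v = pi_hat * (1 - pi_hat)                          # shrinks toward 0 at extreme x
XtVX_inv = np.linalg.inv(X.T @ (v[:, None] * X))
dist = np.einsum("ij,jk,ik->i", X, XtVX_inv, X)    # distance part, grows with |x|
h = v * dist                                       # full leverage h_j

print("distance part peaks at x =", x[np.argmax(dist)])  # an endpoint of the range
print("full leverage peaks at x =", x[np.argmax(h)])     # typically interior
```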

If you want to dig deeper into this topic, have a look at the paper by Pregibon (1981), who derived the logistic hat matrix, and the book by Hosmer and Lemeshow (2000).