How to test for simultaneous equality of chosen coefficients in a logit or probit model? What is the standard approach, and what is the state-of-the-art approach?
Solved – How to test for simultaneous equality of chosen coefficients in a logit or probit model
Tags: hypothesis testing, logit, probit
Related Solutions
I think a better way to see the marginal effect of a given variable, say $X_j$, is to produce a scatter plot with the predicted probability on the vertical axis and $X_j$ on the horizontal axis. This is the most "layman" way I can think of to show how influential a given variable is. No maths, just pictures. If you have a lot of data points, then a boxplot or a scatterplot smoother may help to show where most of the data is (as opposed to just a cloud of points).
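As a minimal sketch in R (the data frame `dat` and the variables `y`, `x1`, `x2` are hypothetical placeholders, not from the original question):

```r
# Hypothetical data: binary outcome y, predictors x1 and x2
fit <- glm(y ~ x1 + x2, data = dat, family = binomial)

# Predicted probability against x1: a picture of x1's influence
plot(dat$x1, predict(fit, type = "response"),
     xlab = "x1", ylab = "Predicted probability")
```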
Not sure how "layman" the next section is, but you may find it useful.
If we look at the marginal effect, call it $m_j$, noting that $g(p)=\sum_kX_k\beta_k$, we get
$$m_j=\frac{\partial p}{\partial X_j}=\frac{\beta_j}{g'\left[g^{-1}(X^T\beta)\right]}=\frac{\beta_j}{g'(p)}$$
So the marginal effect depends on the estimated probability and the gradient of the link function, in addition to the beta. The division by $g'(p)$ comes from the chain rule for differentiation and the fact that $\frac{\partial g^{-1}(z)}{\partial z}=\frac{1}{g'\left[g^{-1}(z)\right]}$, which can be shown by differentiating both sides of the obviously true equation $z=g\left[g^{-1}(z)\right]$. We also have $g^{-1}(X^T\beta)=p$ by definition. For a logit model, $g(p)=\log(p)-\log(1-p)\implies g'(p)=\frac{1}{p}+\frac{1}{1-p}=\frac{1}{p(1-p)}$, and the marginal effect is:
$$m_j^{logit}=\beta_jp(1-p)$$
What does this mean? Well, $p(1-p)$ is zero at $p=0$ and at $p=1$, and it reaches its maximum value of $0.25$ at $p=0.5$. So the marginal effect is greatest when the probability is near $0.5$ and smallest when $p$ is near $0$ or $1$. However, $p(1-p)$ still depends on $X_j$, so the marginal effects are complicated. In fact, because it depends on $p$, you will get a different marginal effect for different values of $X_k,\;k\neq j$. That is possibly one good reason to just make the simple scatter plot: you don't need to choose which values of the covariates to use.
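For instance, continuing the hypothetical logit fit `fit` from above, the marginal effect of `x1` can be evaluated at each observation's fitted probability:

```r
# Logit marginal effect of x1 at each observation: beta_1 * p * (1 - p)
p.hat <- predict(fit, type = "response")
me.x1 <- coef(fit)["x1"] * p.hat * (1 - p.hat)
summary(me.x1)  # varies across observations because p does
```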
For a probit model, we have $g(p)=\Phi^{-1}(p)\implies g'(p)=\frac{1}{\phi\left[\Phi^{-1}(p)\right]}$ where $\Phi(.)$ is standard normal CDF, and $\phi(.)$ is standard normal pdf. So we get:
$$m_j^{probit}=\beta_j\phi\left[\Phi^{-1}(p)\right]$$
Note that this has most of the properties of the $m_j^{logit}$ marginal effect discussed earlier, and the same is true of any link function that is symmetric about $0.5$ (and sane, of course, e.g. $g(p)=\tan\left(\frac{\pi}{2}[2p-1]\right)$). The dependence on $p$ is more complicated, but it still has the general "hump" shape (highest point at $0.5$, lowest at $0$ and $1$). The link function changes the maximum height (e.g. the probit maximum is $\frac{1}{\sqrt{2\pi}}\approx 0.4$, the logit maximum is $0.25$) and how quickly the marginal effect tapers towards zero.
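The analogous computation for a probit fit (again on the hypothetical `dat`) uses $\phi\left[\Phi^{-1}(p)\right]$, which is `dnorm(qnorm(p))` in R:

```r
# Probit fit; marginal effect of x1 is beta_1 * dnorm(qnorm(p))
fit.probit <- glm(y ~ x1 + x2, data = dat,
                  family = binomial(link = "probit"))
p.hat <- predict(fit.probit, type = "response")
me.x1 <- coef(fit.probit)["x1"] * dnorm(qnorm(p.hat))
```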
The question of what model to use has to do with the objective of the analysis.
If the objective is to develop a classifier to predict binary outcomes, then (as you can see) these three models (logit, probit, and NLS) are all approximately the same and give you approximately the same classifier. That makes the choice a moot point: you don't care which model produces your classifier, and you might use cross-validation or split-sample validation to determine which model performs best on similar data.
For inference, the models estimate different parameters. All three regression models are special cases of GLMs, which use a link function and a variance structure to determine the relationship between a binary outcome and (in this case) a continuous predictor. The NLS and logistic regression models use the same link function (the logit), but NLS minimizes the squared error in fitting the S-curve, whereas logistic regression maximizes the likelihood of the observed outcomes under a linear model for the log-odds and a Bernoulli distribution for the binary outcomes. I can't think of a reason why we'd consider NLS useful for inference.
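To make the contrast concrete, here is a sketch on the same hypothetical `dat`; the `nls` starting values `b0` and `b1` are guesses:

```r
# Maximum likelihood fit (logistic regression)
fit.ml <- glm(y ~ x1, data = dat, family = binomial)

# Least-squares fit of the same S-curve via NLS
fit.nls <- nls(y ~ 1 / (1 + exp(-(b0 + b1 * x1))),
               data = dat, start = list(b0 = 0, b1 = 0.1))
```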
Probit regression uses a different link function, the inverse of the cumulative normal distribution function. This "tapers" faster than a logit and is often used for inference on binary data that arises by thresholding unobserved, continuous, normally distributed outcomes.
Empirically, the logistic regression model is used far more often for the analysis of binary data, since its coefficients are easy to interpret (exponentiated, they are odds ratios), it is a maximum likelihood technique, and it has good convergence properties.
Best Answer
Wald test
One standard approach is the Wald test. This is what the Stata command `test` does after a logit or probit regression. Let's see how this works in R by looking at an example. Say you want to test the hypothesis $\beta_{gre}=\beta_{gpa}$ against $\beta_{gre}\neq \beta_{gpa}$; this is equivalent to testing $\beta_{gre} - \beta_{gpa} = 0$.
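The calculations below assume the well-known UCLA graduate-admissions data (variables `admit`, `gre`, `gpa`, `rank`); the dataset and its URL are an assumption on my part, used only to make the sketch reproducible:

```r
# Assumed example data: UCLA graduate-admissions dataset
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)

# Logistic regression of admission on GRE score, GPA and school rank
mod <- glm(admit ~ gre + gpa + rank, data = mydata, family = binomial)
summary(mod)
```

The Wald test statistic is: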
$$ W=\frac{\hat{\theta}-\theta_{0}}{\widehat{\operatorname{se}}(\hat{\theta})}\sim \mathcal{N}(0,1) $$
or
$$ W^2 = \frac{(\hat{\theta}-\theta_{0})^2}{\operatorname{Var}(\hat{\theta})}\sim \chi_{1}^2 $$
Our $\hat{\theta}$ here is $\hat{\beta}_{gre} - \hat{\beta}_{gpa}$ and $\theta_{0}=0$. So all we need is the standard error of $\hat{\beta}_{gre} - \hat{\beta}_{gpa}$, which we can calculate with the delta method:
$$ \widehat{\operatorname{se}}(\hat{\beta}_{gre} - \hat{\beta}_{gpa})\approx \sqrt{\operatorname{Var}(\hat{\beta}_{gre}) + \operatorname{Var}(\hat{\beta}_{gpa}) - 2\cdot \operatorname{Cov}(\hat{\beta}_{gre},\hat{\beta}_{gpa})} $$
So we also need the covariance of $\hat{\beta}_{gre}$ and $\hat{\beta}_{gpa}$. The variance-covariance matrix can be extracted with the `vcov` function after running the logistic regression. From it we can calculate the standard error, the Wald $z$-value, and, using the standard normal distribution, the $p$-value:
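A sketch, continuing with the assumed model `mod` from above:

```r
# Variance-covariance matrix of the estimated coefficients
vc <- vcov(mod)

# Standard error of the difference, using the formula above
se.diff <- sqrt(vc["gre", "gre"] + vc["gpa", "gpa"] - 2 * vc["gre", "gpa"])

# Wald z-value and two-sided p-value from the standard normal
z <- (coef(mod)["gre"] - coef(mod)["gpa"]) / se.diff
2 * pnorm(abs(z), lower.tail = FALSE)
```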
In this case, we have evidence that the coefficients are different from each other. This approach can be extended to more than two coefficients.
Using multcomp

These rather tedious calculations can be conveniently done in R using the `multcomp` package. Here is the same example as above, done with `multcomp`:
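A sketch, again reusing the assumed model `mod`; `glht` accepts the hypothesis as a character string:

```r
library(multcomp)

# Test the single linear hypothesis beta_gre - beta_gpa = 0
glht.mod <- glht(mod, linfct = "gre - gpa = 0")
summary(glht.mod)
```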
A confidence interval for the difference of the coefficients can also be calculated:
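Continuing the sketch:

```r
# 95% confidence interval for beta_gre - beta_gpa
confint(glht.mod)
```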
For additional examples of `multcomp`, see the package's documentation and vignettes.

Likelihood ratio test (LRT)
The coefficients of a logistic regression are found by maximum likelihood. Because the likelihood function is a product over observations, it is the log-likelihood that is maximized, turning the products into sums. The model that fits better has a higher log-likelihood, and the model involving more variables always has at least the same likelihood as the null model. Denoting the log-likelihood of the alternative model (the model containing more variables) by $LL_{a}$ and that of the null model by $LL_{0}$, the likelihood ratio test statistic is:
$$ D=2\cdot (LL_{a} - LL_{0})\sim \chi_{df_{a}-df_{0}}^{2} $$
The likelihood ratio test statistic follows a $\chi^{2}$-distribution with degrees of freedom $df_{a}-df_{0}$, the difference in the number of estimated parameters between the two models. In our case this is 1: the full model estimates separate coefficients for gre and gpa, while the constrained model estimates a single shared coefficient.
To perform the likelihood ratio test, we also need to fit the model with the constraint $\beta_{gre}=\beta_{gpa}$ so that the two likelihoods can be compared. The full model has the form $$\log\left(\frac{p_{i}}{1-p_{i}}\right)=\beta_{0}+\beta_{1}\cdot \mathrm{gre} + \beta_{2}\cdot \mathrm{gpa}+\beta_{3}\cdot \mathrm{rank_{2}} + \beta_{4}\cdot \mathrm{rank_{3}}+\beta_{5}\cdot \mathrm{rank_{4}}$$ The constrained model has the form $$\log\left(\frac{p_{i}}{1-p_{i}}\right)=\beta_{0}+\beta_{1}\cdot (\mathrm{gre} + \mathrm{gpa})+\beta_{2}\cdot \mathrm{rank_{2}} + \beta_{3}\cdot \mathrm{rank_{3}}+\beta_{4}\cdot \mathrm{rank_{4}}$$
In our case, we can use `logLik` to extract the log-likelihoods of the two models after fitting both logistic regressions. The model containing the constraint on `gre` and `gpa` has a slightly lower log-likelihood ($-232.24$) than the full model ($-229.26$), as it must for a nested model. We can then calculate the likelihood ratio test statistic and use the CDF of the $\chi^{2}_{1}$ distribution to get the $p$-value:
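A sketch, continuing with the assumed admissions data and model `mod` from above:

```r
# Constrained model: gre and gpa are forced to share one coefficient
mod.constrained <- glm(admit ~ I(gre + gpa) + rank,
                       data = mydata, family = binomial)

logLik(mod)              # full model: -229.26 in the text above
logLik(mod.constrained)  # constrained model: -232.24

# Likelihood ratio statistic and p-value on 1 degree of freedom
D <- 2 * (as.numeric(logLik(mod)) - as.numeric(logLik(mod.constrained)))
pchisq(D, df = 1, lower.tail = FALSE)
```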
The $p$-value is small (about $0.015$, given the log-likelihoods reported above), indicating that the coefficients are different from each other.
R has the likelihood ratio test built in; we can use the `anova` function to calculate the likelihood ratio test:
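A sketch comparing the two (assumed) fitted models from above:

```r
# Built-in likelihood ratio test of the two nested models
anova(mod.constrained, mod, test = "LRT")
```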
Again, we have strong evidence that the coefficients of `gre` and `gpa` are significantly different from each other.

Score test (aka Rao's score test, aka Lagrange multiplier test)
The score function $U(\theta)$ is the derivative of the log-likelihood function $\log L(\theta\mid x)$, where $\theta$ denotes the parameters and $x$ the data (the univariate case is shown here for illustration purposes):
$$ U(\theta) = \frac{\partial \log L(\theta\mid x)}{\partial \theta} $$
This is basically the slope of the log-likelihood function. Further, let $I(\theta)$ be the Fisher information, the negative expectation of the second derivative of the log-likelihood function with respect to $\theta$. The score test statistic is:
$$ S(\theta_{0})=\frac{U(\theta_{0})^{2}}{I(\theta_{0})}\sim\chi^{2}_{1} $$
The score test can also be calculated using `anova` (the score test statistic is called "Rao"):
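A sketch with the same two assumed models:

```r
# Score (Rao) test of the two nested models
anova(mod.constrained, mod, test = "Rao")
```

The conclusion is the same as before.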
Note
An interesting relationship between the different test statistics when the model is linear (Johnston and DiNardo, 1997, Econometric Methods) is: Wald $\geq$ LR $\geq$ Score.