Is it possible to have an SES where the component equations are probabilistic, say, logit or probit? I am evaluating a number of quality metrics of services provided by a number of providers. The metrics are binary (pass/fail quality certification). The obvious approach would appear to be to estimate the logit/probit equations, conditional upon provider characteristics, jointly in something like an SUR, but I can't find anything like that in the discrete choice literature.
Solved – Simultaneous Equation System for logit/probit

logit, probit, r, regression
Related Solutions
Wald test
One standard approach is the Wald test. This is what the Stata command test does after a logit or probit regression. Let's see how this works in R by looking at an example:
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv") # Load dataset from the web
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial") # calculate the logistic regression
summary(mylogit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Say you want to test the hypothesis $\beta_{gre}=\beta_{gpa}$ vs. $\beta_{gre}\neq \beta_{gpa}$. This is equivalent to testing $\beta_{gre} - \beta_{gpa} = 0$. The Wald test statistic is:
$$ W=\frac{\hat{\theta}-\theta_{0}}{\widehat{\operatorname{se}}(\hat{\theta})}\sim \mathcal{N}(0,1) $$
or
$$ W^2 = \frac{(\hat{\theta}-\theta_{0})^2}{\operatorname{Var}(\hat{\theta})}\sim \chi_{1}^2 $$
Our $\widehat{\theta}$ here is $\beta_{gre} - \beta_{gpa}$ and $\theta_{0}=0$. So all we need is the standard error of $\beta_{gre} - \beta_{gpa}$. We can calculate it with the delta method (which is exact here, because the difference is linear in the coefficients):
$$ \widehat{\operatorname{se}}(\beta_{gre} - \beta_{gpa})\approx \sqrt{\operatorname{Var}(\beta_{gre}) + \operatorname{Var}(\beta_{gpa}) - 2\cdot \operatorname{Cov}(\beta_{gre},\beta_{gpa})} $$
So we also need the covariance of $\beta_{gre}$ and $\beta_{gpa}$. The variance-covariance matrix can be extracted with the vcov command after running the logistic regression:
var.mat <- vcov(mylogit)[c("gre", "gpa"),c("gre", "gpa")]
colnames(var.mat) <- rownames(var.mat) <- c("gre", "gpa")
gre gpa
gre 1.196831e-06 -0.0001241775
gpa -1.241775e-04 0.1101040465
Finally, we can calculate the standard error:
se <- sqrt(1.196831e-06 + 0.1101040465 - 2*(-0.0001241775))
se
[1] 0.3321951
So your Wald $z$-value is
gre <- unname(coef(mylogit)["gre"])  # extract the two coefficient estimates
gpa <- unname(coef(mylogit)["gpa"])
wald.z <- (gre - gpa)/se
wald.z
[1] -2.413564
To get a $p$-value, just use the standard normal distribution:
2*pnorm(-2.413564)
[1] 0.01579735
In this case we have evidence that the coefficients are different from each other. This approach can be extended to more than two coefficients.
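The same test can also be written as a quadratic form, which generalizes directly to several linear restrictions at once. Below is a minimal sketch on simulated data (the names y, x1, x2 and the contrast matrix R are illustrative, not taken from the example above):

```r
# Wald test of b_x1 = b_x2 written as W = d' V^{-1} d, on simulated data.
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + 0.5 * x1 + 0.9 * x2))
fit <- glm(y ~ x1 + x2, family = binomial)

R  <- matrix(c(0, 1, -1), nrow = 1)        # one restriction: b_x1 - b_x2 = 0
d  <- R %*% coef(fit)                      # estimated difference
V  <- R %*% vcov(fit) %*% t(R)             # variance of the difference
W2 <- as.numeric(t(d) %*% solve(V) %*% d)  # Wald chi-square statistic
p  <- pchisq(W2, df = nrow(R), lower.tail = FALSE)
```

Adding more rows to R tests several restrictions jointly; the statistic is then compared against a chi-square distribution with as many degrees of freedom as there are rows.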
Using multcomp

These rather tedious calculations can be conveniently done in R using the multcomp package. Here is the same example as above, done with multcomp:
library(multcomp)
glht.mod <- glht(mylogit, linfct = c("gre - gpa = 0"))
summary(glht.mod)
Linear Hypotheses:
Estimate Std. Error z value Pr(>|z|)
gre - gpa == 0 -0.8018 0.3322 -2.414 0.0158 *
A confidence interval for the difference of the coefficients can also be calculated:

confint(glht.mod)

Quantile = 1.96
95% family-wise confidence level

Linear Hypotheses:
Estimate lwr upr
gre - gpa == 0 -0.8018 -1.4529 -0.1507
For additional examples of multcomp, see the package's vignettes.
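glht also accepts several hypotheses at once and adjusts the p-values for multiplicity. A small sketch on simulated data (the variable names are made up; it assumes multcomp is installed):

```r
# Two linear hypotheses on a logit model, tested jointly with multcomp.
library(multcomp)
set.seed(5)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.2 + 0.4 * x1 + 0.4 * x2 + 0.9 * x3))
fit <- glm(y ~ x1 + x2 + x3, family = binomial)

# both restrictions tested together, with a multiplicity adjustment
gl <- glht(fit, linfct = c("x1 - x2 = 0", "x1 - x3 = 0"))
summary(gl)
```

The summary reports one adjusted p-value per hypothesis (single-step adjustment by default).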
Likelihood ratio test (LRT)
The coefficients of a logistic regression are found by maximum likelihood. But because the likelihood function involves a lot of products, the log-likelihood is maximized instead, which turns the products into sums. The model that fits better has a higher log-likelihood, and the model involving more variables has at least the same likelihood as the null model. Denoting the log-likelihood of the alternative model (the model containing more variables) by $LL_{a}$ and the log-likelihood of the null model by $LL_{0}$, the likelihood ratio test statistic is:
$$ D=2\cdot (LL_{a} - LL_{0})\sim \chi_{df_{a}-df_{0}}^{2} $$
The likelihood ratio test statistic follows a $\chi^{2}$-distribution with the degrees of freedom being the difference in the number of parameters. In our case, this is 1.
To perform the likelihood ratio test, we also need to fit the model with the constraint $\beta_{gre}=\beta_{gpa}$ to be able to compare the two likelihoods. The full model has the form $$\log\left(\frac{p_{i}}{1-p_{i}}\right)=\beta_{0}+\beta_{1}\cdot \mathrm{gre} + \beta_{2}\cdot \mathrm{gpa}+\beta_{3}\cdot \mathrm{rank_{2}} + \beta_{4}\cdot \mathrm{rank_{3}}+\beta_{5}\cdot \mathrm{rank_{4}}$$ The constrained model has the form $$\log\left(\frac{p_{i}}{1-p_{i}}\right)=\beta_{0}+\beta_{1}\cdot (\mathrm{gre} + \mathrm{gpa})+\beta_{2}\cdot \mathrm{rank_{2}} + \beta_{3}\cdot \mathrm{rank_{3}}+\beta_{4}\cdot \mathrm{rank_{4}}$$
mylogit2 <- glm(admit ~ I(gre + gpa) + rank, data = mydata, family = "binomial")
In our case, we can use logLik to extract the log-likelihood of the two models after a logistic regression:
L1 <- logLik(mylogit)
L1
'log Lik.' -229.2587 (df=6)
L2 <- logLik(mylogit2)
L2
'log Lik.' -232.2416 (df=5)
The model containing the constraint on gre and gpa has a slightly lower log-likelihood (-232.24) compared to the full model (-229.26). Our likelihood ratio test statistic is:
D <- as.numeric(2*(L1 - L2))  # drop the logLik attributes
round(D, 4)
[1] 5.9658
We can now use the CDF of the $\chi^{2}_{1}$ distribution to calculate the $p$-value:
1-pchisq(D, df=1)
[1] 0.01458625
The $p$-value is very small indicating that the coefficients are different.
R has the likelihood ratio test built in; we can use the anova function to calculate the likelihood ratio test:
anova(mylogit2, mylogit, test="LRT")
Analysis of Deviance Table
Model 1: admit ~ I(gre + gpa) + rank
Model 2: admit ~ gre + gpa + rank
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 395 464.48
2 394 458.52 1 5.9658 0.01459 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Again, we have strong evidence that the coefficients of gre and gpa are significantly different from each other.
Score test (aka Rao's Score test aka Lagrange multiplier test)
The score function $U(\theta)$ is the derivative of the log-likelihood function $\log L(\theta|x)$, where $\theta$ are the parameters and $x$ the data (the univariate case is shown here for illustration purposes):
$$ U(\theta) = \frac{\partial \log L(\theta|x)}{\partial \theta} $$
This is basically the slope of the log-likelihood function. Further, let $I(\theta)$ be the Fisher information, the negative expectation of the second derivative of the log-likelihood function with respect to $\theta$. The score test statistic is:
$$ S(\theta_{0})=\frac{U(\theta_{0})^{2}}{I(\theta_{0})}\sim\chi^{2}_{1} $$
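For intuition, the statistic can also be computed by hand in the logistic case: evaluate the score and the Fisher information of the unrestricted model at the restricted fit. A sketch on simulated data (the names y, x1, x2 are illustrative, not from the example above):

```r
# Rao score test of b_x1 = b_x2 for a logit model, computed at the restricted fit.
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + 0.5 * x1 + 0.9 * x2))

restricted <- glm(y ~ I(x1 + x2), family = binomial)  # imposes b_x1 = b_x2
X  <- model.matrix(~ x1 + x2)               # design of the unrestricted model
p0 <- fitted(restricted)                    # probabilities under the constraint
U  <- t(X) %*% (y - p0)                     # score of the full log-likelihood
I0 <- t(X) %*% (X * (p0 * (1 - p0)))        # Fisher information at the restriction
S  <- as.numeric(t(U) %*% solve(I0) %*% U)  # ~ chi-square with 1 df
pchisq(S, df = 1, lower.tail = FALSE)
```

Because the logit link is canonical, observed and expected information coincide, so this matches what anova reports with test = "Rao".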
The score test can also be calculated using anova (the score test statistic is labelled "Rao"):
anova(mylogit2, mylogit, test="Rao")
Analysis of Deviance Table
Model 1: admit ~ I(gre + gpa) + rank
Model 2: admit ~ gre + gpa + rank
Resid. Df Resid. Dev Df Deviance Rao Pr(>Chi)
1 395 464.48
2 394 458.52 1 5.9658 5.9144 0.01502 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The conclusion is the same as before.
Note
For linear models there is an interesting ordering of the three test statistics (Johnston and DiNardo (1997), Econometric Methods): Wald $\geq$ LR $\geq$ Score.
It is hard to answer your question, as you do not say what background you have. Here is a general answer.
You want to find $\mathbb{P}[Y_i = 1 | X_i]$ when you have a bunch of data that comes from the same distribution, that is, the result ($Y_i$) is influenced the same way by the inputs ($X_i$) of each sample you have. Read more: I.I.D. (Wikipedia).
You know $X_i$ and $Y_i$ for your samples, and you want to find a function that maps the inputs to the outputs as close as possible. This is what the probit/logit models are doing, and they have a parameter $\beta$ to dictate the influence of the different input variables. To find the best $\beta$, we have first to define what the best is.
In linear regression models, the best model is often defined as the one with the highest $R^2$, or the one with the smallest Mean Squared Error (Wikipedia). In classification tasks, a common measure is the logistic loss (Wikipedia).
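Concretely, for a binary outcome $y\in\{0,1\}$ with predicted probability $p=f(x,\beta)$, the logistic loss takes the form $$L(y, p) = -\big(y\log p + (1-y)\log(1-p)\big)$$ which is exactly the negative log-likelihood of a Bernoulli observation, so minimizing it is the same as maximizing the likelihood.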
The value of the loss function is $L(y, f(x,\beta))$, where $L$ is the loss function, $x,y$ are the input and output variables, and $\beta$ are the parameters of your model. It compares your model's predictions with the real result $y$ and gives you a metric to evaluate your model. You want the lowest error with respect to $\beta$: $$\hat{\beta} = \arg \min_{\beta} L(y,f(x,\beta))$$ Depending on your model and cost function, you can set the derivative to 0 and solve for the $\beta$ that achieves it based on the data. In the case of the transformed models you are interested in, this is not possible: as mentioned in the comments, no closed-form solution exists. However, you can come very close to the optimal solution by using gradient descent (Wikipedia). The idea is that the derivative of the cost function at a given $\beta$ points in the direction of steepest increase with respect to $\beta$; if you step in the opposite direction, your $\beta$ will improve and the error will decrease. As long as your cost function is convex, you will find the global minimum, i.e. the optimal $\beta$.
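As a toy illustration of that idea (the data, step size, and iteration count below are made up, not from the answer), plain gradient descent on the mean logistic loss recovers essentially the same coefficients that glm finds by iteratively reweighted least squares:

```r
# Gradient descent for logistic regression on simulated data.
set.seed(3)
n <- 2000
X <- cbind(1, rnorm(n))                  # intercept + one predictor
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1.2)))

beta <- c(0, 0)                          # start at zero
lr   <- 0.5                              # step size (chosen by hand)
for (i in 1:20000) {
  p    <- plogis(X %*% beta)             # current predicted probabilities
  grad <- t(X) %*% (p - y) / n           # gradient of the mean logistic loss
  beta <- beta - lr * grad               # step against the gradient
}
```

Because the logistic loss is convex in $\beta$, the iterates approach the unique maximum-likelihood solution.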
Finding the parameters of a logistic regression is covered in the introduction of many books and classes on machine learning. For a more specific answer, I'd advise looking at:
- Elements of Statistical Learning, by Hastie, Tibshirani and Friedman (Book website, free download)
- Pattern Recognition and Machine Learning, by Christopher Bishop.
- Machine Learning: A Probabilistic Perspective, by Kevin Murphy.
- Andrew Ng's MOOC
- Any other book/mooc/class on statistical/machine learning will introduce this concept.
You can also take a look at questions on a similar topic on CrossValidated.
Best Answer
This is possible. The type of model you need is called multivariate probit. For a textbook treatment you can refer to Greene's Econometric Analysis.
However, from a computational point of view, these models can be laborious: convergence can be slow or can even fail.
Multivariate probit models are implemented in R and in Stata.
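To illustrate the idea in the bivariate case, a hand-rolled sketch (not a recommended production implementation): two probit equations with correlated errors fit by direct maximum likelihood. It assumes the mvtnorm package is available; all names and values are invented:

```r
# Bivariate probit by direct ML: P(y1, y2) is a bivariate normal orthant
# probability with sign flips q = 2y - 1 for each outcome.
library(mvtnorm)

set.seed(7)
n <- 200
x <- rnorm(n)
e <- rmvnorm(n, sigma = matrix(c(1, 0.5, 0.5, 1), 2))   # correlated shocks
y1 <- as.numeric( 0.3 + 0.8 * x + e[, 1] > 0)
y2 <- as.numeric(-0.2 + 0.5 * x + e[, 2] > 0)

nll <- function(par) {
  m1  <- par[1] + par[2] * x
  m2  <- par[3] + par[4] * x
  rho <- tanh(par[5])                    # keeps the correlation in (-1, 1)
  s1  <- 2 * y1 - 1                      # sign flip for outcome 1
  s2  <- 2 * y2 - 1                      # sign flip for outcome 2
  ll  <- mapply(function(a, b, r)
           pmvnorm(upper = c(a, b), corr = matrix(c(1, r, r, 1), 2)),
         s1 * m1, s2 * m2, s1 * s2 * rho)
  -sum(log(pmax(ll, 1e-300)))            # guard against log(0)
}

fit <- optim(rep(0, 5), nll, method = "BFGS")
rho_hat <- tanh(fit$par[5])              # estimated error correlation
```

This brute-force approach scales badly with the number of equations, which is exactly the computational burden the answer alludes to; dedicated implementations use smarter evaluation of the multivariate normal probabilities.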