As pointed out in comments, the more complicated model has too many predictors to be taken seriously. I focus here on models with 6 predictors.
I have used MATLAB very occasionally in the past, but not for any related purpose, and am emphatically no expert. On the face of it, though, your MATLAB call does not feed a fractional response (outcome, dependent variable) to the function either directly or indirectly, and it is surprising that it produces any output at all. If you are using the same naming conventions across different software, then FC
is a count, not a binary response, and not a suitable input for a logit function.
Here is the result of a calculation in Stata 14 of a logit model in which the response is treated as a continuous proportion:
. fracreg logit fcperhab ddp schabs train exp_hab exp_ratio sch_tot
Iteration 0: log pseudolikelihood = -19.110279
Iteration 1: log pseudolikelihood = -18.06008
Iteration 2: log pseudolikelihood = -17.979059
Iteration 3: log pseudolikelihood = -17.976617
Iteration 4: log pseudolikelihood = -17.976616
Fractional logistic regression Number of obs = 30
Wald chi2(6) = 21.06
Prob > chi2 = 0.0018
Log pseudolikelihood = -17.976616 Pseudo R2 = 0.0865
------------------------------------------------------------------------------
| Robust
fcperhab | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ddp | .0033372 .0044366 0.75 0.452 -.0053583 .0120327
schabs | .0195366 .0162846 1.20 0.230 -.0123806 .0514538
train | .0000243 8.40e-06 2.90 0.004 7.85e-06 .0000408
exp_hab | -4.832934 10.20002 -0.47 0.636 -24.8246 15.15873
exp_ratio | -.0016535 .904611 -0.00 0.999 -1.774658 1.771351
sch_tot | 1.15e-06 9.91e-07 1.16 0.248 -7.98e-07 3.09e-06
_cons | -.1102337 .3944965 -0.28 0.780 -.8834326 .6629653
------------------------------------------------------------------------------
For all that the results may seem disappointing, there is nothing pathological about the output. Non-Stata users can probably guess that _cons
means the intercept. As context, here are some summary statistics:
. su fcperhab ddp schabs train exp_hab exp_ratio sch_tot
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
fcperhab | 30 .6355612 .2472515 .2056616 .9978048
ddp | 30 7.766667 19.39016 0 85
schabs | 30 8.83 10.14478 0 40.2
train | 30 18741.83 26769.19 0 108643
exp_hab | 30 .0152065 .0195379 0 .1046035
-------------+---------------------------------------------------------
exp_ratio | 30 .3794279 .2442145 0 .7318189
sch_tot | 30 203229.1 410059.5 251 2198181
Stata users who have not yet upgraded to 14 (released 7 April 2015) may note that

. glm fcperhab ddp schabs train exp_hab exp_ratio sch_tot, link(logit) f(binomial) vce(robust)

gives the same calculation. But it is important, in whatever version is used, to spell out to Stata that binomial is at best a convenient fiction here. In fracreg
that is automatic; with glm, vce(robust)
is the signal. If we omit it, the results are quite different, but still free of pathologically high or low P-values:
. glm fcperhab ddp schabs train exp_hab exp_ratio sch_tot, link(logit) f(binomial)
note: fcperhab has noninteger values
Iteration 0: log likelihood = -13.058365
Iteration 1: log likelihood = -12.952102
Iteration 2: log likelihood = -12.950226
Iteration 3: log likelihood = -12.950225
Iteration 4: log likelihood = -12.950225
Generalized linear models No. of obs = 30
Optimization : ML Residual df = 23
Scale parameter = 1
Deviance = 5.476451096 (1/df) Deviance = .2381066
Pearson = 4.86766495 (1/df) Pearson = .2116376
Variance function: V(u) = u*(1-u/1) [Binomial]
Link function : g(u) = ln(u/(1-u)) [Logit]
AIC = 1.330015
Log likelihood = -12.95022455 BIC = -72.75109
------------------------------------------------------------------------------
| OIM
fcperhab | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ddp | .0033372 .023125 0.14 0.885 -.0419869 .0486613
schabs | .0195366 .0430982 0.45 0.650 -.0649343 .1040075
train | .0000243 .0000275 0.88 0.377 -.0000297 .0000783
exp_hab | -4.832937 27.53167 -0.18 0.861 -58.79402 49.12815
exp_ratio | -.0016538 1.961583 -0.00 0.999 -3.846287 3.842979
sch_tot | 1.15e-06 2.69e-06 0.43 0.671 -4.13e-06 6.43e-06
_cons | -.1102337 .785408 -0.14 0.888 -1.649605 1.429138
------------------------------------------------------------------------------
There are many other scientific and statistical issues that are not clear without further discussion and analysis; I will raise just two:
Some predictors appear to be absolute counts or amounts, and it's not at all clear why absolute values are natural here.
Some of the predictors may need, or benefit from, transformation; a purely illustrative sketch follows.
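As a hedged sketch only (variable names follow the output above; whether these particular transformations help these data is an empirical question), the heavily skewed count-like predictors might be logged before refitting:

* illustrative only: log the skewed count-like predictors
generate ln_train = ln(train + 1)    // train includes zeros, hence the +1 shift
generate ln_schtot = ln(sch_tot)     // sch_tot is strictly positive (min 251)
fracreg logit fcperhab ddp schabs ln_train exp_hab exp_ratio ln_schtot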
The tests shown by Stata for the individual coefficients are Wald tests whose test statistics have the form $$ W_j = \frac{\hat{\beta}_j}{\widehat{\operatorname{SE}}(\hat{\beta}_j)}\sim N(0,1) $$
Under the hypothesis that an individual coefficient is zero, these statistics asymptotically follow the standard normal distribution. Hence, they are $z$-values. The corresponding $p$-values are calculated from the standard normal cdf $\Phi(x)$; specifically, for two-sided $p$-values, $p=2\cdot\Phi(-|z|)$. For example, in Stata the $p$-value for a coefficient on a predictor such as
mpg
can be computed directly from the stored estimation results, as sketched below.
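As a minimal sketch, assuming purely for illustration a logit fit on Stata's shipped auto dataset (the same recipe applies to any stored coefficient and standard error):

* illustrative only: reproduce a reported two-sided p-value by hand
sysuse auto, clear
logit foreign mpg
display "z = " _b[mpg]/_se[mpg]                  // Wald z statistic
display "p = " 2*normal(-abs(_b[mpg]/_se[mpg]))  // p = 2*Phi(-|z|)

The result should match the z and P>|z| entries that logit reports for mpg.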
Reference
Hosmer DW, Lemeshow S, Sturdivant RX (2013): Applied Logistic Regression. 3rd ed. Wiley.