Solved – Meaning of p-value of logistic regression model variables

interpretation, logistic, p-value, r, regression

So I'm working with logistic regression models in R. Though I'm still new to statistics, I feel I've gained some understanding of regression models by now, but something still bothers me:

The linked picture shows the summary R prints for an example model I created. The model tries to predict whether an email in the dataset will be refound (binary variable isRefound). The dataset contains two variables closely related to isRefound, namely next24 and next7days. Both are binary and indicate whether a mail will be clicked in the next 24 hours / next 7 days from the current point in the logs.

The high p-value should indicate that the impact this variable has on the model's prediction is pretty random, shouldn't it?
Based on this, I don't understand why the precision of the model's predictions drops below 10% when these two variables are left out of the model formula. If these variables show such low significance, why does removing them from the model have such a big impact?

Best regards and thanks in advance,
Rickyfox

[Image: R summary output for the example model]

EDIT:

First I removed only next24, which should have little impact because its coefficient is pretty small. As expected, little changed, so I won't upload a picture for that.

Removing next7days, though, had a big impact on the model: AIC up by 200k, precision down to 16% and recall down to 73%.

[Image: R summary output after removing next7days]

Best Answer

Basically, it looks like you have a multicollinearity problem. There is a lot of material available on this, starting with this website or Wikipedia.

Briefly, the two predictors appear to be genuinely related to your outcome but they are also probably highly correlated with each other (note that with more than two variables, it's still possible to have multicollinearity issues without strong bivariate correlations). This does of course make a lot of sense: All emails clicked within 24 hours have also been clicked within 7 days (by definition) and most emails have probably not been clicked at all (not in 24 hours and not in 7 days).
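To make the nesting argument concrete, here is a minimal Python sketch. The counts are hypothetical (they are not taken from the question's data); the only structural assumption is the one stated above, that every email clicked within 24 hours is also clicked within 7 days.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; for two 0/1 variables this is the phi coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: 1000 emails, 80 clicked within 24 hours,
# 20 more clicked only later within 7 days, 900 never clicked.
next24 = [1] * 80 + [0] * 920
next7days = [1] * 100 + [0] * 900

phi = pearson(next24, next7days)
print(f"correlation between next24 and next7days: {phi:.3f}")
```

With most emails never clicked and next24 nested inside next7days, the two indicators carry almost the same information, which is the situation in which individual coefficients become hard to estimate separately.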

One way this shows in the output you presented is the extremely large standard errors/CIs for the relevant coefficients (judging by the fact that you are using bigglm and that even tiny coefficients are highly significant, your sample size should be more than enough to get good estimates). Other things you can do to detect this type of problem: look at pairwise correlations, remove only one of the suspect variables (as suggested by @Nick Sabbe), or test the significance of both variables jointly.
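One common way to carry out such a joint test is a likelihood-ratio test. As a sketch, with notation that is assumed here rather than taken from the question: let $\ell_{\text{full}}$ be the maximized log-likelihood of the model containing both next24 and next7days, and $\ell_{\text{reduced}}$ that of the model with both removed.

```latex
\[
  \mathrm{LR} \;=\; 2\left(\ell_{\text{full}} - \ell_{\text{reduced}}\right)
  \;\overset{\text{approx.}}{\sim}\; \chi^2_{2}
  \qquad \text{under } H_0:\ \beta_{\text{next24}} = \beta_{\text{next7days}} = 0 .
\]
```

The two degrees of freedom correspond to the two coefficients being tested. A large LR statistic alongside individually non-significant coefficients is the typical signature of collinear predictors that matter jointly.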

More generally, a high p-value does not mean that the effect is small or random, only that there is no evidence that the coefficient differs from 0. The effect could also be very large; you just don't know (either because the sample size is too small or because there is some other issue with the model).
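A small Python sketch of this point, using the usual Wald statistic (coefficient divided by standard error) and a normal approximation. The coefficient and standard error below are hypothetical, since the numbers in the question's image are not available; `wald_summary` is an illustrative helper, not a library function.

```python
from math import erfc, sqrt

def wald_summary(coef, se, z_crit=1.959964):
    """Wald z statistic, two-sided normal p-value, and approximate 95% CI."""
    z = coef / se
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    ci = (coef - z_crit * se, coef + z_crit * se)
    return z, p_value, ci

# Hypothetical estimate: a large coefficient with a huge standard error.
z, p_value, ci = wald_summary(coef=12.0, se=40.0)
print(f"z = {z:.2f}, p = {p_value:.3f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```

The p-value is high, yet the confidence interval spans everything from strongly negative to strongly positive effects: the data are compatible with zero and with a very large effect alike.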

Related Solutions

R – How to Select the Best Subset of Variables for Parsimonious Binary Logistic Regression Models

Variable selection without penalization is invalid.

Solved – Why are all the p-values so low in logistic regression model

As pointed out in comments, the more complicated model has too many predictors to be taken seriously. I focus here on models with 6 predictors.

I have used MATLAB very occasionally in the past, but not for any related purpose, and am emphatically no expert. But on the face of it, your MATLAB call does not feed a fractional response (outcome, dependent variable) to the function either directly or indirectly, and it is amazing that it produces any output at all. If you are using the same naming conventions across different software, then FC is a count, not a binary response, and not a fit response for a logit function.

Here is the result of a calculation in Stata 14 of a logit model in which the response is treated as a continuous proportion:

. fracreg logit fcperhab ddp schabs train exp_hab  exp_ratio sch_tot

Iteration 0:   log pseudolikelihood = -19.110279  
Iteration 1:   log pseudolikelihood =  -18.06008  
Iteration 2:   log pseudolikelihood = -17.979059  
Iteration 3:   log pseudolikelihood = -17.976617  
Iteration 4:   log pseudolikelihood = -17.976616  

Fractional logistic regression                  Number of obs     =         30
                                                Wald chi2(6)      =      21.06
                                                Prob > chi2       =     0.0018
Log pseudolikelihood = -17.976616               Pseudo R2         =     0.0865

------------------------------------------------------------------------------
             |               Robust
    fcperhab |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ddp |   .0033372   .0044366     0.75   0.452    -.0053583    .0120327
      schabs |   .0195366   .0162846     1.20   0.230    -.0123806    .0514538
       train |   .0000243   8.40e-06     2.90   0.004     7.85e-06    .0000408
     exp_hab |  -4.832934   10.20002    -0.47   0.636     -24.8246    15.15873
   exp_ratio |  -.0016535    .904611    -0.00   0.999    -1.774658    1.771351
     sch_tot |   1.15e-06   9.91e-07     1.16   0.248    -7.98e-07    3.09e-06
       _cons |  -.1102337   .3944965    -0.28   0.780    -.8834326    .6629653
------------------------------------------------------------------------------

For all that the results may seem disappointing, there is nothing pathological about the output. Non-Stata users can probably guess that _cons means the intercept. As context here are some summary statistics:

. su fcperhab ddp schabs train exp_hab  exp_ratio sch_tot

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    fcperhab |         30    .6355612    .2472515   .2056616   .9978048
         ddp |         30    7.766667    19.39016          0         85
      schabs |         30        8.83    10.14478          0       40.2
       train |         30    18741.83    26769.19          0     108643
     exp_hab |         30    .0152065    .0195379          0   .1046035
-------------+---------------------------------------------------------
   exp_ratio |         30    .3794279    .2442145          0   .7318189
     sch_tot |         30    203229.1    410059.5        251    2198181

Stata users who have not yet upgraded to 14 [released 7 April 2015] may note that glm fcperhab ddp schabs train exp_hab exp_ratio sch_tot, link(logit) f(binomial) vce(robust) gives the same calculation. But it is important here to spell out to Stata, in whatever version is used, that binomial is at best a convenient fiction here. In fracreg that is automatic; otherwise, vce(robust) is that signal and if we omit it results are quite different, but still free of pathological very high or very low P-values:

. glm fcperhab ddp schabs train exp_hab  exp_ratio sch_tot, link(logit) f(binomial)
note: fcperhab has noninteger values

Iteration 0:   log likelihood = -13.058365  
Iteration 1:   log likelihood = -12.952102  
Iteration 2:   log likelihood = -12.950226  
Iteration 3:   log likelihood = -12.950225  
Iteration 4:   log likelihood = -12.950225  

Generalized linear models                         No. of obs      =         30
Optimization     : ML                             Residual df     =         23
                                                  Scale parameter =          1
Deviance         =  5.476451096                   (1/df) Deviance =   .2381066
Pearson          =   4.86766495                   (1/df) Pearson  =   .2116376

Variance function: V(u) = u*(1-u/1)               [Binomial]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =   1.330015
Log likelihood   = -12.95022455                   BIC             =  -72.75109

------------------------------------------------------------------------------
             |                 OIM
    fcperhab |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ddp |   .0033372    .023125     0.14   0.885    -.0419869    .0486613
      schabs |   .0195366   .0430982     0.45   0.650    -.0649343    .1040075
       train |   .0000243   .0000275     0.88   0.377    -.0000297    .0000783
     exp_hab |  -4.832937   27.53167    -0.18   0.861    -58.79402    49.12815
   exp_ratio |  -.0016538   1.961583    -0.00   0.999    -3.846287    3.842979
     sch_tot |   1.15e-06   2.69e-06     0.43   0.671    -4.13e-06    6.43e-06
       _cons |  -.1102337    .785408    -0.14   0.888    -1.649605    1.429138
------------------------------------------------------------------------------
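The two tables share the same coefficients and differ only in their standard errors, so the change in P-values can be reproduced by hand. The Python sketch below does this for two rows, using the rounded values displayed above (so the last digit may differ slightly from Stata's full-precision results); `wald_p` is an illustrative helper, not a Stata or library function.

```python
from math import erfc, sqrt

def wald_p(coef, se):
    """Two-sided p-value for z = coef / se under a standard normal reference."""
    return erfc(abs(coef / se) / sqrt(2))

# (coefficient, robust SE from fracreg, OIM SE from glm without vce(robust)),
# copied from the displayed output above.
rows = {
    "ddp":   (0.0033372, 0.0044366, 0.023125),
    "train": (0.0000243, 8.40e-06, 0.0000275),
}

for name, (coef, se_robust, se_oim) in rows.items():
    print(f"{name:>6}: robust z = {coef / se_robust:5.2f}, p = {wald_p(coef, se_robust):.3f}"
          f" | OIM z = {coef / se_oim:5.2f}, p = {wald_p(coef, se_oim):.3f}")
```

For train, the same coefficient goes from clearly significant with robust standard errors to unremarkable with OIM standard errors, which is why the choice of variance estimator needs to be spelled out.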

There are many other scientific and statistical issues not clear without further discussion and analysis, and I will raise just two:

Some predictors appear to be absolute counts or amounts, and it's not at all clear why absolute values are natural here.
Some of the predictors may need or benefit from transformation.
