Meaning of p-value of logistic regression model variables

Tags: interpretation, logistic, p-value, r, regression

So I'm working with logistic regression models in R. Though I'm still new to statistics, I feel like I've got a bit of an understanding of regression models by now, but there's still something that bothers me:

Looking at the linked picture, you see the summary R prints for an example model I created. The model tries to predict whether an email in the dataset will be refound or not (binary variable isRefound). The dataset contains two variables closely related to isRefound, namely next24 and next7days; these are also binary and tell whether a mail will be clicked in the next 24 hours / next 7 days from the current point in the logs.

The high p-values should indicate that the impact these variables have on the model's predictions is pretty random, shouldn't they?
Based on this, I don't understand why the precision of the model's predictions drops below 10% when these two variables are left out of the model formula. If these variables show such low significance, why does removing them from the model have such a big impact?

Best regards and thanks in advance,
Rickyfox

[Image: summary output R prints for the example model]


EDIT:

First I removed only next24, which should have a low impact because its coefficient is pretty small. As expected, little changed; not gonna upload a pic for that.

Removing next7days, though, had a big impact on the model: AIC went up by 200k, precision dropped to 16% and recall dropped to 73%.

[Image: summary output R prints for the model without next7days]

Best Answer

Basically, it looks like you have a multicollinearity problem. There is a lot of material available about this, starting on this website or on Wikipedia.

Briefly, the two predictors appear to be genuinely related to your outcome but they are also probably highly correlated with each other (note that with more than two variables, it's still possible to have multicollinearity issues without strong bivariate correlations). This does of course make a lot of sense: All emails clicked within 24 hours have also been clicked within 7 days (by definition) and most emails have probably not been clicked at all (not in 24 hours and not in 7 days).
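To make the nesting argument concrete, here is a minimal sketch in Python (standard library only) that simulates two indicators built the way next24 and next7days are described. The click rates (10% within 7 days, 80% of those within 24 hours), the sample size and the seed are made-up assumptions for illustration, not values taken from your data.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation; for two 0/1 variables this is the phi coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate_clicks(n, p_7days=0.10, p_24_given_7days=0.80, seed=1):
    """Simulate nested indicators: a click within 24h implies a click within 7 days.

    The rates are hypothetical, chosen only to mimic "most emails are never clicked".
    """
    rng = random.Random(seed)
    next24, next7days = [], []
    for _ in range(n):
        clicked_7 = 1 if rng.random() < p_7days else 0
        # An email can only be clicked within 24h if it was clicked within 7 days.
        clicked_24 = 1 if clicked_7 and rng.random() < p_24_given_7days else 0
        next24.append(clicked_24)
        next7days.append(clicked_7)
    return next24, next7days

next24, next7days = simulate_clicks(10_000)
r = pearson(next24, next7days)
print(f"correlation(next24, next7days) = {r:.2f}")
```

Under these assumed rates the population correlation works out to about 0.88, so the two indicators carry largely the same information even though neither is an exact copy of the other.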

One way this shows up in the output you presented is through the incredibly large standard errors and confidence intervals for the relevant coefficients (judging by the fact that you are using bigglm and that even tiny coefficients are highly significant, your sample size should be more than enough to get good estimates). Other things you can do to detect this type of problem: look at pairwise correlations, remove only one of the suspect variables (as suggested by @Nick Sabbe), or test the significance of both variables jointly.
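One common way to test two coefficients jointly is a likelihood-ratio test that compares the model with both variables against the model with neither. The sketch below is in Python with the standard library only; the function name and the two log-likelihood values are hypothetical placeholders, not numbers from your output. It relies on the fact that a chi-square distribution with 2 degrees of freedom has the closed-form tail probability exp(-x/2).

```python
import math

def joint_lr_test_2df(loglik_full, loglik_reduced):
    """Likelihood-ratio test for dropping exactly two coefficients at once.

    loglik_full:    log-likelihood of the model containing both variables
    loglik_reduced: log-likelihood of the nested model without them
    Returns (test statistic, p-value) using the chi-square(2) tail exp(-x/2).
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    if stat < 0:
        raise ValueError("the full model cannot fit worse than a nested model")
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Hypothetical log-likelihoods, for illustration only.
stat, p_value = joint_lr_test_2df(loglik_full=-5000.0, loglik_reduced=-5040.0)
print(f"LR statistic = {stat:.1f}, joint p-value = {p_value:.3g}")
```

If the joint p-value is tiny while each individual p-value is large, that pattern points to the two variables sharing their explanatory power rather than having none.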

More generally, a high p-value does not mean that the effect is small or random, only that the data do not provide enough evidence that the coefficient differs from 0. The effect can also be very large; you just don't know (either because the sample size is too small or because there is some other issue with the model).
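A small Python sketch (standard library only) of why this is so: the p-value in a regression summary is usually based on the Wald statistic z = estimate / standard error, so it reflects how precisely the coefficient is estimated, not how big it is. The two coefficient/standard-error pairs below are invented for illustration and are not read from your output.

```python
import math

def wald_p_value(coef, std_err):
    """Two-sided p-value for the Wald z statistic coef / std_err."""
    z = coef / std_err
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical (estimate, standard error) pairs on the log-odds scale.
examples = {
    "huge but imprecise": (15.0, 400.0),
    "tiny but precise": (0.02, 0.002),
}

for label, (coef, std_err) in examples.items():
    p = wald_p_value(coef, std_err)
    print(f"{label}: coef = {coef}, se = {std_err}, p = {p:.3g}")
```

The first pair has an enormous estimate but a p-value close to 1, while the second has a negligible estimate and a p-value close to 0, which is the same contrast as between your suspect variables and the tiny but highly significant coefficients.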
