Regression Analysis – Understanding Implications of Very Low but Statistically Significant Average Marginal Effects

Tags: interpretation, logistic, marginal-effect, regression, statistical-significance

I built a multivariate logistic regression model, which is largely a replication of a published paper (I just used different data). My regression table (with the coefficients reported as log odds) looks like this:

=============================================
                      Dependent variable:    
                  ---------------------------
                          DV.dum         
---------------------------------------------
IV.dum1                    1.205***          
                            (0.184)          
                                             
IV.dum2                    1.207***          
                            (0.185)          
                                             
IV.continuous1             0.001           
                            (0.001)          
                                             
IV.continuous2             0.001***          
                           (0.0002)          
                                             
IV.continuous3            -0.002***         
                           (0.0003)          
                                             
control.dum                24.595           
                           (285.040)         
                                             
control.continuous         -0.003***         
                           (0.0003)          
                                             
Constant                   -4.035***         
                            (0.137)          
                                             
---------------------------------------------
Observations                66,310           
Log Likelihood            -1,557.481         
Akaike Inf. Crit.          3,130.962         
=============================================
Note:             *p<0.1; **p<0.05; ***p<0.01

My results are very similar to those of the published paper, in terms of both the direction and the magnitude of the estimated effects. However, the authors only reported their results as log odds, and because the estimates point in the hypothesized direction, they concluded that the results confirmed their hypothesis. I like to transform the estimates of logit models into average marginal effects (AMEs), because they are easier to interpret. However, the AMEs I got are very low, almost zero:

Variable              AME             SE
IV.dum1               4.999893e-03    8.085180e-04
IV.dum2               5.008722e-03    8.136398e-04
IV.continuous1        2.587048e-06    2.740353e-06
IV.continuous2        3.324357e-06    7.552852e-07
IV.continuous3       -8.575452e-06    1.323206e-06
control.dum           1.020670e-01    1.182942e+00
control.continuous   -1.226762e-05    1.279210e-06

Now, I am wondering whether I overlooked something or whether I am interpreting these results incorrectly. To my understanding, an AME of 0.005 means that a one-unit increase in that variable increases the probability of the dependent variable being 1 by only 0.5 percentage points, which is not very much. I am wondering whether that means the variables do not really explain any variation in the dependent variable, even though most of them are statistically significant, especially since this AME is already one of the highest. So, does the model "suck" and the original authors just did not report it honestly, or am I not thinking straight here?
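
For reference, the way I read the AME of a dummy regressor such as IV.dum1 is as the sample average of the discrete change in the predicted probability, with the other covariates held at their observed values (for a continuous regressor, it is the average of the partial derivative of the predicted probability instead):

$$ \text{AME}_k = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{P}\left(Y_i = 1 \mid X_{ik} = 1,\, X_{i,-k}\right) - \hat{P}\left(Y_i = 1 \mid X_{ik} = 0,\, X_{i,-k}\right)\right] $$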

For the modeling, I used R's glm function for fitting generalized linear models. I used the margins function from the margins package to calculate the AMEs (see the vignette here for more info).
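
In case it is useful, this is roughly what the code looked like (a sketch only; the data frame name df and the formula simply mirror the table above):

library(margins)

# Logistic regression with the predictors from the table above (placeholder names)
fit <- glm(DV.dum ~ IV.dum1 + IV.dum2 + IV.continuous1 + IV.continuous2 +
             IV.continuous3 + control.dum + control.continuous,
           family = binomial(link = "logit"), data = df)
summary(fit)          # coefficients on the log-odds scale, as in the table

# Average marginal effects on the probability scale
ame <- margins(fit)
summary(ame)          # AME, SE, z, p and confidence interval per variable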

Best Answer

The heart of your question is in the title. Statistical significance does not mean practical significance, and this is a clear example of that. A $p$ value quantifies the probability of observing data at least as extreme as yours, given that the null hypothesis is true. When this probability is low, it only tells us that data like yours would be unlikely if the true effect were exactly zero. By itself that is usually not very informative, because most relationships between variables are not exactly zero anyway.

The reason your estimates are statistically significant is the massive sample size of more than sixty thousand observations. Such a large sample shrinks the standard errors of your estimates considerably, so even minor associations between $x$ and $y$ come out as "significant", for whatever that is practically worth. Examples of this can be found here, but I provide a regression-based version with a simple simulation in R.

set.seed(123)                        # reproducibility
n <- 66000                           # sample size comparable to the question
x <- rnorm(n)                        # predictor
e <- rnorm(n)                        # standard normal error term
b0 <- 0                              # intercept
b1 <- .01                            # tiny slope
y <- b0 + (b1*x) + e                 # data-generating process
cor(x,y)                             # near-zero correlation
fit <- lm(y ~ x)                     # simple linear regression
summary(fit)                         # tiny slope, yet statistically significant
plot(x,y,main="Simulated Example",
     xlab="X",
     ylab="Y")
abline(fit,col="darkred",lwd=3)      # fitted line is essentially flat

Here I have modeled the simple linear regression of

$$ y = 0 + .01x_1 + \epsilon $$

where the intercept is zero, the slope of $x_1$ is $.01$, and the error term is normally distributed with a mean of zero and a standard deviation of $1$. In essence, our scatterplot should show little association between $x$ and $y$, since for every one-unit increase in $x$, $y$ increases by only $.01$ on average.

If you inspect the correlation and the regression summary, there is very little relationship between $x$ and $y$, and yet the slope is still statistically significant with $n = 66,000$. The fitted regression line is essentially flat. So while the effect is statistically significant, it is of little practical relevance.

[Figure: scatterplot of the simulated data ("Simulated Example", X vs. Y) with the essentially flat fitted regression line in dark red.]
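
If you want to see the same phenomenon on the scale you are working with, you could repeat the exercise with a simulated logistic model and the margins package you already use. This is only a sketch: the intercept and slope below are placeholders chosen to mimic a rare outcome and a small effect, not your data.

set.seed(123)
n <- 66000
x <- rnorm(n)
p <- plogis(-4 + 0.1*x)            # rare outcome, small effect on the log-odds scale
y <- rbinom(n, 1, p)               # simulate the binary dependent variable
fit_logit <- glm(y ~ x, family = binomial)
summary(fit_logit)                 # with n this large, the slope tends to be "significant"
library(margins)
summary(margins(fit_logit))        # the AME is nonetheless tiny on the probability scale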
