Solved – Understanding standard errors in logistic regression

interpretation, logistic, self-study, standard error, stata

I am having trouble understanding the meaning of the standard errors in my thesis analysis and whether they indicate that my data (and the estimates) are not good enough.

I am performing an analysis with Stata on the immigrant-native gap in school performance (dependent variable = good / bad results), controlling for a variety of regressors. I used both logit and OLS, and I clustered the standard errors at the school level.

The regressors which are giving me trouble are some interaction terms between a dummy for country of origin and a dummy for having foreign friends (I included both base-variables in the model as well). In the logit estimation, more than one of the country*friend variables have a SE greater than 1 (up to 1.80 or so), and some of them are significant as well. This does not happen with the OLS.

I am really confused about how to interpret this. I have always understood that high standard errors are not really a good sign, because they mean that your data are too spread out. But still (some of) the coefficients are significant, which works perfectly for me because it is the result I was looking for. Can I just ignore the SE? Or does it raise a red flag regarding my results? I usually just ignore the SE in regressions (I know, it is not really what one should do), but I can't recall any other example with such huge SE values.

Best Answer

I think the first thing you need to ensure is that you're not comparing apples to orangutans. Then we will discuss standard errors, statistical significance, and model selection.

Here's how you might compare OLS/LPM and logit coefficients for dummy-dummy interactions. We will model union membership as a function of race and education (both categorical) for US women from the NLS88 survey.

First, we will use OLS with factor variable notation for the interactions:

. sysuse nlsw88, clear
(NLSW, 1988 extract)

. reg union i.race##i.collgrad

      Source |       SS       df       MS              Number of obs =    1878
-------------+------------------------------           F(  5,  1872) =    7.02
       Model |  6.40214176     5  1.28042835           Prob > F      =  0.0000
    Residual |  341.434386  1872  .182390164           R-squared     =  0.0184
-------------+------------------------------           Adj R-squared =  0.0158
       Total |  347.836528  1877  .185315146           Root MSE      =  .42707

-------------------------------------------------------------------------------------
              union |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
               race |
             black  |   .0799445   .0250534     3.19   0.001     .0308089    .1290801
             other  |   .1157454   .1076307     1.08   0.282    -.0953433    .3268342
                    |
           collgrad |
      college grad  |   .0975234   .0261143     3.73   0.000     .0463072    .1487395
                    |
      race#collgrad |
black#college grad  |   .0415079   .0563381     0.74   0.461    -.0689841        .152
other#college grad  |  -.0350234   .1867622    -0.19   0.851    -.4013073    .3312606
                    |
              _cons |   .1967546   .0136007    14.47   0.000     .1700804    .2234288
-------------------------------------------------------------------------------------

For instance, the interaction term says that black women who also graduated from college are 4.15 percentage points more likely to be in a union than the two main effects alone would imply.

Now we fit a logit model:

. logit union i.race##i.collgrad, nolog

Logistic regression                               Number of obs   =       1878
                                                  LR chi2(5)      =      33.33
                                                  Prob > chi2     =     0.0000
Log likelihood = -1029.9582                       Pseudo R2       =     0.0159

-------------------------------------------------------------------------------------
              union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
               race |
             black  |   .4458082   .1361797     3.27   0.001      .178901    .7127154
             other  |   .6182459   .5452764     1.13   0.257    -.4504762    1.686968
                    |
           collgrad |
      college grad  |   .5320064   .1397767     3.81   0.000     .2580491    .8059637
                    |
      race#collgrad |
black#college grad  |   .0885629   .2791468     0.32   0.751    -.4585548    .6356807
other#college grad  |  -.2543746    .918575    -0.28   0.782    -2.054748    1.545999
                    |
              _cons |  -1.406703   .0801078   -17.56   0.000    -1.563712   -1.249695
-------------------------------------------------------------------------------------

The logit index function coefficients are not particularly meaningful since they are not effects on the probability of union membership. The sign and the significance might tell you something, but the magnitude of the effect is not clear. Also note that the standard errors are large, like in your own data. For instance, the SE of the college graduate of other race coefficient is almost 1.

To get something comparable to OLS, we will use margins with the contrast operator:

. margins r.race##r.collgrad

Contrasts of predictive margins

Model VCE    : OIM

Expression   : Pr(union), predict()

----------------------------------------------------------------------------------------
                                                     |         df        chi2     P>chi2
-----------------------------------------------------+----------------------------------
                                                race |
                                   (black vs white)  |          1       14.34     0.0002
                                   (other vs white)  |          1        1.20     0.2725
                                              Joint  |          2       15.14     0.0005
                                                     |
                                            collgrad |          1       19.09     0.0000
                                                     |
                                       race#collgrad |
(black vs white) (college grad vs not college grad)  |          1        0.44     0.5085
(other vs white) (college grad vs not college grad)  |          1        0.03     0.8666
                                              Joint  |          2        0.48     0.7869
----------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------
                                                     |            Delta-method
                                                     |   Contrast   Std. Err.     [95% Conf. Interval]
-----------------------------------------------------+------------------------------------------------
                                                race |
                                   (black vs white)  |   .0901999   .0238201      .0435134    .1368864
                                   (other vs white)  |   .1070922   .0976013     -.0842029    .2983873
                                                     |
                                            collgrad |
                 (college grad vs not college grad)  |    .108149   .0247526      .0596347    .1566633
                                                     |
                                       race#collgrad |
(black vs white) (college grad vs not college grad)  |    .041508   .0627785     -.0815355    .1645515
(other vs white) (college grad vs not college grad)  |  -.0350233   .2084485     -.4435749    .3735282
------------------------------------------------------------------------------------------------------

These are pretty close to the OLS effects. For instance, the interaction contrast for black women who graduated from college is also 4.15 percentage points according to the logit model. The SEs are of similar size: somewhat smaller for the main-effect contrasts and somewhat larger for the interactions.
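Because this model is saturated (one parameter per race-by-education cell), the interaction contrast can be checked by hand. The following is a minimal Python sketch, not part of the Stata session above: it copies the logit coefficients from the output, converts each cell's index value to a probability with the inverse logit, and forms the difference in differences. The helper name `inv_logit` and the dictionary layout are my own choices for illustration.

```python
import math

# Logit index function coefficients copied from the Stata output above.
b = {
    "cons": -1.406703,
    "black": 0.4458082,
    "collgrad": 0.5320064,
    "black_x_collgrad": 0.0885629,
}

def inv_logit(x):
    """Map a logit index value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Predicted probability of union membership in each of the four cells.
p_white_nongrad = inv_logit(b["cons"])
p_white_grad = inv_logit(b["cons"] + b["collgrad"])
p_black_nongrad = inv_logit(b["cons"] + b["black"])
p_black_grad = inv_logit(
    b["cons"] + b["black"] + b["collgrad"] + b["black_x_collgrad"]
)

# Interaction on the probability scale: the college premium for black women
# minus the college premium for white women.
interaction = (p_black_grad - p_black_nongrad) - (p_white_grad - p_white_nongrad)

print(round(p_white_nongrad, 4))  # baseline cell, compare with the OLS constant
print(round(interaction, 4))      # compare with the OLS interaction coefficient
```

The two printed values should line up with the OLS constant (.1967546) and the OLS interaction (.0415079), which is what you would expect when both models fit the same cell means.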

Sometimes you can't run the margins command because you don't have the data. All you have are the logit coefficients from someone's paper. While I said they were not particularly meaningful in their raw form, you can transform the logit index function coefficients into a multiplicative effect on the odds by exponentiating them, which is easy enough with a calculator. For example, the index function coefficient for black college graduates was .0885629. If I exponentiate it, I get $\exp(.0885629)=1.092603$. This is an odds ratio: it tells me that the odds of union membership for black college graduates are scaled by a factor of 1.09 relative to baseline odds of $\exp(-1.406703)=0.24494955$ (the baseline is the exponentiated constant from the logit). So this means that the odds for black college graduates will be $0.24\cdot 1.09$, or about $0.26$ (odds and probabilities are roughly comparable only when both are small). OLS, and logit with margins, give the additive effect on the probability instead, so there we get about $19.67+4.15=23.82$%. That's pretty darn close. It won't always work out so nicely.
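The calculator arithmetic can be sketched in a few lines of Python. This is only an illustration of the exponentiation step using the two coefficients quoted above; the helper `odds_to_prob` is a name I made up, and it is included because exponentiated logit coefficients live on the odds scale, so a conversion is needed before comparing them with percentages.

```python
import math

b_cons = -1.406703              # logit constant (white, not college grad)
b_black_x_collgrad = 0.0885629  # logit interaction coefficient

def odds_to_prob(odds):
    """Convert odds p/(1-p) back to the probability p."""
    return odds / (1.0 + odds)

# Exponentiating a logit coefficient gives an odds ratio.
odds_ratio = math.exp(b_black_x_collgrad)

# Exponentiating the constant gives the baseline odds.
baseline_odds = math.exp(b_cons)
baseline_prob = odds_to_prob(baseline_odds)

# Baseline odds scaled by the interaction odds ratio alone (main effects
# are deliberately left out here, as in the text above).
scaled_odds = baseline_odds * odds_ratio
scaled_prob = odds_to_prob(scaled_odds)

print(f"odds ratio     = {odds_ratio:.4f}")
print(f"baseline odds  = {baseline_odds:.4f}, probability = {baseline_prob:.4f}")
print(f"scaled odds    = {scaled_odds:.4f}, probability = {scaled_prob:.4f}")
```

The baseline probability recovered this way should match the OLS constant, since both describe the same reference cell.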

Stata will give you exponentiated coefficients when you specify the odds ratio option, or:

. logit union i.race##i.collgrad, or nolog

or just use logistic:

. logistic union i.race##i.collgrad, nolog

I learned about these tricks from Maarten L. Buis. There are lots of examples with interactions of various sorts and nonlinear models at that link.

In my toy example, I did not cluster my errors, but that doesn't change the main thrust of these results. Some people don't like clustered standard errors in logit/probits because if the model's errors are heteroscedastic the parameter estimates are inconsistent.

After that long detour, we finally get to statistical significance. In all the models above (OLS, logit index function, logit margins, and OR logit), all the interactions are statistically insignificant (though the main effects generally are not). The standard errors are large compared to the estimates, so the data is consistent with the effects on all scales being zero (the confidence intervals include zero in the additive case and 1 in the multiplicative). If we surveyed enough women, it is possible that we would be able to detect some statistically significant interactions. The statistical significance depends in part on the sample size. If you don't have too many Bhutanese students in your data, it will be hard to detect even the main effect, much less the foreign friends interaction. On the other hand, if the effect is huge, you might be able to detect it with only a few students. Perhaps you can try grouping students by continent instead of country, though too much data-driven variable transformation is to be avoided.
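To make the link between a coefficient, its standard error, and significance concrete, here is a small Python sketch that rebuilds the z statistic, p-value, and 95% confidence interval for the black#college grad row of the logit output, and then moves the interval to the odds ratio scale. It assumes the usual normal (Wald) approximation that Stata reports for logit; the variable names are mine.

```python
import math
from statistics import NormalDist

coef = 0.0885629  # black#college grad, logit index scale
se = 0.2791468    # its standard error from the logit output

std_normal = NormalDist()  # standard normal distribution

# Wald z statistic and two-sided p-value.
z = coef / se
p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))

# 95% confidence interval on the index (log-odds) scale.
crit = std_normal.inv_cdf(0.975)
ci_low, ci_high = coef - crit * se, coef + crit * se

# The same interval on the multiplicative (odds ratio) scale.
or_low, or_high = math.exp(ci_low), math.exp(ci_high)

print(f"z = {z:.2f}, p = {p_value:.3f}")
print(f"95% CI (log-odds):   [{ci_low:.4f}, {ci_high:.4f}]")
print(f"95% CI (odds ratio): [{or_low:.4f}, {or_high:.4f}]")
```

The log-odds interval straddles 0 and the odds ratio interval straddles 1, which is the same statement on two scales. By the same arithmetic, a coefficient with an SE of 1.80 has to exceed roughly 3.5 in absolute value to be significant at the 5% level, i.e. an odds ratio above roughly 34; estimates that extreme on a dummy-dummy interaction can be a sign of very sparse cells, so they are worth a second look rather than something to ignore.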

Generally, OLS and non-linear models will give you similar results. If they don't, as may be the case with your data, I think you should report both and let your audience pick. Some people believe OLS/LPM is more robust to departures from assumptions (like heteroscedasticity); others disagree vehemently. You can and should justify a preferred model in various ways, but that's a whole question in itself. Personally, I would report both clustered OLS and non-clustered logit marginal effects (unless there's little difference between the clustered and non-clustered versions). You can also use an LM test to rule out heteroscedasticity.

Finally, with dummy-dummy interactions, I believe the sign and the significance of the index function interaction correspond to the sign and the significance of the marginal effects. For continuous-continuous interactions (and perhaps continuous-dummy as well), that is generally not the case in non-linear models like the logit.

Related Solutions

Solved – How to interpret decreasing AIC but higher standard errors in model selection

The AIC and the standard error measure different things, and if you are trying to minimize the standard error, a cross-validation approach may be better to use. Another alternative is the Bayesian information criterion (BIC), which penalizes additional parameters more heavily than the AIC in all but very small samples and so tends to select more parsimonious models.

Also, here's a good article comparing the relations between various evaluation metrics for supervised machine learning: Data mining in metric space: an empirical analysis of supervised learning performance criteria.
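As a rough illustration of how the two criteria differ, the sketch below computes both from a log likelihood using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The numbers are borrowed from the logit output earlier on this page purely as an example (they are not from the AIC question itself), and counting k = 6 estimated parameters is my reading of that output.

```python
import math

log_lik = -1029.9582  # log likelihood from the logit output above
k = 6                 # estimated parameters: 5 slopes plus the constant
n = 1878              # number of observations

# Both criteria reward fit (-2 ln L) and penalize parameters;
# only the size of the per-parameter penalty differs.
aic = 2 * k - 2 * log_lik
bic = k * math.log(n) - 2 * log_lik

print(f"AIC = {aic:.3f}")
print(f"BIC = {bic:.3f}")
print(f"penalty per parameter: AIC = 2, BIC = {math.log(n):.3f}")
```

Neither number says anything about how precisely an individual coefficient is estimated, which is why a lower AIC and larger standard errors can coexist.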

Solved – Interpretation of standard error of ARIMA parameters

The standard errors of estimated AR parameters have the same interpretation as the standard error of any other estimate: they are (an estimate of) the standard deviation of its sampling distribution.

The idea is that there is some unknown but fixed underlying data generating process (DGP), governed by an unknown but fixed ARIMA process. The specific time series you observe is a single realization of this process. If you now went and sampled many time series arising from this DGP, then they would all look somewhat different, because of different innovations. However, you could fit an ARIMA model to all of them. Then you would of course get different AR parameter estimates for each time series.

The standard error of the AR estimates is an estimate of the standard deviation of these AR estimates.

A simulation might be helpful. Below, I'll use an AR(2) model with parameters $(1.0,-0.2)$. I'll generate a time series of length 100 using this model, then fit an AR(2) model, store the AR parameter estimates - and repeat this 10,000 times. Finally, I plot histograms of the parameter estimates, plus the actual values as red vertical lines - and then compare the standard deviations of the AR parameter estimates against the (average of the) estimated standard errors. And the two match up.

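nn <- 100                            # length of each simulated series
n.sims <- 10000                      # number of simulated series
true.model <- list(ar=c(1.0,-0.2))   # true AR(2) coefficients

# one row per simulation: AR estimates and their estimated standard errors
params <- ses <- matrix(NA,nrow=n.sims,ncol=length(true.model$ar))
for ( ii in 1:n.sims ) {
    set.seed(ii)
    series <- arima.sim(model=true.model,n=nn)
    # fit an AR(2) without a mean, matching the zero-mean simulation
    model <- arima(series,order=c(2,0,0),include.mean=FALSE)
    params[ii,] <- coefficients(model)
    ses[ii,] <- sqrt(diag(model$var.coef))
}

# histograms of the estimates, with the true values as red vertical lines
opar <- par(mfrow=c(1,2))
for ( jj in seq_along(true.model$ar) ) {
    hist(params[,jj],col="grey",xlab="",main=paste0("AR(",jj,") parameter"))
    abline(v=true.model$ar[jj],lwd=2,col="red")
}
par(opar)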

apply(params,2,sd)
# [1] 0.09844388 0.09795008
apply(ses,2,mean)
# [1] 0.09754488 0.09833490

Note that I simulate with a zero mean and explicitly tell arima() to not use a mean. And that the entire exercise crucially depends on the assumption that we know the ARIMA orders with certainty! If we first need to select the correct order, then everything will be biased, and the standard errors lose their interpretation. (Yes, this kind of makes all this a somewhat theoretical and academic exercise.)

If you want to dive more deeply into the maths, any mathematical time series textbook should do well. (Anything with "business" in the title will likely gloss over these details.) I recently skimmed Time Series: Theory and Methods by Brockwell and Davis (2006), which looked pretty good, but I can't recall offhand whether this topic was treated at any depth there.
