Solved – heteroskedasticity and logistic regression

heteroscedasticity logistic regression

I have cross-sectional data and am using logistic regression. How do I check my data for heteroskedasticity and, if it is present, how do I deal with it in Stata?

I have come across a lot of information on linear regression, where heteroskedasticity is tested with the Breusch-Pagan test (command: hettest) or White's test (command: imtest) and dealt with by computing robust standard errors. However, there is much less information on this issue for logistic regression.

Best Answer

Except in a very technical sense (which @BigBendRegion's answer gets at), heteroskedasticity isn't a "thing" in a logistic regression model.

Heteroskedasticity is when the standard deviation of the errors around the regression line (roughly, the typical distance between the predicted Y value at a given X value and the actual Y values in your dataset for cases with that X value) gets bigger or smaller as X increases. Now, many people (myself included) would argue that heteroskedasticity isn't even that big of a problem for LINEAR regression, except when it's caused by other, more serious issues (like nonlinearity or omitted variable bias).
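To make that definition concrete, here is a minimal simulation sketch in Python (not Stata) of a linear model whose error spread grows with X. The coefficients, the sample size, and the choice of an error SD proportional to X are all assumptions made purely for illustration; nothing here comes from the question's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process: the error SD is proportional to x,
# so the scatter around the regression line widens as x increases.
x = rng.uniform(1, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x)

# Fit a straight line by least squares; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# Compare the residual spread in the lower and upper halves of x.
median_x = np.median(x)
sd_low = resid[x < median_x].std(ddof=1)
sd_high = resid[x >= median_x].std(ddof=1)

print(f"residual SD, low x:  {sd_low:.2f}")
print(f"residual SD, high x: {sd_high:.2f}")
```

Under these assumptions the residual SD in the upper half of X should come out clearly larger than in the lower half, which is exactly the pattern tests like Breusch-Pagan look for in a linear model.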

But this whole concept doesn't make sense in logit because logit models don't even HAVE error terms, or rather they don't have error terms that come from the data.

To oversimplify greatly, what a logit model actually "does" is run an OLS model on an unobserved latent variable (call it y*) that represents the "propensity" to do whatever it is your binary variable Y is measuring (we assume that people with a y* over some arbitrary threshold get a Y of 1 and everyone else gets a 0). Obviously we don't know what y* looks like, so in order to specify this model we assume that the errors in this OLS model have a logistic distribution (hence the name of the model) with a standard deviation of $\pi/\sqrt{3}$ (the probit model assumes they are normally distributed with a standard deviation of 1). Through some calculus we use this assumption about the distribution of the errors in y* to get us to the logit model of Y itself. This means that the logit model doesn't have an error term, because the distribution of the errors is built into the assumptions of the model itself. So it doesn't make sense to talk about whether the errors get bigger or smaller as X increases, which is what heteroskedasticity is.
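The latent-variable story above can be checked with a short simulation. This is only a sketch under stated assumptions: the coefficients b0 and b1, the threshold of 0, and the sample size are hypothetical values chosen for illustration. It draws standard logistic errors, builds y*, thresholds it to get Y, and then compares the share of Y = 1 among cases with X near 1 to the logistic formula $1/(1+e^{-(b_0+b_1 x)})$.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
b0, b1 = -0.5, 1.2  # hypothetical coefficients, for illustration only

x = rng.normal(size=n)
# Standard logistic errors: the scale is fixed by assumption, not estimated.
eps = rng.logistic(loc=0.0, scale=1.0, size=n)

# Latent "propensity" and the observed binary outcome (threshold assumed at 0).
y_star = b0 + b1 * x + eps
y = (y_star > 0).astype(int)

# The SD of the simulated errors should be close to pi / sqrt(3).
sd_eps = eps.std()
sd_theory = np.pi / np.sqrt(3)

# Empirical P(Y = 1) for cases with x close to 1, versus the logit formula.
near_one = np.abs(x - 1.0) < 0.05
p_emp = y[near_one].mean()
p_model = 1.0 / (1.0 + np.exp(-(b0 + b1 * 1.0)))

print(f"error SD: simulated {sd_eps:.3f}, theoretical {sd_theory:.3f}")
print(f"P(Y=1 | x near 1): simulated {p_emp:.3f}, logit formula {p_model:.3f}")
```

The point of the exercise is that the error scale never shows up as something to estimate: it is fixed in advance by the distributional assumption, and only the probability of Y = 1 is modelled.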