Solved – Estimates diverging using continuous probabilities in logistic regression

logistic, stata

When fitting a GLM (or, in my case, a GEE) with a binomial family and logit link, I notice that my estimates diverge when the response variable (a continuous probability ranging from 0 to 1) includes 0 or 1 as observed values (i.e. $0 \le y \le 1$), whereas models whose response never takes the values 0 or 1 (i.e. $0 < y < 1$) converge just fine.

Question:

  • Why does this happen?

When running a logistic regression model with $0 < y < 1$, the model runs fine, as does the model when the response variable is dichotomous 0/1.

I suspect the following: say I have observations with $0 < y \le 1$. In this case, the algorithm sees my ones but not any zeros, and then craps out saying "some groups have fewer than x observations", the group in question being the one that is supposed to contain the zeros. One way I can check this is sketched below.
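A minimal Stata sketch of that check, assuming a fractional response gradrate and a single covariate x (both hypothetical names):

```stata
* how many observations sit on the boundary?
count if gradrate == 0 | gradrate == 1

* refit with and without the boundary values
glm gradrate x if gradrate > 0 & gradrate < 1, family(binomial) link(logit)
glm gradrate x, family(binomial) link(logit)
```

If only the second call fails, the 0s and 1s are implicated.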

Secondary question:

  • If I exclude observations that are 0 or 1 in order to fit my models, am I biasing my results?

Here's an example: my response variable is graduation rate expressed as a percentage. For the logistic regression models, there are apparently schools that have 100% graduation rate (seen as a 1 in my dataset). Would it be a valid strategy to drop these schools from the model, and what are the implications in interpretation? Is this akin to dropping outliers willy-nilly?

Best Answer

It shouldn't happen if the Taylor-series approximation the fitting algorithm uses (the IRLS step) is behaving well; I'd suggest starting it at different initial values. A good choice is to set the intercept equal to the logit of the total proportion in your sample, and all other betas to zero. So you have $p_{i}=\frac{y_{i}}{n_{i}}$ as the observed proportion for each unit. Just set $$\beta_0=\operatorname{logit}\left(\frac{\overline{y}}{\overline{n}}\right)$$

and all other betas equal to zero as your starting values. This should stop the 0s and 1s giving you problems.
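Since the question is tagged Stata, here is a minimal sketch of that starting-value strategy using glm's from() option; gradrate and x are hypothetical names, and with a fractional response (each $n_i=1$) $\overline{y}/\overline{n}$ is just the sample mean of the proportions:

```stata
* intercept start value = logit of the overall mean proportion
summarize gradrate, meanonly
local b0 = logit(r(mean))

* starting vector: covariates first, _cons last (one covariate x here)
matrix init = (0, `b0')
glm gradrate x, family(binomial) link(logit) from(init, copy)
```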

Another way to stabilise your results is the good old (+1) and (+2) rule, which is similar to ridging in ordinary regression. To do this you compute the adjusted proportions

$$\tilde{p}_{i}=\frac{y_{i}+1}{n_{i}+2}$$

and regress $\operatorname{logit}(\tilde{p}_{i})$ on $X$ directly using OLS regression (i.e. no iterations). This is shown to be a generalised MLE in this paper.
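For concreteness, a minimal Stata sketch of the (+1)/(+2) adjustment, assuming a count of successes y out of n trials and a covariate x (all hypothetical names); a fractional response corresponds to $n_i = 1$:

```stata
* +1/+2 adjusted proportions, then OLS on the logit scale
gen double ptilde = (y + 1) / (n + 2)
gen double elogit = logit(ptilde)   // finite even when y == 0 or y == n
regress elogit x
```

Because $\tilde{p}_{i}$ never reaches 0 or 1, the logit is always finite, so the boundary observations no longer cause trouble.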
