Solved – How to describe and present the issue of perfect separation


Folks who work with logistic regression are familiar with the issue of perfect separation: if you have a variable, specific values of which are associated with only one of the two outcomes (say a binary $x$ such that all observations with $x=1$ have outcome = 1), the likelihood has no finite maximum and the maximum likelihood estimates run off to infinity. glm in R does not necessarily handle this well, as its perfect-prediction warning can show up for reasons other than actual perfect prediction/separation. logit in Stata identifies such variables and the problematic values, and drops them from the analysis.
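To make this concrete, here is a minimal R sketch with made-up data showing the kind of pattern I mean (one level of a binary predictor is observed only with outcome = 1):

```r
# Made-up data: x = 1 occurs only together with y = 1 (but y = 1 also occurs with x = 0)
y <- c(1, 1, 1, 1, 0, 0, 0, 0, 1, 0)
x <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
fit <- glm(y ~ x, family = binomial)
summary(fit)  # the coefficient and standard error for x are enormous,
              # and glm may additionally warn about fitted probabilities of 0 or 1
```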

My question is not what to do if you have perfect separation. That I can handle by recoding my variables (they are all categorical, so I can simply combine categories), or with the Firth version of logistic regression if I want to be fancy.
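For what it's worth, the Firth route is a one-liner in R if the logistf package is available; its penalized likelihood keeps the estimate finite even with separated data (a sketch reusing the made-up data from above):

```r
# Firth's bias-reduced logistic regression via the logistf package (assumed installed)
dat <- data.frame(y = c(1, 1, 1, 1, 0, 0, 0, 0, 1, 0),
                  x = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0))
fit_firth <- logistf::logistf(y ~ x, data = dat)
summary(fit_firth)  # finite, shrunken coefficient for x instead of a runaway estimate
```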

Instead, I wonder what the common ways are of describing this. I have a data set of circa 100 patients, about 50% of whom are "positive", and some categories of the demographic variables produce this perfect prediction. Let's just say that all 7 green-eyed people have a "positive" outcome. This may be a small-sample peculiarity that would disappear if I had a sample size of 1,000 and 70 green-eyed people, but it may also be clinically meaningful: in that larger sample, 60 out of 70 green-eyed people might still turn out "positive", which would correspond to a high odds ratio.

So it's nice to say that I used a Bayesian or some other shrinkage method, but in describing how I got there I would need to admit that I had perfect prediction/separation and had to reach for a more sophisticated technique to get any results at all. What would be good language to use here?

Best Answer

While performing my excavation activities on no-answer questions, I came across this very sensible one, to which, I guess, the OP has by now found an answer.
But I realized that I had various questions of my own regarding the issue of perfect separation in logistic regression, and a (quick) search of the literature did not seem to answer them. So I decided to start a little research project of my own (probably re-inventing the wheel), and with this answer I want to share some of its preliminary results. I believe these results contribute towards an understanding of whether the issue of perfect separation is a purely "technical" one, or whether it can be given a more intuitive description/explanation.

My first concern was to understand the phenomenon in algorithmic terms, rather than through the general theory behind it: under what conditions will the maximum likelihood estimation approach "break down" when fed a data sample that contains a regressor for which the phenomenon of perfect separation exists?

Preliminary results (theoretical and simulated) indicate that:
1) It matters whether a constant term is included in the logit specification.
2) It matters whether the regressor in question is dichotomous (in the sample), or not.
3) If dichotomous, it may matter whether it takes the value $0$ or not.
4) It matters whether other regressors are present in the specification or not.
5) It matters how the above 4 issues are combined.

I will now present a set of sufficient conditions for perfect separation to make the MLE break down. This is unrelated to whether the various statistical software packages warn about the phenomenon: they may do so by scanning the data sample before attempting maximum likelihood estimation at all. I am concerned with the cases where maximum likelihood estimation will actually begin, and with when it will break down in the process.

Assume a "usual" binary-choice logistic regression model

$$P(Y_i =1 \mid x_i, \mathbf z_i) = \Lambda (g(\beta_0,x_i, \mathbf z_i)), \;\; g(\beta_0,x_i, \mathbf z_i) = \beta_0 +\beta_1x_i + \mathbf z_i'\boldsymbol \gamma$$

$X$ is the regressor with perfect separation, while $\mathbf Z$ is a collection of other regressors that are not characterized by perfect separation. Also

$$\Lambda (g(\beta_0,x_i, \mathbf z_i)) = \frac 1{1+e^{-g(\beta_0,x_i, \mathbf z_i)}}\equiv \Lambda_i$$

The log-likelihood for a sample of size $n$ is

$$\ln L=\sum_{i=1}^{n}\left[y_i\ln(\Lambda_i)+(1-y_i)\ln(1-\Lambda_i)\right]$$
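For readers who prefer code to symbols, this objective is easy to write down directly; a sketch in R, assuming the design matrix X carries a leading column of ones for the constant term:

```r
# Log-likelihood of the logit model, coded directly from the formula above
loglik <- function(beta, y, X) {
  eta    <- as.vector(X %*% beta)   # g(beta0, x_i, z_i); X includes a column of 1s
  lambda <- 1 / (1 + exp(-eta))     # Lambda_i
  sum(y * log(lambda) + (1 - y) * log(1 - lambda))
}
```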

The MLE will be found by setting the derivatives equal to zero. In particular we want

$$ \sum_{i=1}^{n}(y_i-\Lambda_i) = 0 \tag{1}$$

$$\sum_{i=1}^{n}(y_i-\Lambda_i)x_i = 0 \tag{2}$$

The first equation comes from taking the derivative with respect to the constant term, the second from taking the derivative with respect to the coefficient on $X$.
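As a quick sanity check (not part of the argument), one can verify on a well-behaved, non-separated sample that the coefficients returned by glm satisfy these two conditions:

```r
# On non-separated data, the fitted glm coefficients satisfy the score equations (1) and (2)
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 + x))
fit <- glm(y ~ x, family = binomial)
sum(y - fitted(fit))         # equation (1): essentially zero
sum((y - fitted(fit)) * x)   # equation (2): essentially zero
```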

Assume now that in all cases where $y_i =1$ we have $x_i = a_k$, and that $x_i$ never takes the value $a_k$ when $y_i=0$. This is the phenomenon of complete separation, or "perfect prediction": if we observe $x_i = a_k$ we know that $y_i=1$, and if we observe $x_i \neq a_k$ we know that $y_i=0$. This holds irrespective of whether, in theory or in the sample, $X$ is discrete or continuous, dichotomous or not. Note also that this is a sample-specific phenomenon: we do not argue that it will hold in the population. But the specific sample is what we have in our hands to feed the MLE.
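At the level of the sample, checking for this kind of perfect prediction with a single regressor is just a cross-tabulation; a sketch in R (my own helper, not how any particular package does it):

```r
# Return the values of x that co-occur with only one of the two outcomes in the sample
sep_values <- function(x, y) {
  tab <- table(x, y)
  rownames(tab)[rowSums(tab > 0) == 1]
}
# e.g. sep_values(eye_colour, outcome) would return "green" in the question's example
```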

Now denote the absolute frequency of $y_i =1$ by $n_y$:

$$n_y \equiv \sum_{i=1}^ny_i = \sum_{y_i=1}y_i \tag{3}$$

We can then re-write eq $(1)$ as

$$n_y = \sum_{i=1}^n\Lambda_i = \sum_{y_i=1}\Lambda_i+\sum_{y_i=0}\Lambda_i \Rightarrow n_y - \sum_{y_i=1}\Lambda_i = \sum_{y_i=0}\Lambda_i \tag{4}$$

Turning to eq. $(2)$ we have

$$\sum_{i=1}^{n}y_ix_i -\sum_{i=1}^{n}\Lambda_ix_i = 0 \Rightarrow \sum_{y_i=1}y_ia_k+\sum_{y_i=0}y_ix_i - \sum_{y_i=1}\Lambda_ia_k-\sum_{y_i=0}\Lambda_ix_i =0$$

Using $(3)$, and noting that $y_ix_i = 0$ whenever $y_i=0$, we have $$n_ya_k + 0 - a_k\sum_{y_i=1}\Lambda_i-\sum_{y_i=0}\Lambda_ix_i =0$$

$$\Rightarrow a_k\left(n_y-\sum_{y_i=1}\Lambda_i\right) -\sum_{y_i=0}\Lambda_ix_i =0$$

and using $(4)$ we obtain

$$a_k\sum_{y_i=0}\Lambda_i -\sum_{y_i=0}\Lambda_ix_i =0 \Rightarrow \sum_{y_i=0}(a_k-x_i)\Lambda_i=0 \tag {5}$$

So: if the specification contains a constant term and there is perfect separation with respect to the regressor $X$, the MLE will attempt to satisfy, among the other first-order conditions, eq. $(5)$ as well.

But note that the summation is over the sub-sample where $y_i=0$, in which $x_i\neq a_k$ by assumption. This implies the following:
1) If $X$ is dichotomous in the sample, then $(a_k-x_i)$ equals the same non-zero number for every $i$ in the summation in $(5)$.
2) If $X$ is not dichotomous in the sample, but $a_k$ is either its minimum or its maximum value in the sample, then again $(a_k-x_i)$ is non-zero, and of the same sign, for every $i$ in the summation in $(5)$.

In these two cases all the terms $(a_k-x_i)$ in the summation share the same sign, and since $\Lambda_i$ is strictly positive for any finite parameter values, the only way that eq. $(5)$ can be satisfied is for $\Lambda_i$ to be driven to $0$ for all $i$ in the summation. But

$$\Lambda_i = \frac 1{1+e^{-g(\beta_0,x_i, \mathbf z_i)}}$$

and so the only way that $\Lambda_i$ can approach $0$ is if the parameter estimates are such that $g(\beta_0,x_i, \mathbf z_i) \rightarrow -\infty$. And since $g(\cdot)$ is linear in the parameters, this implies that at least one of the parameter estimates must diverge to infinity: this is what it means for the MLE to "break down", namely to fail to produce finite-valued estimates. So cases 1) and 2) are sufficient conditions for a break-down of the MLE procedure.
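A quick way to see this break-down in practice is to watch where the algorithm stops. A rough sketch using glm (my own toy sample, completely separated): the reported estimate is simply the point at which the iterations happened to halt, and tightening the convergence tolerance typically pushes it further out, because there is no finite maximizer to converge to.

```r
# Completely separated sample: x = a_k = 1 exactly when y = 1
y <- c(1, 1, 1, 0, 0, 0)
x <- c(1, 1, 1, 0, 0, 0)
for (eps in c(1e-4, 1e-8, 1e-12)) {
  fit <- suppressWarnings(glm(y ~ x, family = binomial,
                              control = glm.control(epsilon = eps, maxit = 1000)))
  cat("epsilon =", eps, " estimated coefficient on x =", round(coef(fit)["x"], 1), "\n")
}
# The estimate keeps growing as the tolerance tightens -- it never settles on a finite value
```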

But consider now the case where $X$ is not dichotomous and $a_k$ is neither its minimum nor its maximum value in the sample. We still have complete separation ("perfect prediction"), but now some of the terms $(a_k-x_i)$ in eq. $(5)$ are positive and some are negative. This means it is possible for the MLE to satisfy eq. $(5)$ while producing finite estimates for all parameters. And simulation results confirm that this is so.
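Here is a small sketch along those lines (not the original simulations): $X$ takes three values, the separating value $a_k=1$ lies in the interior, $y_i=1$ exactly when $x_i=1$, and glm converges to finite estimates without complaint; one can also check eq. $(5)$ numerically at the fitted values.

```r
# Complete separation at an interior value of x: y = 1 exactly when x = 1
x <- c(0, 0, 0, 1, 1, 1, 2, 2, 2)
y <- as.numeric(x == 1)
fit <- glm(y ~ x, family = binomial)
coef(fit)  # finite estimates; the slope is essentially zero, since the logit is monotone
           # in x and therefore cannot exploit separation at an interior value
lam <- fitted(fit)
sum((1 - x[y == 0]) * lam[y == 0])  # eq. (5) with a_k = 1: essentially zero
```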

I am not saying that such a sample has no undesirable consequences for the properties of the estimator, etc.; I am only noting that in such a case the estimation algorithm will run as usual.

Moreover, simulation results show that if there is no constant term in the specification, $X$ is not dichotomous but $a_k$ is an extreme value, and other regressors are present, the MLE will again run. This indicates that the presence of the constant term, whose consequence we used in the previous results (namely the requirement that the MLE satisfy eq. $(1)$), is important.