Using conditional logistic regression for repeated measures, complete separation (and secondarily, proc logistic)

logistic, sas, separation

I'm measuring a single binary outcome, with independent variables:

1) Treatment versus control. Each participant is one or the other.

2) "Before" versus "after" — each participant has their outcome measured both before and after an interview.

3) Demographic variables such as age and sex, which I may or may not include in the model.

Since this is paired data, ordinary logistic regression isn't appropriate, so I'm using conditional logistic regression with the participants as the strata. However, when I run it in SAS (minimal code below), I get two messages: "the conditional distribution is degenerate" and "ERROR: All explanatory variables are dependent on the strata."

My questions:

1) What is SAS trying to tell me?

2) If the problem is separation: some variables, such as whether the participant is in the control or the treatment group, are of course completely determined by the subject ID. Does this mean that conditional logistic regression is not usable in this context?

3) BUT, from what I understand, exact regression is supposed to be a solution to the issue of separation (i.e., empty cells). So why is this an issue?

4) I'm open to being told that I really should use GEEs or GLMs for this, but then I'd like to understand why conditional logistic regression isn't appropriate.

SAS code

First, simulate some data:

/*  subject: unique to participant, two measurements per subject.
    treatment: 0/1, control versus treatment group
    after: 0/1, for measurement before versus after interview
    baseP: baseline probability of outcome (determines the intercept),
        varies by subject. Random unif(0.3, 0.7)
    p: probability of outcome.
    OC: outcome, 0/1.

    nPoints: number of datapoints to simulate.
    beta1: coefficient for treatment.
    beta2: coefficient for before/after.
*/

%let beta1 = 1.25;
%let beta2 = -0.65;
%let nPoints = 24;
data dataset;
    call streaminit(1);
    do subject = 1 to &nPoints/2;
        treatment = (subject > &nPoints/4);
        baseP = RAND("unif") * 0.4 + 0.3;
        do after = 0 to 1;
            beta0 = log(baseP /  (1 - baseP));
            logOdds = beta0 + &beta1*treatment + &beta2*after;
            p = exp(logOdds) / (exp(logOdds) + 1);
            OC = (RAND("uniform") < p);
            output;
        end;
    end;
run;

We can look at the data:

proc print
    data = dataset
    noobs;
    var subject treatment after p OC;
run;

subject    treatment  after      p                OC          
1          0          0          0.65355          0          
1          0          1          0.49617          0          
2          0          0          0.65478          0          
2          0          1          0.49753          0          
3          0          0          0.67151          1          
3          0          1          0.51626          0          
4          0          0          0.65086          1          
4          0          1          0.49320          0          
5          0          0          0.62938          0          
5          0          1          0.46992          0          
6          0          0          0.34572          0          
6          0          1          0.21620          0          
7          1          0          0.71013          0          
7          1          1          0.56120          0          
8          1          0          0.86254          1          
8          1          1          0.76612          1          
9          1          0          0.80129          1          
9          1          1          0.67796          1          
10         1          0          0.63276          0          
10         1          1          0.47354          0          
11         1          0          0.82020          1          
11         1          1          0.70426          1          
12         1          0          0.72707          1          
12         1          1          0.58172          0          
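
Since the strata are the subjects, it can also help to summarise the data with one row per subject. The following is a minimal sketch, not part of the original post; the table name subject_summary and its column names are hypothetical.

```sas
/* Hypothetical helper table: one row per subject.              */
/* treatment_min = treatment_max means treatment never changes  */
/* within that subject; discordant = 1 means the two outcomes   */
/* (before and after) differ.                                   */
proc sql;
    create table subject_summary as
    select subject,
           min(treatment) as treatment_min,
           max(treatment) as treatment_max,
           sum(OC)        as n_events,
           (sum(OC) = 1)  as discordant
    from dataset
    group by subject;
quit;

proc print
    data = subject_summary
    noobs;
run;
```

Reading the printed data above, treatment takes a single value within every subject, and only subjects 3, 4 and 12 have exactly one event across their two measurements.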

Finally, the regression code:

proc logistic
    data = dataset;
    strata subject;
    class treatment (ref="0")
        / param=ref;
    model OC(event="1") = treatment after;
    exact treatment after / estimate=both;
run;

With log results:

NOTE: Convergence criterion (ABSGCONV=0) satisfied.
NOTE: Linear dependency among the parameters has been detected.  Iterations will restart.
ERROR: All explanatory variables are dependent on the strata.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 24 observations read from the data set WORK.DATASET.

And selected results:

Exact Parameter Estimates

Parameter   Estimate       Standard Error   95% Confidence Limits   Two-sided p-Value

treatment 1 .          #   .                .          .            . 

after       -1.3474    *   .                -Infinity   0.5391      0.2500 

Note: # indicates that the conditional distribution is degenerate.
* indicates a median unbiased estimate. 
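
The `after` row can be reproduced by hand. This is a sketch under stated assumptions: the subject-specific model logit P(OC = 1) = alpha_i + beta1*treatment + beta2*after, and the standard result that, once each subject's total number of events is conditioned on, only subjects whose two outcomes differ contribute. In the printed data those are subjects 3, 4 and 12, and all three go from OC = 1 before to OC = 0 after. The symbols alpha_i, d, T and pi are introduced here for illustration and do not appear in the SAS output.

```latex
% d = number of discordant subjects (here 3)
% T = number of discordant subjects with OC = 1 at the "after" measurement
\begin{align*}
\pi(\beta_2) &= \Pr(\text{before}=0,\ \text{after}=1 \mid \text{discordant})
             = \frac{e^{\beta_2}}{1+e^{\beta_2}}, \\
T \mid d=3 &\sim \mathrm{Binomial}\bigl(3,\ \pi(\beta_2)\bigr), \qquad T_{\text{obs}} = 0, \\
p_{\text{two-sided}} &= 2\,\Pr(T \le 0 \mid \beta_2 = 0) = 2\left(\tfrac12\right)^{3} = 0.25, \\
\text{median unbiased: } \Pr(T = 0 \mid \beta_2) &= (1+e^{\beta_2})^{-3} = \tfrac12
   \;\Rightarrow\; \hat\beta_2 = \log\bigl(2^{1/3}-1\bigr) \approx -1.3474, \\
\text{upper limit: } (1+e^{\beta_2})^{-3} &= 0.05
   \;\Rightarrow\; \beta_2 = \log\bigl(20^{1/3}-1\bigr) \approx 0.5391.
\end{align*}
```

Because all three discordant pairs point the same way, the conditional maximum likelihood estimate is minus infinity, which is why a median unbiased estimate and a lower limit of -Infinity are reported. The tail probability of 0.05 in the last line is chosen because it reproduces the printed limit; treat it as an inference from the output rather than a statement of how SAS defines the interval.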

So what's happening?


Replying to @DJohnson:

Yes, that Allison article is a great resource, and I've actually been staring at it for the last few days. I can't see any separation, though. If you look at the cross-tabulations,

proc freq
    data = dataset;
    tables treatment*OC after*OC treatment*after*OC;
run;

The only cell with no outcome=1 is treatment=0, after=1. If you change one of those 6 datapoints to outcome=1,

data dataset2;
    set dataset;
    if (treatment=0 & after=1 & subject=3) then do;
        OC=1;
    end;
run;

the regression still gives exactly the same error.

Also, as Allison says,

Exact logistic regression is designed to produce exact p-values for the null hypothesis that a specified predictor variable has a coefficient of 0, conditional on all the other predictors. These p-values, based on permutations of the data rather than on large-sample chi-square approximations, are essentially unaffected by complete or quasi-complete separation.

so why would this be a problem anyway?
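
One way to see why the exact method does not rescue the `treatment` term is to write out the conditional likelihood for a single subject. The sketch assumes the model in the PROC LOGISTIC call, with a subject-specific intercept alpha_i that the STRATA statement conditions away. The symbols alpha_i, x_i, a_j and s_i are notation chosen here for illustration; beta1 and beta2 follow the simulation's macro variables.

```latex
\begin{align*}
\operatorname{logit}\Pr(\mathrm{OC}_{ij}=1) &= \alpha_i + \beta_1 x_i + \beta_2 a_j,
  \qquad x_i = \text{treatment},\ a_j \in \{0,1\} = \text{after}, \\
s_i &= \mathrm{OC}_{i0} + \mathrm{OC}_{i1}, \\
\Pr(\mathrm{OC}_{i0}=0,\ \mathrm{OC}_{i1}=1 \mid s_i = 1)
  &= \frac{e^{\alpha_i+\beta_1 x_i+\beta_2}}
          {e^{\alpha_i+\beta_1 x_i} + e^{\alpha_i+\beta_1 x_i+\beta_2}}
   = \frac{e^{\beta_2}}{1+e^{\beta_2}}, \\
\sum_{i}\sum_{j} x_i\,\mathrm{OC}_{ij} &= \sum_i x_i s_i
  \quad\text{(fixed once every } s_i \text{ is fixed).}
\end{align*}
```

Both alpha_i and beta1*x_i cancel, so the conditional likelihood carries no information about beta1. Equivalently, the sufficient statistic for beta1 in the last line takes a single value given the stratum totals, which is what a degenerate conditional distribution means. This follows from `treatment` being constant within each subject, not from separation, so it holds whatever the outcomes are; that is consistent with the error persisting after one outcome was edited in dataset2. A covariate that does vary within subject, such as a treatment-by-after product term, would not cancel in this way.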

Best Answer

You have provided abundant documentation of the errors SAS is giving you. Paul Allison's excellent and clearly written SAS proceedings paper (http://www2.sas.com/proceedings/forum2008/360-2008.pdf) goes into great detail about why maximum likelihood estimation can fail. To me it sounds like your strata are linear combinations of Y, the other predictors, or both. Why not run PROC FREQ with the LIST option and look for it that way?
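
A minimal sketch of that suggestion, assuming the simulated table `dataset` from the question. The choice of variables in the TABLES statement is a guess at what is useful here, not something the answer specifies.

```sas
/* LIST prints each observed combination on one line instead of   */
/* nested two-way tables; NOCUM drops the cumulative columns.     */
/* First request: which treatment value(s) occur per subject.     */
/* Second request: every subject/treatment/after/OC combination.  */
proc freq
    data = dataset;
    tables subject*treatment subject*treatment*after*OC
        / list nocum;
run;
```

In the first listing each subject should appear with exactly one value of treatment, which is the kind of dependence on the strata the answer is pointing at.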
