I'm measuring a single binary outcome, with independent variables:
1) Treatment versus control. Each participant is one or the other.
2) "Before" versus "after" — each participant has their outcome measured both before and after an interview.
3) Demographic variables such as age and sex, which I may or may not include in the model.
Since this is paired data, ordinary logistic regression isn't appropriate, so I'm using conditional logistic regression with the participants as the strata. However, when I run it in SAS (minimal code below), I get two messages: "the conditional distribution is degenerate" and "ERROR: All explanatory variables are dependent on the strata."
My questions:
1) What is SAS trying to tell me?
2) If the problem is separation, of course some variables — such as whether the participant is in the control versus the treatment group — are completely predicted by the subject ID. Does this mean that conditional logistic regression is not usable in this context?
3) BUT, from what I understand, exact regression is supposed to be a solution to the issue of separation (i.e., empty cells). So why is this an issue?
4) I'm open to being told that I really should use GEEs or GLMs for this, but then I'd like to understand why conditional logistic regression isn't appropriate.
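For reference on questions 2 and 4, here is a sketch of the standard conditional likelihood contribution of one participant. It assumes the same model as the simulation below: a subject-specific intercept plus beta1 for treatment and beta2 for after, with the two measurements independent given that intercept. The symbols \alpha_i (intercept of subject i), x_i (treatment indicator) and y_{it} (outcome at after = t) are notation introduced here, not taken from the SAS output.

```latex
\operatorname{logit} \Pr(y_{it} = 1) = \alpha_i + \beta_1 x_i + \beta_2 t,
\qquad t \in \{0, 1\}

\Pr\bigl(y_{i0} = 0,\ y_{i1} = 1 \,\big|\, y_{i0} + y_{i1} = 1\bigr)
  = \frac{e^{\alpha_i + \beta_1 x_i + \beta_2}}
         {e^{\alpha_i + \beta_1 x_i} + e^{\alpha_i + \beta_1 x_i + \beta_2}}
  = \frac{e^{\beta_2}}{1 + e^{\beta_2}}
```

Both \alpha_i and \beta_1 x_i cancel, so anything that is constant within a stratum, including the treatment main effect, drops out of the conditional likelihood. Participants whose two outcomes are equal contribute a constant factor of 1. A term that varies within a participant, such as a treatment-by-after interaction, would not cancel in this way.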
SAS code
First, simulate some data:
/* subject:   unique to participant, two measurements per subject.
   treatment: 0/1, control versus treatment group.
   after:     0/1, for measurement before versus after interview.
   baseP:     baseline probability of the outcome (sets the intercept);
              varies by subject. Random unif(0.3, 0.7).
   p:         probability of outcome.
   OC:        outcome, 0/1.
   nPoints:   number of data points to simulate (two per subject).
   beta1:     coefficient for treatment.
   beta2:     coefficient for before/after.
*/
%let beta1 = 1.25;
%let beta2 = -0.65;
%let nPoints = 24;
data dataset;
    call streaminit(1);
    do subject = 1 to &nPoints/2;
        treatment = (subject > &nPoints/4);
        baseP = RAND("unif") * 0.4 + 0.3;
        do after = 0 to 1;
            beta0 = log(baseP / (1 - baseP));
            logOdds = beta0 + &beta1*treatment + &beta2*after;
            p = exp(logOdds) / (exp(logOdds) + 1);
            OC = (RAND("uniform") < p);
            output;
        end;
    end;
run;
We can look at the data:
proc print
    data = dataset
    noobs;
    var subject treatment after p OC;
run;
subject  treatment  after        p  OC
      1          0      0  0.65355   0
      1          0      1  0.49617   0
      2          0      0  0.65478   0
      2          0      1  0.49753   0
      3          0      0  0.67151   1
      3          0      1  0.51626   0
      4          0      0  0.65086   1
      4          0      1  0.49320   0
      5          0      0  0.62938   0
      5          0      1  0.46992   0
      6          0      0  0.34572   0
      6          0      1  0.21620   0
      7          1      0  0.71013   0
      7          1      1  0.56120   0
      8          1      0  0.86254   1
      8          1      1  0.76612   1
      9          1      0  0.80129   1
      9          1      1  0.67796   1
     10          1      0  0.63276   0
     10          1      1  0.47354   0
     11          1      0  0.82020   1
     11          1      1  0.70426   1
     12          1      0  0.72707   1
     12          1      1  0.58172   0
Finally, the regression code:
proc logistic
    data = dataset;
    strata subject;
    class treatment (ref="0")
        / param=ref;
    model OC(event="1") = treatment after;
    exact treatment after / estimate=both;
run;
With log results:
NOTE: Convergence criterion (ABSGCONV=0) satisfied.
NOTE: Linear dependency among the parameters has been detected. Iterations will restart.
ERROR: All explanatory variables are dependent on the strata.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 24 observations read from the data set WORK.DATASET.
And selected results:
Exact Parameter Estimates
Parameter     Estimate      Standard Error   95% Confidence Limits   Two-sided p-Value
treatment 1    .        #   .                .           .           .
after         -1.3474   *   .                -Infinity   0.5391      0.2500
Note: # indicates that the conditional distribution is degenerate.
      * indicates a median unbiased estimate.
So what's happening?
Replying to @DJohnson:
Yes, that Allison article is a great resource, and I've been staring at it for the last few days. I can't see any separation, though. If you look at the cross-tabulations,
proc freq
    data = dataset;
    tables treatment*OC after*OC treatment*after*OC;
run;
The only cell with no outcome=1 is treatment=0, after=1. If you change one of those 6 datapoints to outcome=1,
data dataset2;
    set dataset;
    if (treatment=0 & after=1 & subject=3) then do;
        OC=1;
    end;
run;
regression still gets the exact same error.
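One possible reason the edit makes no difference: with subject as the stratum, only participants whose outcome changes between before and after carry information about the after coefficient, which is a standard property of conditional logistic regression on matched pairs. The sketch below is a hand check in Python, used only to keep the example self-contained; it copies the rows from the PROC PRINT listing and mimics the dataset2 edit. The helper names are hypothetical, and it does not attempt to reproduce SAS's internal diagnostics.

```python
# (subject, treatment, after, OC) rows copied from the PROC PRINT listing
ROWS = [
    (1, 0, 0, 0), (1, 0, 1, 0), (2, 0, 0, 0), (2, 0, 1, 0),
    (3, 0, 0, 1), (3, 0, 1, 0), (4, 0, 0, 1), (4, 0, 1, 0),
    (5, 0, 0, 0), (5, 0, 1, 0), (6, 0, 0, 0), (6, 0, 1, 0),
    (7, 1, 0, 0), (7, 1, 1, 0), (8, 1, 0, 1), (8, 1, 1, 1),
    (9, 1, 0, 1), (9, 1, 1, 1), (10, 1, 0, 0), (10, 1, 1, 0),
    (11, 1, 0, 1), (11, 1, 1, 1), (12, 1, 0, 1), (12, 1, 1, 0),
]


def flip_subject3_after(rows):
    """Mimic dataset2: set OC=1 where treatment=0, after=1 and subject=3."""
    return [
        (s, t, a, 1) if (t == 0 and a == 1 and s == 3) else (s, t, a, oc)
        for s, t, a, oc in rows
    ]


def events_in_cell(rows, treatment, after):
    """Number of OC=1 rows in one treatment-by-after cell."""
    return sum(oc for _s, t, a, oc in rows if t == treatment and a == after)


def changers(rows):
    """Subjects whose OC differs before vs. after, as {subject: (before, after)}."""
    outcomes = {}
    for s, _t, a, oc in rows:
        outcomes.setdefault(s, {})[a] = oc
    return {s: (o[0], o[1]) for s, o in outcomes.items() if o[0] != o[1]}


rows2 = flip_subject3_after(ROWS)
print("events in treatment=0, after=1:",
      events_in_cell(ROWS, 0, 1), "->", events_in_cell(rows2, 0, 1))
print("changers before edit:", changers(ROWS))
print("changers after edit: ", changers(rows2))
```

The edit fills the empty cell in the marginal table, but it also turns subject 3 from a changer into a non-changer. Every remaining changer still goes from 1 to 0, and treatment is still constant within each subject, so the situation inside the strata is unchanged.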
Also, as Allison says,
Exact logistic regression is designed to produce exact p-values for the null hypothesis that a specified predictor variable has a coefficient of 0, conditional on all the other predictors. These p-values, based on permutations of the data rather than on large-sample chi-square approximations, are essentially unaffected by complete or quasi-complete separation.
so why would this be a problem anyway?
Best Answer
You have documented the errors SAS is giving you thoroughly. Paul Allison's excellent and clearly written SAS proceedings paper (http://www2.sas.com/proceedings/forum2008/360-2008.pdf) goes into great detail about the reasons maximum likelihood estimation can fail. To me it sounds like your strata are collinear with Y, the other predictors, or both; for instance, as you note yourself, treatment never varies within a subject. Why not run PROC FREQ with the LIST option and look for it that way?
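The same check can be done by hand outside SAS. The sketch below uses Python only to stay self-contained, copies the rows from the PROC PRINT listing in the question, tests which predictors vary within a subject, and counts the subjects whose outcome changes. The function names are hypothetical, and the last two helpers assume the boundary case in which no subject moves from 0 to 1; this is a plausibility check against the exact output, not a description of SAS's algorithm.

```python
from math import comb, log

# (subject, treatment, after, OC) rows copied from the PROC PRINT listing
ROWS = [
    (1, 0, 0, 0), (1, 0, 1, 0), (2, 0, 0, 0), (2, 0, 1, 0),
    (3, 0, 0, 1), (3, 0, 1, 0), (4, 0, 0, 1), (4, 0, 1, 0),
    (5, 0, 0, 0), (5, 0, 1, 0), (6, 0, 0, 0), (6, 0, 1, 0),
    (7, 1, 0, 0), (7, 1, 1, 0), (8, 1, 0, 1), (8, 1, 1, 1),
    (9, 1, 0, 1), (9, 1, 1, 1), (10, 1, 0, 0), (10, 1, 1, 0),
    (11, 1, 0, 1), (11, 1, 1, 1), (12, 1, 0, 1), (12, 1, 1, 0),
]


def varies_within_subject(rows, col):
    """True if column index `col` takes more than one value inside any subject."""
    seen = {}
    for row in rows:
        seen.setdefault(row[0], set()).add(row[col])
    return any(len(values) > 1 for values in seen.values())


def discordant_counts(rows):
    """Return (# subjects going 1 -> 0, # subjects going 0 -> 1), before to after."""
    outcomes = {}
    for subject, _treatment, after, oc in rows:
        outcomes.setdefault(subject, {})[after] = oc
    down = sum(1 for o in outcomes.values() if (o[0], o[1]) == (1, 0))
    up = sum(1 for o in outcomes.values() if (o[0], o[1]) == (0, 1))
    return down, up


def exact_two_sided_p(n, k):
    """Two-sided exact p-value for k successes out of n under Binomial(n, 1/2)."""
    lower = sum(comb(n, j) for j in range(0, k + 1)) / 2 ** n
    upper = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * min(lower, upper))


def boundary_log_odds(n, tail):
    """Boundary case k = 0 of n: solve (1 - pi)^n = tail, return logit(pi)."""
    pi = 1 - tail ** (1 / n)
    return log(pi / (1 - pi))


down, up = discordant_counts(ROWS)
n = down + up
print("treatment varies within subject:", varies_within_subject(ROWS, 1))
print("after varies within subject:    ", varies_within_subject(ROWS, 2))
print("discordant pairs (1->0, 0->1):  ", down, up)
print("exact two-sided p for after:    ", exact_two_sided_p(n, up))
print("median unbiased estimate:       ", round(boundary_log_odds(n, 0.5), 4))
print("upper limit, one-sided 0.05:    ", round(boundary_log_odds(n, 0.05), 4))
```

On these rows treatment never varies within a subject, which fits the "degenerate" flag on its row. Only three subjects change outcome, all from 1 to 0, so the conditional likelihood for after has no finite maximum. The last three printed values match the after row of the exact output (0.2500, -1.3474 and 0.5391), which suggests the exact analysis is working from those three pairs alone.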