Question:
I have matched case-control data and I would like to take advantage of that in my GEE analysis.
In the standard approach to GEE analysis, we call each subject a cluster and fit subject-specific intercepts (please correct me if I'm wrong, this is my current understanding) to control for subject-specific variation.
I would like to make the case-control matchings my clusters, then I can control for variation related to the matching variable(s) that my matching controls for; this seems ideal. Clustering on the matching groups should also control for time (in the same way subject-clustering does) since each subject in each matched-grouping has temporal data.
Note: The matching contains 8 cases, 51 controls. There are approximately 12 controls matched to each case.
One objection I can think of:
Low cluster number: GEE requires $c > 50$, but $c > 100$ is preferable. Clustering based on case-control matchings will give $c \ll 50$.
Example in R
data =
time x y Disease Subject case_control_grouping
1 .2 .3 0 A 1
2 .3 .4 0 A 1
1 .5 .7 1 B 1
2 .6 .4 1 B 1
1 . . 0 C 2
2 . . 0 C 2
1 . . 1 D 2
2 . . 1 D 2
1 . . 0 E 2
2 . . 0 E 2
library(geepack)  # geeglm() is provided by the geepack package
# geeglm() expects each cluster's rows to be contiguous, so sort by the id variable first.
standard_clustering = geeglm(Disease ~ time + x + y, data = data,
    id = Subject,
    corstr = 'exchangeable', family = binomial, std.err = 'san.se')
case_control_clustering = geeglm(Disease ~ time + x + y, data = data,
    id = case_control_grouping,
    corstr = 'exchangeable', family = binomial, std.err = 'san.se')
Why GEE?
The data is longitudinal so we needed a model that could account for multiple subjects with longitudinal observations and replicates (marginal model?). The temporal observations, $x$, are correlated to the nonparametric noise, $\epsilon$, so we wanted a population averaging approach to keep our estimator unbiased and consistent.
Why cluster by case-control matches rather than subject?
Clustering by subject is standard practice in marginal models. However, the subjects within a case-control match should be more similar to each other than to subjects in other matchings:
$m_1 = \{s_1, s_2, \dots\}\\
m_2 = \{s_3, s_4, \dots\}\\
\operatorname{Cov}(s_1, s_2) > \operatorname{Cov}(s_1, s_3)$
Therefore, clustering by case-control matching should better satisfy assumptions 1 and 2 of "cluster data": 1) observations within a cluster may be correlated, 2) observations in separate clusters are independent, 3) a monotone transformation of the expectation is linearly related to the explanatory variables, 4) the variance is a function of the expectation (Halekoh, 2006, Introduction). I think this will improve the third assumption as well, because my data is not guaranteed to be continuous (the observation window is larger than the period of the data).
Why not clogit?
Conditional logistic regression is a common model for logistic modeling of case-control data. I was not able to find a population-averaging implementation of clogit.
Best Answer
When we speak of clustering, we mean subjects whose outcomes are probabilistically correlated in ways beyond what the risk factors in the model explain. In case-control designs, treating matched cases and controls as "correlated" makes no sense, because the outcome itself determined the matching; the matching is therefore deterministic. Indeed, within a matched pair the correlation between the two outcomes is exactly $-1$ (if one member is a case, the other is a control, regardless of the distribution of risk factors), and this gives you a singular correlation matrix. So NO, an exchangeable correlation matrix is a BAD idea.
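To make the singularity concrete, here is a minimal sketch in base R for the 1:1 matched-pair case. The vectors `first_member` and `second_member` are hypothetical illustration data (not taken from the question); the only assumption is that each pair contains exactly one case.

```r
# Hypothetical 1:1 matched pairs: in each pair exactly one member is the case.
# first_member[i] is the outcome (1 = case) of the first subject in pair i.
first_member  <- rep(c(1, 0), times = 10)
second_member <- 1 - first_member   # the other member is forced to the opposite outcome

# Within-pair correlation of the outcomes: -1 up to floating-point error.
rho <- cor(first_member, second_member)

# The implied 2 x 2 working correlation matrix has determinant 1 - rho^2,
# which is (numerically) zero, so the matrix cannot be inverted.
R <- matrix(c(1, rho, rho, 1), nrow = 2)
print(rho)
print(det(R))
```

With more than one control per case the within-set correlations are no longer exactly $-1$, but the outcomes in a matched set are still fixed by design rather than random, which is the point of the paragraph above.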
For modeling risk factors in a case-control study, you have two approaches:
What is the rationale for using GEE for logistic regression models for case-control outcomes? Consistent estimation under model misspecification. For logistic regression models, noncollapsibility of the odds ratio will lead to biased estimates when risk factors are omitted, and in virtually every study a binary outcome has several unmeasured risk factors. With the robust (sandwich-based) standard error for GEEs, you will obtain a confidence interval that appropriately summarizes the population-averaged odds ratio for an effect when risk factors are omitted from the model formulation (see Greenland 1999 and Miettinen 1972). The independence correlation structure should be used to obtain an efficient sandwich-based estimate of the standard error for the GEE in case-control studies. See the sandwich package and the help file for vcovHC for examples of code usage.
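As a rough illustration of what the independence working correlation with a sandwich standard error amounts to, the following base-R sketch fits an ordinary logistic regression and computes a subject-clustered sandwich covariance by hand. The data are simulated and purely hypothetical (the column names mirror the question's layout, but the values, the sample size, and the choice to cluster on `Subject` are assumptions made for the example). In practice one would more likely use `geeglm(..., corstr = 'independence')` from geepack or the sandwich package; those may apply small-sample corrections that this sketch omits.

```r
# Simulated (hypothetical) data: 40 subjects, 2 time points each.
set.seed(1)
n_subj <- 40
dat <- data.frame(
  Subject = rep(seq_len(n_subj), each = 2),
  time    = rep(1:2, times = n_subj),
  x       = rnorm(2 * n_subj)
)
# Disease status is constant within a subject, as in the question's data layout.
dat$Disease <- rep(rbinom(n_subj, 1, 0.4), each = 2)

# Under an independence working correlation, the GEE point estimates
# coincide with ordinary logistic regression.
fit <- glm(Disease ~ time + x, family = binomial, data = dat)

X <- model.matrix(fit)
r <- dat$Disease - fitted(fit)   # for the logit link, row i's score contribution is X[i, ] * r[i]

bread  <- vcov(fit)                            # model-based (naive) covariance, (X' W X)^{-1}
scores <- rowsum(X * r, group = dat$Subject)   # sum score contributions within each subject
meat   <- crossprod(scores)                    # sum over subjects of U_i U_i'

robust_vcov <- bread %*% meat %*% bread        # sandwich: bread * meat * bread
robust_se   <- sqrt(diag(robust_vcov))

print(cbind(estimate  = coef(fit),
            naive_se  = sqrt(diag(bread)),
            robust_se = robust_se))
```

The point estimates are unchanged by the choice of standard error; only the uncertainty attached to them differs between the naive and robust columns.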