Solved – Case-control Matched Clustering in Generalized Estimation Equation (GEE) (R:geeglm)

case-control-study, cluster-sample, generalized-estimating-equations, mixed model

Question:

I have matched case-control data and I would like to take advantage of that in my GEE analysis.

In the standard approach to GEE analysis, we call each subject a cluster and fit subject-specific intercepts (please correct me if I'm wrong, this is my current understanding) to control for subject-specific variation.

I would like to make the case-control matchings my clusters, then I can control for variation related to the matching variable(s) that my matching controls for; this seems ideal. Clustering on the matching groups should also control for time (in the same way subject-clustering does) since each subject in each matched-grouping has temporal data.

Note: The matching contains 8 cases, 51 controls. There are approximately 12 controls matched to each case.

One objection I can think of:

Number of clusters: GEE requires $c>50$, and $c>100$ is preferable. Clustering based on case-control matchings will give $c \ll 50$.

Example in R

data =
time  x   y  Disease Subject case_control_grouping
1    .2  .3        0       A                     1
2    .3  .4        0       A                     1
1    .5  .7        1       B                     1
2    .6  .4        1       B                     1
1     .   .        0       C                     2
2     .   .        0       C                     2
1     .   .        1       D                     2
2     .   .        1       D                     2
1     .   .        0       E                     2
2     .   .        0       E                     2


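library(geepack)  # geeglm() is provided by the geepack package

# Rows must be sorted so that each cluster's observations are contiguous.
# Cluster on subject (the standard approach):
standard_clustering = geeglm(Disease ~ time + x + y, data = data,
    id = Subject,
    corstr = 'exchangeable', family = binomial, std.err = 'san.se')

# Cluster on the case-control matching instead:
case_control_clustering = geeglm(Disease ~ time + x + y, data = data,
    id = case_control_grouping,
    corstr = 'exchangeable', family = binomial, std.err = 'san.se')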

Why GEE?

The data is longitudinal, so we needed a model that could account for multiple subjects with longitudinal observations and replicates (marginal model?). The temporal observations, $x$, are correlated with the nonparametric noise, $\epsilon$, so we wanted a population-averaging approach to keep our estimator unbiased and consistent.

Why cluster by case-control matches rather than subject?

Clustering by subject is standard practice in marginal models. However, subjects should be more similar within a case-control matching than across matchings:

$m_1 = \{s_1, s_2, \dots\}\\
m_2 = \{s_3, s_4, \dots\}\\
\operatorname{Cov}(s_1,s_2) > \operatorname{Cov}(s_1,s_3)$

Therefore, clustering by case-control matching should better satisfy assumptions 1 and 2 of "cluster data": 1) observations within a cluster may be correlated, 2) observations in separate clusters are independent, 3) a monotone transformation of the expectation is linearly related to the explanatory variables, 4) the variance is a function of the expectation (Halekoh, 2006, Introduction). I think this will improve the third assumption as well, because my data is not guaranteed to be continuous (the observation window is larger than the period of the data).

Why not clogit?

Conditional logistic regression is a common model for logistic modeling of case-control data. I was not able to find a population-averaging implementation of clogit.

Best Answer

When we speak of clustering, we mean subjects whose outcomes are probabilistically correlated in ways beyond what the risk factors in the model explain. In case-control designs, treating case-control matched pairs as "correlated" makes no sense, because the outcome itself determined the matching; the matching is therefore deterministic. Indeed, the correlation between their outcomes is exactly negative 1 (if one is a case, the others are controls regardless of the distribution of risk factors), and this gives you a singular correlation matrix. So no, an exchangeable correlation matrix is a bad idea.
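The singularity can be checked numerically. The sketch below is only an illustration: `exch_corr` is a hypothetical helper, and the 1:m case rests on an added assumption (not stated in the answer) that each matched set holds exactly one case and that its members are exchangeable. Under that assumption the fixed sum of outcomes forces a common correlation of $-1/m$, which reduces to $-1$ for a 1:1 pair.

```r
# Hypothetical helper: exchangeable correlation matrix of size n with common correlation rho.
exch_corr <- function(n, rho) {
  R <- matrix(rho, nrow = n, ncol = n)
  diag(R) <- 1
  R
}

# 1:1 matching: the two outcomes always sum to 1, so rho = -1.
R_pair <- exch_corr(2, -1)
det_pair <- det(R_pair)  # zero determinant means the matrix is singular

# 1:m matching (m = 6 is an illustrative value): exactly one case among
# m + 1 exchangeable members forces rho = -1 / m.
m <- 6
R_set <- exch_corr(m + 1, -1 / m)
min_eig <- min(eigen(R_set, symmetric = TRUE)$values)  # smallest eigenvalue, numerically zero

print(det_pair)
print(min_eig)
```

In both cases the working correlation matrix cannot be inverted, which is the practical problem with an exchangeable structure on matched sets.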

For modeling risk factors in a case-control study, you have two approaches:

  1. Use a logistic regression model for the outcome. The model effects are odds ratios, and the odds ratio comparing the odds of disease in exposed and unexposed subjects is the same as the odds ratio comparing the odds of exposure in diseased and nondiseased subjects (Breslow and Day 1972). The fitted risk arising from such a model only estimates the probability of being included in the study as a case.
  2. Use inverse probability weighting to account for the sampling probabilities for controls (a fraction of cases usually) to estimate the same associations in a logistic model (or use any other model for a binary outcome, such as relative risk regression or Poisson regression). The fitted risk arising from such a weighted model estimates the population-level risk of the outcome, but is less precise and depends on untenable assumptions.
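The two approaches can be compared on simulated data. This is a minimal sketch, not the answer's own code: the population, the true coefficients, and the control sampling fraction `f_ctrl` are all made-up assumptions, and the sampling fraction is treated as known, which is rarely the case in practice.

```r
set.seed(2)

# Hypothetical source population with a rare outcome.
N <- 20000
pop <- data.frame(x = rnorm(N))
pop$Disease <- rbinom(N, 1, plogis(-4 + 0.7 * pop$x))

# Case-control sampling: keep every case and an assumed known fraction of controls.
f_ctrl <- 0.05
keep <- pop$Disease == 1 | runif(N) < f_ctrl
cc <- pop[keep, ]

# Approach 1: unweighted logistic regression on the sampled data.
unweighted <- glm(Disease ~ x, family = binomial, data = cc)

# Approach 2: weight each control by the inverse of its sampling probability.
# quasibinomial avoids warnings about non-integer weights; the model-based
# standard errors of this fit are not valid and would need a sandwich estimator.
cc$w <- ifelse(cc$Disease == 1, 1, 1 / f_ctrl)
weighted <- glm(Disease ~ x, family = quasibinomial, data = cc, weights = w)

print(rbind(unweighted = coef(unweighted), weighted = coef(weighted)))
```

The slopes (log odds ratios) should be similar in both fits, while only the weighted intercept is on the population scale; the unweighted intercept is shifted upward by roughly `log(1 / f_ctrl)`.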

What is the rationale for using GEE for logistic regression models for case-control outcomes? Consistent estimation under model misspecification. For logistic regression models, noncollapsibility of the odds ratio will lead to biased estimates when risk factors are omitted, and virtually every study of a binary outcome has several unmeasured risk factors. With the robust (sandwich-based) standard error for GEEs, you will obtain a confidence interval that appropriately summarizes the population-averaged odds ratio for an effect when risk factors are omitted from the model formulation (see Greenland 1999 and Miettinen 1972). The independence correlation structure should be used to obtain an efficient sandwich-based estimate of the standard error for the GEE in case-control studies. See the sandwich package and the help file for vcovHC for examples of code usage.
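To make the recommendation concrete, here is a minimal base-R sketch of a logistic fit with an independence working correlation and a sandwich variance clustered on subject. The data are simulated stand-ins (hypothetical values; only the column names follow the question), and the variance is computed by hand for illustration. In practice `geeglm` from geepack with `corstr = "independence"`, or the sandwich package, would be the usual route.

```r
set.seed(1)

# Simulated stand-in for the question's layout: two time points per subject.
n_subj <- 200
d <- data.frame(
  Subject = rep(seq_len(n_subj), each = 2),
  time    = rep(1:2, times = n_subj),
  x       = rnorm(2 * n_subj)
)
# Disease status is constant within a subject, as in the question's data.
d$Disease <- rep(rbinom(n_subj, 1, 0.2), each = 2)

# With an independence working correlation, the GEE point estimates are
# those of an ordinary logistic regression.
fit <- glm(Disease ~ time + x, family = binomial, data = d)

# Sandwich variance, clustered on Subject: bread %*% meat %*% bread.
X         <- model.matrix(fit)
raw_resid <- d$Disease - fitted(fit)                    # y - mu
bread     <- vcov(fit)                                  # (X' W X)^{-1} for the binomial family
scores    <- rowsum(X * raw_resid, group = d$Subject)   # per-subject sums of score contributions
meat      <- crossprod(scores)
V_robust  <- bread %*% meat %*% bread

robust_se <- sqrt(diag(V_robust))
print(cbind(estimate  = coef(fit),
            naive_se  = sqrt(diag(vcov(fit))),
            robust_se = robust_se))
```

Clustering on subject rather than on the matched set keeps the repeated measurements of one person together without imposing a correlation structure on the matched sets themselves.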
