I would like to run a gene x environment interaction analysis on 1:1 matched case-control samples. I went through the related publications, and in most of them the authors used either Stata or SAS. I found a few references for fitting conditional logistic regression in R, for example with clogit from the survival package, but I couldn't find any reference for adding interaction terms to conditional logistic models in R. Can someone point me to references for interaction analysis using conditional logistic regression in R?
Solved – interaction term in conditional logistic regression
clogit, interaction, logit, r, regression
Related Solutions
I don't agree that you sampled on the outcome, since you sampled on company and enrollment is your outcome. You may want to deal with the company as a random effect and the other features as fixed effects. So I am suggesting yet a third alternative: generalized mixed models.
After clarification: If the outcome is company enrollment rather than employee enrollment, then it is an ordinary case-control study for which unconditional logistic regression should be the standard approach. Conditional logistic regression is not necessary unless there were further conditions on the sampling regarding other company features.
Further clarification: If you are using R, then the package to identify and install would be, not surprisingly, "survey" by Thomas Lumley. It provides for the appropriate incorporation of the two-way sampling strategy you outlined in the design phase, prior to estimation with the svyglm() function. Stata also has a set of survey functions, and I imagine they can be used with the generalized linear modeling functions it provides. SAS didn't have such facilities in the past, so the SUDAAN program was needed as an added (expensive) purchase, but I have a vague memory that this may have changed with its latest releases. (I don't know about SPSS with regard to sampling support for GLMs.)
Your reference says that clogit is a special form of Cox regression, not the GLMM. So you are probably mixing things up.
The conditional logit likelihood is (reverse engineering the LaTeX code from the Stata manual): conditional on $\sum_{j=1}^{n_i} y_{ij} = k_{1i}$,
$$
{\rm Pr}\Bigl[(y_{i1},\ldots,y_{i{n_i}})\Bigm|\sum_{j=1}^{n_i} y_{ij} = k_{1i}\Bigr] = \frac{\exp(\sum_{j=1}^{n_i} y_{ij} x_{ij}'\beta)}{\sum_{{\bf d}_i\in S_i}\exp(\sum_{j=1}^{n_i} d_{ij} x_{ij}'\beta)}
$$
where $S_i$ is the set of all possible combinations of $n_i$ binary outcomes with $k_{1i}$ ones and the remaining zeroes, so the summation index vector ${\bf d}_i$ has components $d_{ij}$ that are 0/1 with $\sum_{j=1}^{n_i} d_{ij} = k_{1i}$. That's a pretty weird likelihood to me. Denoting the denominator as $f_i(n_i,k_{1i})$, the conditional log-likelihood is
$$
\ln L = \sum_{i=1}^n \biggl[ \sum_{j=1}^{n_i} y_{ij} x_{ij}'\beta - \ln f_i(n_i, k_{1i}) \biggr]
$$
This likelihood can be computed exactly, although the computational time goes up steeply as $p^2 \sum_{i=1}^n n_i \min(k_{1i}, n_i - k_{1i})$, where $p={\rm dim}\, \beta = {\rm dim}\, x_{ij}$. This likelihood should be identical to that of the stratified Cox regression, which I won't try to write out here.
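The sum over $S_i$ can be spelled out directly. Below is a minimal Python sketch (standard library only) that evaluates the conditional log-likelihood by brute force, summing over every way of placing $k_{1i}$ ones among the $n_i$ members of a matched set. The function name and the data layout are assumptions made for illustration, and the plain enumeration is far slower than the cost estimate quoted above suggests real software is.

```python
from itertools import combinations
from math import exp, log

def conditional_loglik(beta, groups):
    """Conditional logit log-likelihood by brute-force enumeration.

    groups: list of (y, X) pairs, one per matched set, where y is a list of
    0/1 outcomes and X is a list of covariate rows (one row per member).
    """
    total = 0.0
    for y, X in groups:
        # Linear predictor x_ij' beta for every member of the set
        scores = [sum(b * x for b, x in zip(beta, row)) for row in X]
        k = sum(y)  # number of cases in the set, k_1i
        numerator = sum(s for s, yj in zip(scores, y) if yj == 1)
        # Denominator f_i(n_i, k_1i): every way to pick k members as "cases"
        denom = sum(exp(sum(scores[j] for j in idx))
                    for idx in combinations(range(len(y)), k))
        total += numerator - log(denom)
    return total

# One 1:1 matched pair with made-up values: the case has x = 1, the control x = 0.
# The result equals 0.5 - log(exp(0.5) + 1).
pair = ([1, 0], [[1.0], [0.0]])
print(conditional_loglik([0.5], [pair]))
```

With $\beta = 0$ every arrangement is equally likely, so a set of size $n_i$ with $k_{1i}$ cases contributes $-\ln\binom{n_i}{k_{1i}}$, which is a convenient sanity check.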
The mixed model likelihood (again, adapted from the Stata manuals) is based on integrating out the random effects:
$$
{\rm Pr}(y_{i1}, \ldots, y_{i{n_i}} |x_{i1}, \ldots, x_{i{n_i}})=\int_{-\infty}^{+\infty} \frac{\exp(-\nu_i^2/2\sigma_\nu^2)}{\sigma_\nu \sqrt{2\pi}} \prod_{j=1}^{n_i}F(y_{ij}, x_{ij}'\beta + \nu_i)\, d\nu_i
$$
where $F(y,z) = \Bigl\{ 1+\exp\bigl[ (-1)^y z \bigr] \Bigr\}^{-1}$ is a witty way to write down the logistic contribution for the outcome $y=0,1$. This likelihood cannot be computed exactly, and in practice it is approximated numerically using a set of Gaussian quadrature points with abscissas $a_m$ and weights $w_m$ that mimic the standard normal density on a grid, producing (in the simplest version)
$$
\ln L \approx \sum_{i=1}^n \ln\biggl[ \sqrt{2}\,\sigma_\nu \sum_{m=1}^M w_m \frac{1}{\sigma_\nu \sqrt{2\pi}} \prod_{j=1}^{n_i}F(y_{ij}, x_{ij}'\beta + \sqrt{2} \sigma_\nu a_m) \biggr]
$$
(The $\exp(-\nu_i^2/2\sigma_\nu^2)$ term disappears because it is absorbed into the quadrature weights, but since the quadrature rule is designed for the physicists' erf() function rather than the statisticians' $\Phi()$ function, it works with $\exp(-z^2)$ rather than $\exp(-z^2/2)$; hence the weird $\sqrt{2}$ in a couple of places.) Computational time for $\ln L$ itself is proportional to $nM$, but since you need the second-order derivatives for Newton-Raphson, feel free to multiply by $p^2$. Smarter computational schemes, known as adaptive Gaussian quadrature, try to find better location and scale parameters for the quadrature to make the approximation more accurate.
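As a rough illustration of the non-adaptive approximation, the Python sketch below evaluates one cluster's contribution to $\ln L$ with a 3-point Gauss-Hermite rule, using only the standard library. The function names are made up for this example. $M=3$ is chosen only because its nodes and weights have a simple closed form, and real software uses more points. The change of variables $\nu = \sqrt{2}\sigma_\nu a$ and the normal density constant are folded into a single $1/\sqrt{\pi}$ factor.

```python
from math import exp, log, pi, sqrt

# 3-point Gauss-Hermite rule for integrals of the form exp(-a^2) * g(a)
GH_NODES = (-sqrt(1.5), 0.0, sqrt(1.5))
GH_WEIGHTS = (sqrt(pi) / 6, 2 * sqrt(pi) / 3, sqrt(pi) / 6)

def F(y, z):
    """Logistic contribution P(y | linear predictor z) for y in {0, 1}."""
    return 1.0 / (1.0 + exp((-1) ** y * z))

def cluster_loglik(y, xb, sigma):
    """Approximate log marginal likelihood of one cluster.

    y: list of 0/1 outcomes; xb: list of fixed-effect linear predictors
    x_ij' beta; sigma: standard deviation of the random intercept.
    """
    total = 0.0
    for a, w in zip(GH_NODES, GH_WEIGHTS):
        nu = sqrt(2) * sigma * a  # random-effect value at this node
        prod = 1.0
        for yj, xbj in zip(y, xb):
            prod *= F(yj, xbj + nu)
        total += w * prod
    # sqrt(2)*sigma (Jacobian) times 1/(sigma*sqrt(2*pi)) equals 1/sqrt(pi)
    return log(total / sqrt(pi))

# Made-up cluster: two observations, random-intercept SD of 1
print(cluster_loglik([1, 0], [0.3, -0.2], 1.0))
```

Two quick checks on the sketch: with `sigma = 0` it collapses to the ordinary logistic log-likelihood, and for a single observation with linear predictor 0 the symmetric rule returns a marginal probability of exactly one half.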
In fact, that latter Stata manual describes the differences between the GLMM (aka random effect xtlogit, in econometric slang) and conditional logit (aka fixed effect xtlogit), and might be worth a more serious reading.
Best Answer
As stated here (http://www.ats.ucla.edu/stat/stata/library/sg124.pdf), interaction or effect modification...is performed by including and evaluating the significance of second or higher order terms involving the two or more variables that are postulated to possibly modify their respective effects.
I'm trying something similar with clogit() in R but have not found much information about it on the web, except this link (https://stackoverflow.com/questions/20977401/coxph-x-matrix-deemed-to-be-singular), which discusses the problems and errors encountered when using interaction terms with the function coxph(). Since clogit() is a wrapper around coxph(), I thought this could be useful.
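To make the "second or higher order terms" concrete for the 1:1 matched design in the question: the interaction is simply one more covariate, the product of the two exposures. The sketch below is in Python with the standard library only. The function names, the 0/1 coding of gene and environment, and the coefficient values are all assumptions for illustration, not taken from the linked references. For a 1:1 pair, the conditional likelihood contribution is a logistic function of the case-minus-control covariate difference, so no intercept appears.

```python
from math import exp, log

def design_row(gene, env):
    """Covariates for one subject: main effects plus their product (the interaction)."""
    return [gene, env, gene * env]

def pair_loglik(beta, pairs):
    """Conditional log-likelihood for 1:1 matched pairs.

    pairs: list of ((gene, env) of the case, (gene, env) of the control).
    """
    total = 0.0
    for case, control in pairs:
        x_case = design_row(*case)
        x_ctrl = design_row(*control)
        # Only the within-pair difference matters; anything shared by the pair cancels
        eta = sum(b * (xc - x0) for b, xc, x0 in zip(beta, x_case, x_ctrl))
        total += -log(1.0 + exp(-eta))
    return total

# Hypothetical log odds ratios for gene, env, and gene x env
beta = [0.2, 0.5, 0.7]
# One made-up pair: both members carry the variant, only the case is exposed
print(pair_loglik(beta, [((1, 1), (1, 0))]))
```

With these hypothetical coefficients, exp(0.7) would be read as the multiplicative interaction: the factor by which the odds ratio for the exposure differs between carriers and non-carriers. In R, a formula term such as `gene * env` expands to both main effects plus the product column, so a call along the lines of `clogit(case ~ gene * env + strata(pair), data = d)` is the usual way to request it, where `case`, `gene`, `env`, `pair`, and `d` are placeholder names.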