I have a case-control study in which the cases are firms with health insurance and the controls are firms without health insurance. I am studying the factors affecting enrolment in health insurance, so I was using logistic regression with several covariates on firm characteristics measured in a survey. I randomly sampled the firms from a database stratified into insured and uninsured firms, selecting 65 from each group. Within each group I also sampled from four strata corresponding to industry. I am therefore wondering whether I need to use conditional logistic regression rather than unconditional logistic regression. However, I was under the impression that conditional logistic regression was for matched case-control studies or panel studies. In other feedback I've been told that because I sampled on the outcome, I need to use the conditional model. Could someone please help me figure out which model to use? Any references would also be much appreciated. Thank you.
Solved – How to decide between a logistic regression or conditional logistic regression
clogit, logistic, survey
Related Solutions
Your reference says that clogit is a special form of Cox regression, not the GLMM. So you are probably mixing things up.
The conditional logit log-likelihood is (reverse engineering the LaTeX code from the Stata manual): conditional on $\sum_{j=1}^{n_i} y_{ij} = k_{1i}$, $$ {\rm Pr}\Bigl[(y_{i1},\ldots,y_{i{n_i}})\Big|\sum_{j=1}^{n_i} y_{ij} = k_{1i}\Bigr] = \frac{\exp(\sum_{j=1}^{n_i} y_{ij} x_{ij}'\beta)}{\sum_{{\bf d}_i\in S_i}\exp(\sum_{j=1}^{n_i} d_{ij} x_{ij}'\beta)} $$ where $S_i$ is the set of all possible combinations of $n_i$ binary outcomes with $k_{1i}$ ones and the rest zeroes, so each index vector ${\bf d}_i$ has 0/1 components $d_{ij}$ with $\sum_{j=1}^{n_i} d_{ij} = k_{1i}$. That's a pretty weird likelihood to me. Denoting the denominator $f_i(n_i,k_{1i})$, the conditional log-likelihood is $$ \ln L = \sum_{i=1}^n \biggl[ \sum_{j=1}^{n_i} y_{ij} x_{ij}'\beta - \ln f_i(n_i, k_{1i}) \biggr] $$ This likelihood can be computed exactly, although the computational time grows steeply, as $p^2 \sum_{i=1}^n n_i \min(k_{1i}, n_i - k_{1i})$ where $p={\rm dim}\, \beta = {\rm dim}\, x_{ij}$. This is the likelihood that should be identical to stratified Cox regression, which I won't try to entertain here.
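To make the "enumerate all combinations" denominator concrete, here is a minimal sketch of the conditional log-likelihood for a single stratum, computed by brute force with numpy; the function name and the example data are made up for illustration, and this is only practical for small $n_i$:

```python
# Exact conditional-logit log-likelihood for one stratum, by brute-force
# enumeration of the denominator set S_i (illustrative sketch only).
from itertools import combinations
import numpy as np

def clogit_loglik_stratum(y, X, beta):
    """y: 0/1 outcomes (n_i,), X: covariates (n_i, p), beta: (p,).
    Returns ln Pr[y | sum(y) = k_{1i}] under the conditional logit model."""
    n = len(y)
    k = int(np.sum(y))                # k_{1i}: number of ones in the stratum
    eta = X @ beta                    # linear predictors x_ij' beta
    num = np.dot(y, eta)              # sum_j y_ij x_ij' beta
    # Denominator: sum over all 0/1 vectors d_i with exactly k ones
    log_terms = [eta[list(idx)].sum() for idx in combinations(range(n), k)]
    log_denom = np.logaddexp.reduce(log_terms)   # ln f_i(n_i, k_{1i})
    return num - log_denom
```

A quick sanity check: with $\beta = 0$ every configuration in $S_i$ is equally likely, so the stratum probability is $1/\binom{n_i}{k_{1i}}$.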
The mixed model likelihood (again, adapting from the Stata manuals) is based on integrating out the random effects:
$$
{\rm Pr}(y_{i1}, \ldots, y_{i{n_i}} |x_{i1}, \ldots, x_{i{n_i}})=\int_{-\infty}^{+\infty} \frac{\exp(-\nu_i^2/2\sigma_\nu^2)}{\sigma_\nu \sqrt{2\pi}} \prod_{j=1}^{n_i}F(y_{ij}, x_{ij}'\beta + \nu_i)\, d\nu_i
$$
where $F(y,z) = \Bigl\{ 1+\exp\bigl[ (-1)^y z \bigr] \Bigr\}^{-1}$ is a witty way to write down the logistic contribution for the outcome $y=0,1$. This likelihood cannot be computed exactly; in practice it is approximated numerically using a set of Gaussian quadrature points with abscissas $a_m$ and weights $w_m$ that mimic the density of the standard normal on a grid, producing (in the simplest version)
$$
\ln L \approx \sum_{i=1}^n \ln\biggl[ \frac{1}{\sqrt{\pi}} \sum_{m=1}^M w_m \prod_{j=1}^{n_i}F(y_{ij}, x_{ij}'\beta + \sqrt{2} \sigma_\nu a_m) \biggr]
$$
(The $\exp(\nu_i^2)$-like terms disappear thanks to the quadrature formula, but since it is designed for the physicists' erf() function rather than the statisticians' $\Phi()$ function, it works with $\exp(-z^2)$ rather than $\exp(-z^2/2)$; hence the odd $\sqrt{2}$ factors in a couple of places.) Computational time for $\ln L$ itself is proportional to $nM$, but since you need the second-order derivatives for Newton-Raphson, feel free to multiply by $p^2$. Smarter computational schemes, aka adaptive Gaussian quadrature, try to find better location and scale parameters for the quadrature to make the approximation more accurate.
In fact, that latter Stata manual describes the differences between the GLMM (aka random-effects xtlogit, in econometric slang) and conditional logit (aka fixed-effects xtlogit), and might be worth a more serious reading.
You have provided abundant documentation regarding the errors SAS is giving you. Paul Allison's excellent and clearly articulated SAS proceedings paper -- http://www2.sas.com/proceedings/forum2008/360-2008.pdf -- goes into great detail about the reasons for any failures of maximum likelihood estimation. To me it sounds like your strata are linear combinations of Y, the other predictors, or both. Why not do a Proc Freq using the LIST option and look for it that way?
Best Answer
I don't agree that you sampled on the outcome, since you sampled on company and enrollment is your outcome. You may want to treat the company as a random effect and the other features as fixed effects. So I am suggesting yet a third alternative: generalized linear mixed models.
After clarification: If the outcome is company enrollment rather than employee enrollment, then it is an ordinary case-control study for which unconditional logistic regression should be the standard approach. Conditional logistic regression is not necessary unless there were further conditions on the sampling regarding other company features.
Further clarification: If you were using R, then the package to identify and install would be, not surprisingly, "survey" by Thomas Lumley. It provides for the appropriate incorporation of the two-way stratified sampling strategy you have outlined in the design phase, prior to estimation with the svyglm() function. Stata also has a set of survey functions, and I imagine they can be used with the generalized linear modeling functions it provides. SAS didn't have such facilities in the past, so the SUDAAN program was needed as an added (expensive) purchase, but I have a vague memory that this may have changed with its latest releases. (I don't know about SPSS with regard to sampling support for GLM models.)
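As a minimal illustration of the unconditional logistic fit recommended above (ignoring the survey design for simplicity), here is a Newton–Raphson sketch using only numpy; the data are simulated purely for illustration, with 65 "insured" and 65 "uninsured" firms as in the question:

```python
# Unconditional logistic regression fit by Newton-Raphson (IRLS),
# numpy only; data and names are made up for illustration.
import numpy as np

def fit_logit(X, y, iters=25):
    """Maximize the unconditional logistic log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        W = p * (1.0 - p)                     # IRLS weights
        grad = X.T @ (y - p)                  # score vector
        hess = X.T @ (X * W[:, None])         # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# 65 insured and 65 uninsured firms, one hypothetical covariate + intercept
rng = np.random.default_rng(1)
x = rng.normal(size=130)
y = np.r_[np.ones(65), np.zeros(65)]
X = np.column_stack([np.ones(130), x])
beta_hat = fit_logit(X, y)
```

Sampling on the outcome in a case-control design only shifts the intercept of this model; the covariate odds ratios remain consistently estimated, which is why the unconditional analysis is standard here.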