Solved – How do you implement latent class analysis with distal outcomes in R?

Tags: latent-class, r

I am looking to fit a fairly straightforward latent class analysis (LCA) model to derive phenotypes / clusters of a disease (in R). My dataset contains the manifest variables used to derive the clusters (as in any other LCA model), which are categorical.

I have found packages that can derive the classes and can also incorporate covariates in a regression to see whether they are related to the classes (e.g., poLCA).

My problem is that I haven't found a way to incorporate distal outcomes in any R package that deals with LCA. Does anyone have experience implementing distal outcomes in an existing R package, or know a way to work around this?

Best Answer

I don't know of many statistical software packages in general that implement latent class analysis with distal outcomes.

What are distal outcomes? (Compare to latent class regression)

Just to clarify for readers: in latent class analysis, we're saying that we have a number of indicators $X_j$. We assume there's a latent, unordered categorical variable that causes responses to those indicators, and that it has $k$ classes. We iteratively increase $k$, and for each value of $k$, we estimate the means of all the $X_j$s. Formally, for the usual application of latent class analysis with binary indicators, we'd estimate: \begin{align} P(X_j = 1 | C = k) &= {\rm logit}^{-1}(\alpha_{jk}) \\[5pt] P(C = k) &= \frac{\exp(\gamma_k)}{\sum^C_{j=1}\exp(\gamma_j)} \end{align} We fix $\gamma_1 = 0$ because you need a base class.
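To make the notation concrete, here's a minimal numeric sketch (Python, with made-up parameter values — none of these come from a real model): the class prevalences are a softmax over the $\gamma$s with $\gamma_1 = 0$, and each item-response probability is an inverse logit of $\alpha_{jk}$.

```python
import math

def softmax(gammas):
    """Class-membership probabilities P(C = k) from logits, gamma_1 = 0 (base class)."""
    exps = [math.exp(g) for g in gammas]
    total = sum(exps)
    return [e / total for e in exps]

def inv_logit(a):
    """Item-response probability P(X_j = 1 | C = k) = logit^{-1}(alpha_jk)."""
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical parameters for a 3-class model
gammas = [0.0, 0.5, -1.0]       # gamma_1 fixed at 0 for the base class
class_probs = softmax(gammas)   # prevalences, sum to 1
print(class_probs)
print(inv_logit(2.0))           # endorsement probability for one item in one class
```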

In latent class regression, we're asking how does the level of some covariate $Z$ change the probability of being in each latent class:

$$P(C = k | Z) = \frac{\exp(\gamma_k + \beta_k Z)}{\sum^C_{j=1}\exp(\gamma_j + \beta_j Z)}$$

with $\gamma_1 = \beta_1 = 0$ for the base class. (Each class needs its own slope on $Z$; a single shared coefficient would cancel out of the ratio.)

However, sometimes we aren't interested in that. We may instead be interested in the level of $Z$ given class membership $C$, e.g.

$$P(Z = z | C) = ???$$

Estimating that association isn't a trivial problem, because the latent class model doesn't estimate that directly.

(Note, the above notation generalizes to situations with continuous $X$s and $Z$s; I used the probability notation for simplicity, and because LCA started with binary indicators.)

Did you read the paragraph above and wonder about Bayes' Theorem?

So did Stephanie Lanza and colleagues (2013, from Penn State University, link goes to a free PubMed article).

$$P(Z = z | C = c) = \frac{P(Z = z) P(C = c|Z = z)}{P(C = c)}$$

We can get $P(C = k|Z = z)$ from a latent class regression, which poLCA can do (as can a number of other software packages). So, this is easy to do by hand if $Z$ is binary. The paper also states how to generalize this to categorical or count outcomes.
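For a binary $Z$, the by-hand calculation is just arithmetic. A sketch in Python, with made-up probabilities standing in for the latent class regression output (none of these numbers come from a real model):

```python
# Hypothetical quantities for a binary distal outcome Z and 3 classes:
# P(Z = 1) from the sample, P(C = k | Z = z) from a latent class regression.
p_z1 = 0.30
p_c_given_z1 = [0.50, 0.30, 0.20]   # P(C = k | Z = 1)
p_c_given_z0 = [0.20, 0.40, 0.40]   # P(C = k | Z = 0)

# Marginal class prevalences: P(C = k) = sum_z P(C = k | Z = z) P(Z = z)
p_c = [p1 * p_z1 + p0 * (1 - p_z1)
       for p1, p0 in zip(p_c_given_z1, p_c_given_z0)]

# Bayes' theorem: P(Z = 1 | C = k) = P(Z = 1) P(C = k | Z = 1) / P(C = k)
p_z1_given_c = [p_z1 * p1 / pk for p1, pk in zip(p_c_given_z1, p_c)]
print(p_z1_given_c)
```

Sanity check: averaging $P(Z = 1 \mid C = k)$ over the class prevalences must recover $P(Z = 1)$, by the law of total probability.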

The catch is that if $Z$ is continuous, you need some assumption about the functional form of its density $f(Z)$. The authors wrote add-on packages for SAS (PROC LCA) and Stata that use kernel density estimation for $f(Z)$. I'm not sure how to implement that in R, and I'm not sure which (if any) latent class packages implement this.
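That said, the ingredient itself is standard: a kernel density estimate of $Z$ in which each observation is weighted by its posterior probability of class membership. A hand-rolled sketch (Python/NumPy; the outcome and posteriors are simulated purely for illustration — in practice you'd take the posteriors from your fitted model):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated continuous distal outcome; fake posterior probabilities for one
# class (in practice, take these from your fitted model's posterior matrix).
z = rng.normal(size=500)
post_k = 1.0 / (1.0 + np.exp(-z))

w = post_k / post_k.sum()                 # normalized posterior weights
h = 1.06 * z.std() * len(z) ** (-1 / 5)   # rule-of-thumb bandwidth

def f_z_given_k(grid):
    """Posterior-weighted Gaussian KDE of the distal outcome within class k."""
    u = (grid[:, None] - z[None, :]) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return (kernel * w[None, :]).sum(axis=1) / h

grid = np.linspace(-4.0, 4.0, 201)
density = f_z_given_k(grid)
```

(In R, `stats::density()` accepts a `weights` argument, so the same idea could be assembled there.)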

Other methods to estimate the relationship to a distal outcome

Frequently, you will see people assign each observation to the class it most likely belongs to (i.e. its modal class) and then analyze the distal outcome by assigned class. Strictly speaking, this is biased, because it ignores our uncertainty about which class each observation belongs to. That said, if your latent classes are well separated and the model is fairly certain about each observation's class (e.g., an entropy greater than about 0.8), this approach probably isn't too far wrong.
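The entropy in question is the relative entropy statistic for the latent class solution: 1 when every observation's posterior probabilities are essentially 0/1, and 0 when they are uniform. A quick sketch of the computation (Python, with toy posterior matrices):

```python
import math

def relative_entropy(posteriors):
    """Relative entropy of a latent class solution:
    1 = perfectly separated classes, 0 = completely uncertain."""
    n = len(posteriors)
    k = len(posteriors[0])
    total = sum(-p * math.log(p) for row in posteriors for p in row if p > 0)
    return 1.0 - total / (n * math.log(k))

# Well-separated solution: posteriors near 0/1, relative entropy near 1
crisp = [[0.97, 0.02, 0.01], [0.01, 0.98, 0.01], [0.02, 0.01, 0.97]]
# Poorly separated solution: posteriors near uniform, relative entropy near 0
fuzzy = [[0.35, 0.33, 0.32]] * 3

print(relative_entropy(crisp))
print(relative_entropy(fuzzy))
```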

Alternatively, your software estimated the probability that each observation belongs to each class (the posterior probabilities). You could use this vector of probabilities to do something like multiple imputation: repeatedly draw each observation's class from its posterior distribution, calculate $P(Z = z | C)$ or $E(Z | C)$ in each pseudo-complete dataset, and pool the results using Rubin's rules. At minimum, you want to make sure you have enough quasi-imputations to cover the number of classes you estimated, accounting for their prevalences as well. What if you have 6 latent classes? What if one is very small (e.g., its estimated prevalence is 5%)? Lanza et al. suggest a minimum of 20 imputations, but in some contexts, I have wondered if you would need more.
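On toy data (two classes, hypothetical posterior probabilities), the quasi-imputation idea looks roughly like this: draw each observation's class from its posterior, estimate $E(Z | C)$ in each pseudo-complete dataset, and pool the point estimates. (Full Rubin's rules would also combine within- and between-imputation variances; this sketch pools only the point estimate.)

```python
import random
import statistics

random.seed(1)

# Toy distal outcome and hypothetical posterior probabilities (2 classes);
# in practice these come from your fitted latent class model.
z = [1.0, 1.2, 0.8, 3.1, 2.9, 3.0]
posteriors = [[0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
              [0.10, 0.90], [0.15, 0.85], [0.05, 0.95]]

M = 20  # number of quasi-imputations (Lanza et al.'s suggested minimum)
estimates = []
for _ in range(M):
    # Draw each observation's class from its posterior distribution
    classes = [random.choices([0, 1], weights=p)[0] for p in posteriors]
    in_class1 = [zi for zi, c in zip(z, classes) if c == 0]
    if in_class1:  # guard against a class being empty in a given draw
        estimates.append(statistics.mean(in_class1))

# Rubin's rules: the pooled point estimate is the mean across imputations
pooled = statistics.mean(estimates)
print(pooled)
```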

Lanza et al. seem to argue that their approach is optimal, and it does make sense at face value.

Mplus implements the multiple imputation-like framework (no personal experience, but Lanza et al. state this). I believe Latent GOLD (commercial software) may implement quasi-MI as well. Stata doesn't implement quasi-MI, but the Bayes' theorem framework is implemented via the Penn State University plugin, albeit with less functionality than the latent class/profile models available through Stata 15's gsem command. The R packages poLCA and flexmix don't implement the quasi-MI framework either, to my knowledge. It seems like you could compute the quasi-MI estimates manually in R or Stata, though.

Use $P(C = k)^{-1}$ as inverse probability of treatment weights??

I've wondered if you can just take the predicted probabilities of each class membership, and use those like you would inverse probability of treatment weights in tabulating frequencies or calculating means. That is, you calculate the probability of Z or mean of Z, weighting by the inverse of the probability of class membership, then you repeat for each latent class. This is parallel to IPTW estimation in the causal inference framework and to inverse probability weighting in complex survey designs.
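As a sketch of how the weighting idea could be operationalized (toy data; I'm using each observation's posterior probability of membership in class $k$ as its weight, which seems like the natural analog of probability weighting here):

```python
# Toy distal outcome and hypothetical posterior class probabilities (2 classes)
z = [1.0, 1.2, 0.8, 3.1, 2.9, 3.0]
posteriors = [[0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
              [0.10, 0.90], [0.15, 0.85], [0.05, 0.95]]

def weighted_mean_by_class(z, posteriors, k):
    """E(Z | C = k), weighting each observation by its posterior
    probability of belonging to class k."""
    w = [p[k] for p in posteriors]
    return sum(wi * zi for wi, zi in zip(w, z)) / sum(w)

mean_class1 = weighted_mean_by_class(z, posteriors, 0)
mean_class2 = weighted_mean_by_class(z, posteriors, 1)
print(mean_class1, mean_class2)
```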

To my knowledge, nobody has suggested this approach. I assume there's something I've missed, but it seems like this should be asymptotically equivalent to the quasi-MI method. Does anyone have thoughts on this?