Solved – Mixing variable types in latent class/profile analysis

latent-classlatent-variablestata

Is there a way to use both binary and continuous variables in latent class/profile analysis? (Class being binaries, and profile being continuous, not sure what to call this.)

My lone continuous variable is really important, and making it dichotomous does not make sense theoretically.

(P.S. I'm using Stata 15's new gsem commands.)

Best Answer

Yes - below is an image taken from a LCA presentation by Chuck Huber from Stata. Note inclusion of a covariate.

You may be able to access the entire presentation from this link:

https://www.stata.com/training/webinar_series/latent-class-analysis/

Updated Response

Here is a general classification by indicator (manifest variable) and latent variable. 

Indicator    Latent           Analysis
Variable    Variable          Name
------------------------------------------
continuous  continuous    = (Latent) Factor Analysis
continuous  categorical   = Latent Profile Analysis
categorical continuous    = Latent Trait Analysis (also IRT)
categorical categorical   = Latent Class Analysis

From general structural models we know that both continuous and categorical can be used to predict latent variables, i.e.,

continuous -> latent variable
categorical -> latent variable

The question is whether one can mix indicator variables of different types to form the latent variable, i.e., 

latent variable -> continuous + categorical indicator variables

The answer is yes, and the general framework is sometimes call latent structure analysis or mixture models. Here are introductions.

https://www.statmodel.com/download/2006catcont1MBR.pdf
https://hummedia.manchester.ac.uk/institutes/methods-manchester/docs/lsa.pdf

Examples are shown in Mplus software. For instance, to see confirmatory factor analysis with both categorical and continuous indicators, see example 5.3 in user manual chapter 5. 

https://www.statmodel.com/ugexcerpts.shtml

Related Solutions

Solved – Interpretation of regression coefficients in latent class regression (using poLCA in R)

My understanding, which may be wrong, is this - the parameters reported as the PCV2 coefficients are the log-odds of being in that class (class 2 or class 3) as opposed to class 1, given that the pig has detectable virus (VirusCount = 1).

Hence the relevant Odds ratios (which is what my collaborators will understand) is 𝑒𝑥𝑝(−0.78575) or about 0.455 and 𝑒𝑥𝑝(1.92351)

or about 6.8. I don't understand the intercept terms, and any explanation of these would be welcomed.

I believe your interpretation of the coefficients as log odds is correct (and if you exponentiate them, you get relative risk ratios). The intercepts presented in your output are the multinomial intercepts.

If you don't know what those intercepts are, they estimate the fraction of people in each class. Backing up, for a 3-class latent class model, i.e. no regression component, the multinomial part of the model looks like this:

$$P(C = 1) = \frac{e^{\gamma_1}}{e^{\gamma_1} + e^{\gamma_2} + e^{\gamma_3}}$$ $$= \frac{1}{1 + e^{\gamma_2} + e^{\gamma_3}}$$ You have to set one class's intercept, i.e. the base class, to 1. More generally,

$$P(C = k) = \frac{exp(\gamma_k)}{\sum_{j=1}^K exp(\gamma_j)}$$ (where i indexes respondents)

Had you fit a latent class regression, it would look something like:

$$P(c_i = k|x_i) = \frac{exp(\gamma_{0k} + \gamma_{1k}x_i)}{\sum_{j=1}^K exp(\gamma_{0j} + \gamma_{1j}x_i)}$$

(in your example, x represents having a detectable viral load; in the poLCA example, it represents the strength of party identification)

So, the coefficients on in your output are the values of $\gamma_{1k}$ for each latent class (k = 2 and 3). The intercepts you referred to are just the multinomial intercepts, i.e. the $\gamma_{0k}$. Had you omitted viral load as a covariate, your model would have calculated those intercepts anyway, it's just that I believe poLCA doesn't display them. If you'd fit the model in Stata (my usual software), the header table would have included the multinomial intercepts, as I described on Statalist.

Edit: Expanding my answer to include predicted probabilities

This second answer centers more around R syntax. The code block that @astaines quoted stems from an example in the poLCA manual, which I'm expanding here:

This block fits a latent class regression to a dataset from the 2000 US Presidential election, using the strength of party identification (1-7 discrete variable, here treated as quasi-continuous) as the predictor. require(poLCA) data(election) f.party <- cbind(MORALG,CARESG,KNOWG,LEADG,DISHONG,INTELG, MORALB,CARESB,KNOWB,LEADB,DISHONB,INTELB)~PARTY nes.party <- poLCA(f.party,election,nclass=3,verbose=F)

#next line displays model coefficients
nes.party$coeff

#create a matrix with (what I believe is) the linear predictors in the LCR equation, 
#for party ID = (1, 2, 3 ... 7)
pidmat <- cbind(1,c(1:7))
exb <- exp(pidmat %*% nes.party$coeff)

#Next code should be the predicted probability, P(C = c|party ID)
a <- (cbind(1,exb)/(1+rowSums(exb)))

matplot(c(1:7),a),
        main="Party ID as a predictor of candidate affinity class",
        xlab="Party ID: strong Democratic (1) to strong Republican (7)",
        ylab="Probability of latent class membership",
        ylim=c(0,1),type="l",lwd=3,col=1)
text(5.9,0.35,"Other")
text(5.4,0.7,"Bush affinity")
text(1.8,0.6,"Gore affinity")

I'm not an R expert, but I believe this is what the code is trying to do. If this were Stata, the margins command makes the sausage for you. In R, I suppose you have to make the sausage yourself.

Latent Analysis – When Are Latent Analyses Useful?

Latent variable analyses, such as factor analysis, are useful when we want to analyze a construct that we can't measure directly in a single question, but which we think MIGHT be imperfectly measured by a whole bunch of different questions. They can be especially helpful if we're not even sure that the thing we want to measure even exists, but we need to find that out.

Here's an example. We think that people might suffer from this thing we are calling "depression," but we don't really know how to measure it, or even if it's just one thing - maybe there are a bunch of different states that we CALL depression but which are really distinct constructs. So how do we proceed? Well we can start by coming up with a list of questions that we think MIGHT measure depression:

Do you often feel sad? Do you have little interest in pleasure or doing things? Do you often feel tired or have little energy? Do you think about hurting yourself? Do you have trouble concentrating?

Of course, some of these questions might not actually measure depression (they might measure anxiety or something else). And some might be better measures than others. But that's what we're going to figure out.

Our theory is that there is some underlying construct "depression" that CAUSES people to give the answers to these questions that they do. If that's true then these variables should all correlate with each other, because they're being influenced by the same thing. If one or more of these variables doesn't strongly correlate with the others, then it's probably NOT being influenced by the same thing as the others (which we assume is depression).

So we throw all of these variables into an exploratory factor analysis. The FA tries to find out of there is one or more underlying latent factor related to all of these items and then it tells us how closely correlated each item is to the underlying factor. Let's say we do this and find that the FA finds only one strong factor, and all of the items "load" pretty strongly on that factor, this strongly suggests that the underlying latent variable is actually "depression." If one item didn't load on it then we would know that item is measuring something else, and could kick it out of the analysis.

Furthermore, since the results of the analysis tell us HOW strongly each item is correlated with the latent variable we can use this information to combine the items together into a new variable that measures depression better than any one item in isolation. We have therefore created a single observed measure of what was previously a unobserved construct that we weren't even sure existed.

This is one of the uses of latent variable analysis.

Another use is if you aren't sure of the structure of the latent variable. For example, if you are interested in "political ideology" but aren't sure if it's just a single "left right" scale, or if there are distinct "economic" and "social" dimensions. To figure this out ask a bunch of questions about both economic and social issues, and throw them all into a factor analysis. Is there just one factor or two? Or three?

(caveat: I'm really only talking about factor analysis or things like latent class analysis here. PCA has a somewhat different logic to it and isn't really designed for latent variable analysis per se from a theoretical perspective, even though you can use it for that. But that's another discussion)