Solved – LCA number of parameters & degrees of freedom

I have a series of physicians' claims submissions. I would like to perform cluster analysis as an exploratory tool to find patterns in how physicians bill based on things like Revenue Codes, Procedure Codes, etc. The data are all polytomous, and from my basic understanding, a latent class algorithm is appropriate for this kind of data. I am trying my hand at some of R's cluster packages, & specifically poLCA & mclust for this analysis. I'm getting alerts after running a test model on a sample of the data using poLCA.

> library(poLCA)
> # Example data structure - actual test data has 200 rows:
> df <- structure(list(RevCd = c(274L, 320L, 320L, 450L, 450L, 450L, 
636L, 636L, 636L, 450L, 450L, 450L, 301L, 305L, 450L, 450L, 352L, 
301L, 300L, 636L, 301L, 450L, 636L, 636L, 307L, 450L, 300L, 300L, 
301L, 301L), PlaceofSvc = c(23L, 23L, 23L, 23L, 23L, 23L, 23L, 
23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 
23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L), TypOfSvc = c(51L, 
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 
51L, 51L, 51L), FundType = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L), ProcCd2 = c(1747L, 656L, 656L, 1375L, 
1376L, 1439L, 1623L, 1645L, 1662L, 176L, 1374L, 1376L, 958L, 
1032L, 1368L, 1374L, 707L, 960L, 347L, 1662L, 859L, 1375L, 1654L, 
1783L, 882L, 1440L, 332L, 332L, 946L, 946L)), .Names = c("RevCd", 
"PlaceofSvc", "TypOfSvc", "FundType", "ProcCd2"), row.names = c(1137L, 
1138L, 1139L, 1140L, 1141L, 1142L, 1143L, 1144L, 1145L, 1146L, 
1147L, 1945L, 1946L, 1947L, 1948L, 1949L, 1950L, 1951L, 1952L, 
1953L, 1954L, 1955L, 1956L, 1957L, 1958L, 1959L, 2265L, 2266L, 
2267L, 2268L), class = "data.frame")

> clust <- poLCA(cbind(RevCd, PlaceofSvc, TypOfSvc, FundType, ProcCd2)~1, df, nclass = 3)

========================================================= 
Fit for 3 latent classes: 
========================================================= 
number of observations: 200 
number of estimated parameters: 7769 
residual degrees of freedom: -7569 
maximum log-likelihood: -1060.778 

AIC(3): 17659.56
BIC(3): 43284.18
G^2(3): 559.9219 (Likelihood ratio/deviance statistic) 
X^2(3): 33852.85 (Chi-square goodness of fit) 

ALERT: number of parameters estimated ( 7769 ) exceeds number of observations ( 200 ) 

ALERT: negative degrees of freedom; respecify model 

My novice assumption is that I need to run more iterations before I get robust results? E.g. "…it is essential to run poLCA multiple times until you can be reasonably certain that you have found the parameter estimates that produce the global maximum likelihood solution." (http://www.sscnet.ucla.edu/polisci/faculty/lewis/pdf/poLCA-JSS-final.pdf). Alternatively, perhaps certain variables, particularly CPT & Revenue Codes, have too many unique values, and I need to aggregate them into higher-level categories to reduce the number of parameters?

When I run the model using package mclust, which optimizes the model based on BIC, I don't get any such alert.

> library(mclust)
> clustBIC <- mclustBIC(df)
> summary(clustBIC, data = df)

classification table:
      1   2 
     141  59 

 best BIC values:
        VEV,2     VEV,3     EEV,3 
      -4562.286 -4706.190 -5655.783

If anyone can shed a bit of light on the above alerts, it would be much appreciated. I was also planning on using the script found in the poLCA documentation to run multiple iterations of the model until the log-likelihood is maximized. However it's computationally intensive and I'm afraid the process will crash before I have a chance to post this. Sorry in advance if I've missed something obvious here; I'm new to cluster analysis.

Best Answer

poLCA and mclust both perform model-based cluster analysis, based on finite mixture models. However, poLCA is designed for Latent Class Analysis (LCA), which is the name for a particular class of mixture models suitable for categorical (polytomous) data. Conversely, mclust estimates Gaussian mixtures, so it is suitable for quantitative variables.

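You should choose between the two classes of models by analyzing the nature and structure of your variables. Note that with LCA you are treating the variables as qualitative, that is, any information about the ordering of the modalities is ignored. mclust raised no alert only because it treats your codes as numeric measurements, which is hard to justify for nominal billing codes.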

As for poLCA, you have too many unique values in your variables for the model to be identifiable. The number of independent parameters depends on the number of modalities (what you called unique values) of each variable, and it must be lower than the number of distinct configurations of the variables (in your case, the distinct observed 5-tuples of outcomes among the units, of which there are $\leq 200$). In particular, if $m_a$, $m_b$, $m_c$ are the numbers of modalities in a three-variable model with $k$ latent classes, then the number of independent parameters is: $$ (k-1)+ k\cdot[(m_a-1)+(m_b-1)+(m_c-1)] $$ So, yes: if you want to use LCA, you need to aggregate the modalities in order to reduce the number of parameters.
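
To make the count concrete, here is a small base R sketch that extends the formula above to any number of variables, $(k-1) + k\sum_j (m_j-1)$. Your reported 7769 parameters fit this pattern for $k = 3$, since $7769 = 2 + 3 \cdot 2589$. The helper name `lca_n_params` and the toy data are hypothetical. The `by = "max"` option reflects my understanding that poLCA expects each variable to be coded 1, 2, ..., $m_j$ and takes the largest code as the number of outcomes, which would inflate the count when raw billing codes are passed in; treat that as an assumption and check it against the poLCA documentation.

```r
# Hypothetical helper: number of independent parameters in a basic LCA
# (no covariates) with k classes: (k - 1) + k * sum(m_j - 1).
lca_n_params <- function(data, k, by = c("unique", "max")) {
  by <- match.arg(by)
  m <- if (by == "unique") {
    # m_j = number of distinct values actually observed
    vapply(data, function(x) as.numeric(length(unique(x))), numeric(1))
  } else {
    # Assumption: the software takes the largest integer code as m_j
    vapply(data, function(x) as.numeric(max(x)), numeric(1))
  }
  (k - 1) + k * sum(m - 1)
}

# Toy data (made up): three variables with 3, 2 and 4 modalities.
toy <- data.frame(a = c(1L, 2L, 3L, 1L),
                  b = c(1L, 2L, 1L, 2L),
                  c = c(1L, 2L, 3L, 4L))

n_par <- lca_n_params(toy, k = 3)   # 2 + 3 * (2 + 1 + 3) = 20
n_cfg <- nrow(unique(toy))          # distinct observed configurations
n_par < n_cfg                       # FALSE here: too many parameters

# A single raw code column: two distinct values, but a largest code of 320.
raw <- data.frame(RevCd = c(274L, 320L))
lca_n_params(raw, k = 3, by = "unique")  # 2 + 3 * 1
lca_n_params(raw, k = 3, by = "max")     # 2 + 3 * 319
```

Running the helper on your full data with both options should show how much of the parameter count comes from the number of distinct codes and how much from the coding itself.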

Btw, to run poLCA multiple times, you can simply use the nrep option.
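
A minimal sketch of both suggestions together, assuming you pool rare codes into a catch-all level before fitting. The helper `lump_rare`, the threshold `min_n`, and the toy data are hypothetical choices for illustration; grouping codes by their clinical meaning would usually be preferable to a pure frequency cutoff. The poLCA call uses the package's `nclass`, `nrep`, `maxiter` and `verbose` arguments and only runs if poLCA is installed.

```r
# Hypothetical helper: keep codes seen at least min_n times, pool the rest,
# and return consecutive integer codes 1..m.
lump_rare <- function(x, min_n = 5, other = "other") {
  x <- as.character(x)
  counts <- table(x)
  keep <- names(counts)[counts >= min_n]
  x[!(x %in% keep)] <- other
  as.integer(factor(x))
}

set.seed(1)
# Made-up data: one variable with many rare codes, one with few levels.
toy <- data.frame(
  ProcCd = sample(c(rep(101:103, each = 30), 200:229), 120),
  RevCd  = sample(c(300L, 450L, 636L), 120, replace = TRUE)
)

rec <- data.frame(lapply(toy, lump_rare))
n_before <- sapply(toy, function(x) length(unique(x)))  # modalities before
n_after  <- sapply(rec, max)                            # modalities after

if (requireNamespace("poLCA", quietly = TRUE)) {
  # nrep refits the model from several random starts and keeps the
  # solution with the highest log-likelihood.
  fit <- poLCA::poLCA(cbind(ProcCd, RevCd) ~ 1, data = rec,
                      nclass = 2, nrep = 10, maxiter = 3000, verbose = FALSE)
  fit$npar  # number of estimated parameters
}
```

With fewer modalities per variable, the same number of classes needs far fewer parameters, so repeated starts via `nrep` also become much cheaper to run.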