I have a series of physicians' claims submissions. I would like to perform cluster analysis as an exploratory tool to find patterns in how physicians bill, based on variables such as revenue codes, procedure codes, etc. The data are all polytomous, and from my basic understanding a latent class algorithm is appropriate for this kind of data. I am trying my hand at some of R's clustering packages, specifically poLCA and mclust, for this analysis. I'm getting alerts after running a test model on a sample of the data using poLCA.
> library(poLCA)
> # Example data structure - actual test data has 200 rows:
> df <- structure(list(RevCd = c(274L, 320L, 320L, 450L, 450L, 450L,
636L, 636L, 636L, 450L, 450L, 450L, 301L, 305L, 450L, 450L, 352L,
301L, 300L, 636L, 301L, 450L, 636L, 636L, 307L, 450L, 300L, 300L,
301L, 301L), PlaceofSvc = c(23L, 23L, 23L, 23L, 23L, 23L, 23L,
23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L,
23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L), TypOfSvc = c(51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L), FundType = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), ProcCd2 = c(1747L, 656L, 656L, 1375L,
1376L, 1439L, 1623L, 1645L, 1662L, 176L, 1374L, 1376L, 958L,
1032L, 1368L, 1374L, 707L, 960L, 347L, 1662L, 859L, 1375L, 1654L,
1783L, 882L, 1440L, 332L, 332L, 946L, 946L)), .Names = c("RevCd",
"PlaceofSvc", "TypOfSvc", "FundType", "ProcCd2"), row.names = c(1137L,
1138L, 1139L, 1140L, 1141L, 1142L, 1143L, 1144L, 1145L, 1146L,
1147L, 1945L, 1946L, 1947L, 1948L, 1949L, 1950L, 1951L, 1952L,
1953L, 1954L, 1955L, 1956L, 1957L, 1958L, 1959L, 2265L, 2266L,
2267L, 2268L), class = "data.frame")
> clust <- poLCA(cbind(RevCd, PlaceofSvc, TypOfSvc, FundType, ProcCd2)~1, df, nclass = 3)
=========================================================
Fit for 3 latent classes:
=========================================================
number of observations: 200
number of estimated parameters: 7769
residual degrees of freedom: -7569
maximum log-likelihood: -1060.778
AIC(3): 17659.56
BIC(3): 43284.18
G^2(3): 559.9219 (Likelihood ratio/deviance statistic)
X^2(3): 33852.85 (Chi-square goodness of fit)
ALERT: number of parameters estimated ( 7769 ) exceeds number of observations ( 200 )
ALERT: negative degrees of freedom; respecify model
My novice assumption is that I need to run a greater number of iterations before I can get robust results, e.g. "…it is essential to run poLCA multiple times until you can be reasonably certain that you have found the parameter estimates that produce the global maximum likelihood solution" (http://www.sscnet.ucla.edu/polisci/faculty/lewis/pdf/poLCA-JSS-final.pdf). Alternatively, perhaps certain variables, particularly the CPT and revenue codes, have too many unique values, and I need to aggregate these variables into higher-level categories to reduce the number of parameters?
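To gauge that second guess, here is a minimal sketch that counts parameters the way the poLCA paper linked above describes. The helper `lca_n_params` is my own, and the modality counts are taken from the 30-row sample shown earlier, not from the full 200-row test data:

```r
# Rough sanity check: how many parameters would a k-class LCA need,
# given the number of modalities (unique values) of each variable?
# Formula: (k-1) class proportions plus k*(m_j - 1) conditional
# probabilities for each manifest variable j.
lca_n_params <- function(m, k) {
  (k - 1) + k * sum(m - 1)
}

# Modality counts from the 30-row sample above: RevCd has 9 unique
# codes, PlaceofSvc/TypOfSvc/FundType have 1 each, ProcCd2 has 23.
m <- c(RevCd = 9, PlaceofSvc = 1, TypOfSvc = 1, FundType = 1, ProcCd2 = 23)
lca_n_params(m, k = 3)  # 92 parameters, already for this small sample
```

So even on the sample the parameter count grows quickly with the number of unique codes, which seems consistent with the alert.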
When I run the model using the mclust package, which selects the model and number of clusters by BIC, I don't get any such alert.
> library(mclust)
> clustBIC <- mclustBIC(df)
> summary(clustBIC, data = df)
classification table:
1 2
141 59
best BIC values:
VEV,2 VEV,3 EEV,3
-4562.286 -4706.190 -5655.783
If anyone can shed a bit of light on the above alerts, it would be much appreciated. I was also planning on using the script found in the poLCA documentation to run multiple iterations of the model until the log-likelihood is maximized; however, it's computationally intensive and I'm afraid the process will crash before I have a chance to post this. Sorry in advance if I've missed something obvious here; I'm new to cluster analysis.
Best Answer
poLCA and mclust both perform model-based cluster analysis, based on finite mixture models. However, poLCA is designed for latent class analysis (LCA), the name for a particular class of mixture models suitable for categorical (polytomous) data. Conversely, mclust estimates Gaussian mixtures and is therefore suitable for quantitative variables.
You should choose between the two classes of models by analyzing the nature and structure of your variables. Note that with LCA you are treating the variables as qualitative; that is, any information about the ordering of the modalities is ignored.
Regarding poLCA, you have too many unique values in each variable for the model to be identifiable. The number of independent parameters depends on the number of modalities (what you called unique values) of each variable, and it must be lower than the number of distinct configurations of the variables (in your case, the distinct observed 5-tuples of outcomes among the units, which is $\leq 200$). In particular, if $m_a$, $m_b$, $m_c$ are the numbers of modalities in a 3-variable model with $k$ latent classes, then the number of independent parameters is: $$ (k-1)+ k\cdot[(m_a-1)+(m_b-1)+(m_c-1)] $$ So, yes: if you want to use LCA, you need to aggregate the modalities in order to reduce the number of parameters.
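As an illustrative sketch of that aggregation, using a tiny stand-in data frame: the revenue-code grouping below is invented for the example, not an official mapping. Note also that poLCA requires manifest variables coded as consecutive integers 1, 2, …, m, which the `factor` recoding at the end guarantees:

```r
# Tiny stand-in for the claims data frame from the question
df <- data.frame(RevCd = c(274L, 320L, 450L, 636L, 301L))

# Illustrative only: collapse revenue codes into a few coarse families
# (this grouping is made up for the example, not an official mapping)
rev_map <- c("274" = "Supplies", "300" = "Lab", "301" = "Lab",
             "305" = "Lab", "307" = "Lab", "320" = "Radiology",
             "352" = "Radiology", "450" = "ER", "636" = "Pharmacy")
df$RevGrp <- rev_map[as.character(df$RevCd)]
df$RevCd <- NULL  # fit on the coarser grouping instead

# poLCA also expects manifest variables coded 1, 2, ..., m,
# so recode every column to consecutive integers:
df[] <- lapply(df, function(x) as.integer(factor(x)))
```

With a handful of families instead of hundreds of raw codes, the parameter count from the formula above drops accordingly.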
Btw, to run poLCA multiple times you can simply use the nrep option.
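For example (same model formula as in the question; `nrep = 10` restarts EM from 10 random sets of starting values and keeps the highest-likelihood solution, though note this addresses local maxima, not the identifiability problem above):

```r
library(poLCA)

f <- cbind(RevCd, PlaceofSvc, TypOfSvc, FundType, ProcCd2) ~ 1
# 10 random restarts; poLCA reports the best of the 10 runs
clust <- poLCA(f, df, nclass = 3, nrep = 10, verbose = FALSE)
```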