I have a series of physicians' claims submissions. I would like to perform cluster analysis as an exploratory tool to find patterns in how physicians bill, based on variables such as revenue codes, procedure codes, etc. The data are all polytomous, and from my basic understanding a latent class algorithm is appropriate for this kind of data. I am trying my hand at some of R's clustering packages, specifically poLCA and mclust, for this analysis. I'm getting alerts after running a test model on a sample of the data using poLCA.
> library(poLCA)
> # Example data structure - actual test data has 200 rows:
> df <- structure(list(RevCd = c(274L, 320L, 320L, 450L, 450L, 450L,
636L, 636L, 636L, 450L, 450L, 450L, 301L, 305L, 450L, 450L, 352L,
301L, 300L, 636L, 301L, 450L, 636L, 636L, 307L, 450L, 300L, 300L,
301L, 301L), PlaceofSvc = c(23L, 23L, 23L, 23L, 23L, 23L, 23L,
23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L,
23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L), TypOfSvc = c(51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L), FundType = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), ProcCd2 = c(1747L, 656L, 656L, 1375L,
1376L, 1439L, 1623L, 1645L, 1662L, 176L, 1374L, 1376L, 958L,
1032L, 1368L, 1374L, 707L, 960L, 347L, 1662L, 859L, 1375L, 1654L,
1783L, 882L, 1440L, 332L, 332L, 946L, 946L)), .Names = c("RevCd",
"PlaceofSvc", "TypOfSvc", "FundType", "ProcCd2"), row.names = c(1137L,
1138L, 1139L, 1140L, 1141L, 1142L, 1143L, 1144L, 1145L, 1146L,
1147L, 1945L, 1946L, 1947L, 1948L, 1949L, 1950L, 1951L, 1952L,
1953L, 1954L, 1955L, 1956L, 1957L, 1958L, 1959L, 2265L, 2266L,
2267L, 2268L), class = "data.frame")
> clust <- poLCA(cbind(RevCd, PlaceofSvc, TypOfSvc, FundType, ProcCd2)~1, df, nclass = 3)
=========================================================
Fit for 3 latent classes:
=========================================================
number of observations: 200
number of estimated parameters: 7769
residual degrees of freedom: -7569
maximum log-likelihood: -1060.778
AIC(3): 17659.56
BIC(3): 43284.18
G^2(3): 559.9219 (Likelihood ratio/deviance statistic)
X^2(3): 33852.85 (Chi-square goodness of fit)
ALERT: number of parameters estimated ( 7769 ) exceeds number of observations ( 200 )
ALERT: negative degrees of freedom; respecify model
My novice assumption is that I need to run a greater number of iterations before I can get robust results, e.g. "…it is essential to run poLCA multiple times until you can be reasonably certain that you have found the parameter estimates that produce the global maximum likelihood solution" (http://www.sscnet.ucla.edu/polisci/faculty/lewis/pdf/poLCA-JSS-final.pdf). Alternatively, perhaps certain variables, particularly the CPT and revenue codes, have too many unique values, and I need to aggregate these variables into higher-level categories to reduce the number of parameters?
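To gauge that second guess, here is a minimal sketch that counts parameters the way the poLCA paper linked above describes. The helper `lca_n_params` is my own, and the modality counts are taken from the 30-row sample shown earlier, not from the full 200-row test data:

```r
# Rough sanity check: how many parameters would a k-class LCA need,
# given the number of modalities (unique values) of each variable?
# Formula: (k-1) class proportions plus k*(m_j - 1) conditional
# probabilities for each manifest variable j.
lca_n_params <- function(m, k) {
  (k - 1) + k * sum(m - 1)
}

# Modality counts from the 30-row sample above: RevCd has 9 unique
# codes, PlaceofSvc/TypOfSvc/FundType have 1 each, ProcCd2 has 23.
m <- c(RevCd = 9, PlaceofSvc = 1, TypOfSvc = 1, FundType = 1, ProcCd2 = 23)
lca_n_params(m, k = 3)  # 92 parameters, already for this small sample
```

So even on the sample the parameter count grows quickly with the number of unique codes, which seems consistent with the alert.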
When I run the model using the mclust package, which selects the model and number of clusters by BIC, I don't get any such alert.
> library(mclust)
> clustBIC <- mclustBIC(df)
> summary(clustBIC, data = df)
classification table:
1 2
141 59
best BIC values:
VEV,2 VEV,3 EEV,3
-4562.286 -4706.190 -5655.783
If anyone can shed a bit of light on the above alerts, it would be much appreciated. I was also planning on using the script found in the poLCA documentation to run multiple iterations of the model until the log-likelihood is maximized; however, it's computationally intensive and I'm afraid the process will crash before I have a chance to post this. Sorry in advance if I've missed something obvious here; I'm new to cluster analysis.
Best Answer
poLCA and mclust both perform model-based cluster analysis, based on finite mixture models. However, poLCA is designed for latent class analysis (LCA), the name for a particular class of mixture models suitable for categorical (polytomous) data. Conversely, mclust estimates Gaussian mixtures and is therefore suitable for quantitative variables.
You should choose between the two classes of models by analyzing the nature and structure of your variables. Note that with LCA you are treating the variables as qualitative; that is, any information about the ordering of the modalities is ignored.
Regarding poLCA, you have too many unique values in each variable for the model to be identifiable. The number of independent parameters depends on the number of modalities (what you called unique values) of each variable, and it must be lower than the number of distinct configurations of the variables (in your case, the distinct observed 5-tuples of outcomes among the units, which is $\leq 200$). In particular, if $m_a$, $m_b$, $m_c$ are the numbers of modalities in a 3-variable model with $k$ latent classes, then the number of independent parameters is: $$ (k-1)+ k\cdot[(m_a-1)+(m_b-1)+(m_c-1)] $$ So, yes: if you want to use LCA, you need to aggregate the modalities in order to reduce the number of parameters.
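As an illustrative sketch of that aggregation, using a tiny stand-in data frame: the revenue-code grouping below is invented for the example, not an official mapping. Note also that poLCA requires manifest variables coded as consecutive integers 1, 2, …, m, which the `factor` recoding at the end guarantees:

```r
# Tiny stand-in for the claims data frame from the question
df <- data.frame(RevCd = c(274L, 320L, 450L, 636L, 301L))

# Illustrative only: collapse revenue codes into a few coarse families
# (this grouping is made up for the example, not an official mapping)
rev_map <- c("274" = "Supplies", "300" = "Lab", "301" = "Lab",
             "305" = "Lab", "307" = "Lab", "320" = "Radiology",
             "352" = "Radiology", "450" = "ER", "636" = "Pharmacy")
df$RevGrp <- rev_map[as.character(df$RevCd)]
df$RevCd <- NULL  # fit on the coarser grouping instead

# poLCA also expects manifest variables coded 1, 2, ..., m,
# so recode every column to consecutive integers:
df[] <- lapply(df, function(x) as.integer(factor(x)))
```

With a handful of families instead of hundreds of raw codes, the parameter count from the formula above drops accordingly.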
Btw, to run poLCA multiple times you can simply use the nrep option.
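For example (same model formula as in the question; `nrep = 10` restarts EM from 10 random sets of starting values and keeps the highest-likelihood solution, though note this addresses local maxima, not the identifiability problem above):

```r
library(poLCA)

f <- cbind(RevCd, PlaceofSvc, TypOfSvc, FundType, ProcCd2) ~ 1
# 10 random restarts; poLCA reports the best of the 10 runs
clust <- poLCA(f, df, nclass = 3, nrep = 10, verbose = FALSE)
```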