Solved – Clustering of variables: but they are mixed type, some are numeric, some are categorical

Tags: categorical data, clustering, continuous data, mixed type data

I have a dataset with 15 variables. Some are numeric and continuous; others are boolean (dichotomous, true/false). There is also one nominal categorical variable.

str(df)
'data.frame': 30 obs. of 15 variables:
    nom : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 1 ... 
    X1  : logi  FALSE TRUE FALSE TRUE TRUE FALSE ...
    X2  : logi  TRUE TRUE TRUE TRUE FALSE FALSE ... 
    X3  : logi  TRUE FALSE FALSE FALSE TRUE FALSE ... 
    X4  : logi  FALSE TRUE FALSE TRUE FALSE FALSE ... 
    X5  : logi  TRUE FALSE FALSE FALSE FALSE TRUE ... 
    X1.1: num  1.026 -0.285 -1.221 0.181 -0.139 ... 
    X2.1: num  -0.045 -0.785 -1.668 -0.38 0.919 ... 
    X3.1: num  1.13 -1.46 0.74 1.91 -1.44 ... 
    X4.1: num  0.298 0.637 -0.484 0.517 0.369 ... 
    X5.1: num  1.997 0.601 -1.251 -0.611 -1.185 ... 
    X6  : num  0.0597 -0.7046 -0.7172 0.8847 -1.0156 ... 
    X7  : num  -0.0886 1.0808 0.6308 -0.1136 -1.5329 ...
    X8  : num  0.134 0.221 1.641 -0.219 0.168 ... 
    X9  : num  0.704 -0.106 -1.259 1.684 0.911 ...
    X10 : android android OS windows7 windows8...
    [...]

I would like to cluster the variables (not the data cases) X1, X2, ..., X9 (probably omitting the nominal X10) into clusters or subsets of correlated variables, for example (X1, X2, X6), (X3, X5), ...

As the variables have mixed types, I think it is impossible to use cor(). It is also impossible to use the Gower similarity coefficient, because it measures similarity between data cases, not between variables.

Can you help me find a way to approach this, please? I would prefer a solution in R.

Best Answer

Traditional factor analysis (FA) and clustering algorithms were designed for use with continuous (i.e., Gaussian) variables. Applied to mixtures of continuous and qualitative variables, they invariably give erroneous results. In particular, and in my experience, the categorical information will dominate the solution.

A better approach would be to employ a variant of finite mixture models (FMMs), which are often intended for use with mixtures of continuous and categorical information. Latent class mixture models (which are FMMs) have a huge literature built up around them. Much of that literature is in the field of marketing science, where these methods see wide use for, e.g., consumer segmentation, but that is not the only field where they are used.
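To make the idea of a finite mixture over mixed indicators concrete, here is a minimal base-R sketch of EM for a two-class model with one Gaussian and one Bernoulli indicator. It assumes the two indicators are independent within each class (the usual latent class assumption); the simulated data, the function name `fit_mixed_lc`, and the starting values are all hypothetical choices for illustration, not anything taken from Latent Gold or another package. Note that it groups data cases, as latent class models do, not variables.

```r
# Simulated (hypothetical) data: two classes, one numeric and one 0/1 indicator.
set.seed(1)
n <- 300
z_true <- rbinom(n, 1, 0.5) + 1            # true class labels, 1 or 2
x <- rnorm(n, mean = c(-2, 2)[z_true])     # continuous indicator
b <- rbinom(n, 1, c(0.1, 0.9)[z_true])     # dichotomous indicator coded 0/1

fit_mixed_lc <- function(x, b, n_iter = 200, tol = 1e-8) {
  # Crude starting values: split the cases at the median of x.
  g  <- x > median(x)
  w  <- c(0.5, 0.5)                        # class weights
  mu <- c(mean(x[!g]), mean(x[g]))         # class means of x
  s  <- rep(sd(x), 2)                      # class standard deviations of x
  p  <- c(0.4, 0.6)                        # class probabilities that b == 1
  ll_old <- -Inf
  for (i in seq_len(n_iter)) {
    # E-step: weighted joint density of (x, b) under each class (n x 2 matrix).
    dens <- sapply(1:2, function(k)
      w[k] * dnorm(x, mu[k], s[k]) * dbinom(b, 1, p[k]))
    ll   <- sum(log(rowSums(dens)))        # log-likelihood at current parameters
    post <- dens / rowSums(dens)           # posterior class memberships
    # M-step: posterior-weighted parameter updates.
    nk <- colSums(post)
    w  <- nk / length(x)
    mu <- colSums(post * x) / nk
    s  <- sqrt(colSums(post * outer(x, mu, "-")^2) / nk)
    p  <- colSums(post * b) / nk
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
  }
  list(weights = w, mean = mu, sd = s, prob = p,
       class = max.col(post, ties.method = "first"))
}

fit <- fit_mixed_lc(x, b)
table(estimated = fit$class, truth = z_true)
```

Because each variable gets a likelihood suited to its type, neither type has to be forced onto the other's scale. A real analysis would need more indicators, several random starts, and a criterion such as BIC to choose the number of classes.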

The software I know and recommend for latent class modeling is neither free nor R-based but, as proprietary software goes, it's not that expensive. It's called Latent Gold, is sold by Statistical Innovations, and costs about $1,000 for a perpetual license. If your project has a budget, it could easily be expensed. LG offers a wide suite of tools including FA for mixtures, clustering of mixtures, longitudinal Markov chain-based clustering, and more.

Otherwise, the only R-based freeware I know about (poLCA, https://www.jstatsoft.org/article/view/v042i10) is intended for use with multi-way contingency tables. I'm not aware that this tool can accept anything other than categorical information. There may be others. If you poke around, maybe you can find some alternatives.
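For the original goal of grouping variables rather than cases, one simple base-R possibility is sketched below. It is not a latent class model, and the caveat above about mixing variable types still applies. It assumes it is acceptable to code TRUE/FALSE as 1/0, in which case Pearson correlation reduces to the point-biserial coefficient (numeric vs. binary) or the phi coefficient (binary vs. binary). The data frame is a simulated stand-in with hypothetical values, and a nominal variable with more than two levels is not handled.

```r
# Simulated stand-in for the data frame in the question (hypothetical values).
set.seed(42)
n  <- 200
f1 <- rnorm(n)                             # hidden source of correlation 1
f2 <- rnorm(n)                             # hidden source of correlation 2
df <- data.frame(
  X1 = f1 + rnorm(n, sd = 0.3),            # numeric, driven by f1
  X2 = f1 + rnorm(n, sd = 0.3) > 0,        # logical, driven by f1
  X3 = f2 + rnorm(n, sd = 0.3),            # numeric, driven by f2
  X4 = f2 + rnorm(n, sd = 0.3) > 0         # logical, driven by f2
)

# Coerce logical columns to 0/1; numeric columns pass through unchanged.
num <- data.frame(lapply(df, as.numeric))

# Correlation matrix between variables (point-biserial / phi for 0/1 columns).
r <- cor(num)

# Turn correlation into a dissimilarity between variables and cluster them.
d      <- as.dist(1 - abs(r))
hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)                # k = 2 is an assumption for this toy data
print(groups)
```

Using `1 - abs(r)` treats strongly negatively correlated variables as close; use `1 - r` instead if the sign should matter. Calling `plot(hc)` shows the dendrogram, which can help in choosing the number of groups.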
