How to convert molecular categorical variables to dummy variables for cluster analysis

Tags: categorical-data, clustering, genetics, model-based-clustering

I would like to use a clustering method, e.g. 'mclust', in R to classify each individual in my dataset into k groups. I have 7 continuous and 3 categorical variables. These and other hierarchical clustering methods do not allow the use of categorical variables. From searching Google and this site, it appears that converting the categorical variables to dummy variables is an option.

Two of my categorical variables are SNPs (molecular markers). They are coded 0 (fixed allele), 1 (heterozygous alleles), 2 (fixed allele). Do you have any suggestions on how to convert these to dummy variables, or am I wrong in thinking that the dummy approach is appropriate?

Best Answer

Yes, you can use dummy coding: how SNP data are represented for statistical analysis depends on the method used and the underlying genetic model. When using PCA to uncover population substructure, or GWAS to model the association between SNPs and one or several phenotypes, each SNP is usually treated as a single integer-coded variable: under the "allelic dosage" (additive) model, it takes values in {0,1,2}, counting the copies of the minor allele; under a dominant or recessive model, the heterozygous genotype is merged with one of the two homozygous genotypes, yielding a 0/1 variable; and so on. If you want to use multiple correspondence analysis, or any other method that expects discrete variables, it makes sense to use dummy-coded variables.
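To make these codings concrete, here is a minimal sketch in Python (used only for illustration; the function names are hypothetical and do not come from any genetics package). It assumes each genotype is stored as the number of copies of the minor allele (0, 1 or 2), which is one reading of the 0/1/2 coding in the question.

```python
# Assumption: g is the number of copies of the minor allele (0, 1 or 2).

def additive(g):
    """Allelic dosage: keep the copy count as a single integer variable."""
    return g

def dominant(g):
    """1 if the individual carries at least one copy of the minor allele."""
    return 1 if g >= 1 else 0

def recessive(g):
    """1 only if the individual carries two copies of the minor allele."""
    return 1 if g == 2 else 0

def dummy(g, drop_reference=False):
    """One 0/1 indicator per genotype; optionally drop genotype 0 as reference."""
    indicators = [int(g == level) for level in (0, 1, 2)]
    return indicators[1:] if drop_reference else indicators

# Hypothetical genotypes for five individuals at one SNP.
genotypes = [0, 1, 2, 1, 0]

additive_coded = [additive(g) for g in genotypes]
dominant_coded = [dominant(g) for g in genotypes]
recessive_coded = [recessive(g) for g in genotypes]
dummy_coded = [dummy(g) for g in genotypes]
```

Dropping the reference level leaves two columns per SNP instead of three, which removes a redundant column, since the three indicators always sum to one.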

I am aware of two cases where different approaches were used. Waaijenborg and Zwinderman (1) used optimal scaling to transform each SNP into one continuous variable as input to a penalized canonical correlation analysis framework. This makes it possible to consider the three genotypes (AA, AB, BB) under four different genetic models of inheritance (additive, dominant, recessive, or constant). Wolf et al. (2) used dummy-coded SNPs as input to Logic Forest: for each SNP, the first dummy variable takes the value one if the individual carries at least one copy of the minor allele (dominant effect), while the second takes the value one if the individual is homozygous for the minor allele (recessive effect). With the latter approach, you can choose whatever coding you think best represents the underlying genetic models you want to consider.
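The two-dummy coding described for Wolf et al. can be sketched as follows. This is only an illustration of the coding rule as stated above, not the interface of the Logic Forest software; the function name is hypothetical, and the genotype is again assumed to be the minor-allele count.

```python
def dominant_recessive_dummies(g):
    """Return (dominant_dummy, recessive_dummy) for a minor-allele count g.

    dominant_dummy  is 1 when g >= 1 (at least one copy of the minor allele).
    recessive_dummy is 1 when g == 2 (homozygous for the minor allele).
    """
    if g not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    return (int(g >= 1), int(g == 2))

# Map each possible genotype to its pair of dummy variables.
coding = {g: dominant_recessive_dummies(g) for g in (0, 1, 2)}
print(coding)  # {0: (0, 0), 1: (1, 0), 2: (1, 1)}
```

The three genotypes map to three distinct pairs, so no genotype information is lost, while each dummy on its own corresponds to one genetic model.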

References

  1. Waaijenborg, S. and Zwinderman, A.H. (2009). Correlating multiple SNPs and multiple disease phenotypes: penalized non-linear canonical correlation analysis. Bioinformatics 25(21): 2764-2771.
  2. Wolf, B.J., Hill, E.G., and Slate, E.H. (2010). Logic Forest: an ensemble classifier for discovering logical combinations of binary markers. Bioinformatics 26(17): 2183-2189.