Solved – Discrete data & alternatives to PCA

correspondence-analysis, discrete data, mixed type data, pca

I have a dataset of discrete (ordinal, meristic, and nominal) variables describing morphological wing characters on several closely related species of insects. I would like to run an analysis that gives me a visual representation of how similar the species are, based on these morphological characteristics. The first thing that came to mind was PCA (this is the type of visualization I'm looking to create), but after looking into it (particularly other questions such as: Can principal component analysis be applied to datasets containing a mix of continuous and categorical variables?), it seems PCA may be inappropriate for discrete data (PCA is used in these types of studies in the literature, but always with continuous data). Setting aside the statistical reasons why this data is unsuitable, PCA does give me close to ideal results with regard to my biological question (hybrid groups of interest fall right in the middle of their parental groups).

I've also tried multiple correspondence analysis to satisfy the statistical requirements (at least as far as my understanding goes), but I cannot seem to get a plot analogous to the one I would get with PCA, where my observations (the biological individuals) are separated, say, by color to show the different groupings (different species, biologically speaking). It seems that this analysis is aimed at describing how the variables (here, my morphological characteristics) are related to each other, not the individual observations. And when I plot observations colored by group, I only get a single value (perhaps an average) describing the whole set of individuals. I've done the analysis in R, so perhaps I'm also just not R-savvy enough to get the plot I have in mind to work.

Am I correct in trying this kind of analysis with my data, or am I way off track? If you could not tell, my statistical expertise is limited, so the equations underneath these analyses are all way over my head. I'm trying to conduct this analysis purely descriptively (I don't need to do any more downstream number crunching), and I've read that if this is the case, PCA will suffice, but I want to make sure I'm not violating too many statistical assumptions.

Best Answer

It depends a little bit on your purpose, but if you're after a visualization tool, there's a trick: apply multidimensional scaling to the proximity matrix produced by a random forest. This can produce pretty pictures and will work for a mixture of categorical and continuous data. Here you would classify the species according to your predictors. But, and it's a big caveat, I don't know if anyone really knows what the output of these visualizations means.

Another alternative might be to apply multidimensional scaling to something like the Gower similarity.
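As a rough illustration of the Gower idea, here is a minimal sketch in plain Python (you would normally use a library routine in R for this). It computes a Gower-style distance (one minus the similarity) and then runs classical (metric) MDS on it. Everything specific here is an assumption on my part: the toy wing records, the column types, the function names, and the simplification of treating ordinal and meristic variables as range-scaled numbers. The MDS step uses power iteration, which is fine for small examples but is not how a real library would do it.

```python
import math

def gower_distance_matrix(rows, kinds):
    """Pairwise Gower-style distances for records with mixed variable types.

    rows  : list of equal-length tuples, one per individual
    kinds : per-column type: "num" (ordinal/meristic, scaled by the column
            range; a simplification) or "cat" (nominal, match = 0, mismatch = 1)
    """
    n, p = len(rows), len(kinds)
    ranges = []
    for j, kind in enumerate(kinds):
        if kind == "num":
            col = [r[j] for r in rows]
            ranges.append(max(col) - min(col))
        else:
            ranges.append(None)
    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            total = 0.0
            for j, kind in enumerate(kinds):
                if kind == "num":
                    if ranges[j] > 0:  # constant columns contribute 0
                        total += abs(rows[a][j] - rows[b][j]) / ranges[j]
                elif rows[a][j] != rows[b][j]:
                    total += 1.0
            dist[a][b] = dist[b][a] = total / p
    return dist

def classical_mds(dist, dims=2, iters=500):
    """Classical MDS: double-centre the squared distances, then take the
    leading eigenvectors by power iteration with deflation."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Symmetric input, so column means equal row means.
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [math.sin(i + 1.0 + k) for i in range(n)]  # deterministic start
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam <= 1e-9:  # no further positive eigenvalue to use
            break
        for i in range(n):
            coords[i][k] = v[i] * math.sqrt(lam)
        for i in range(n):  # deflate so the next pass finds the next axis
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return coords

# Hypothetical records: (vein count, spot score 0-3, wing pattern)
wings = [(5, 0, "clear"), (5, 1, "clear"), (9, 3, "banded"),
         (8, 3, "banded"), (7, 2, "clear"), (7, 1, "banded")]
labels = ["A", "A", "B", "B", "hybrid", "hybrid"]
kinds = ["num", "num", "cat"]

for label, (x, y) in zip(labels, classical_mds(gower_distance_matrix(wings, kinds))):
    print(f"{label:7s} {x: .3f} {y: .3f}")
```

Each individual gets its own pair of coordinates, so you can scatter-plot them and color by species, which is the PCA-like picture the question asks for. Gower distances are not guaranteed to be Euclidean, so some information may sit in negative eigenvalues that this sketch simply ignores.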

There's a hanging question - what's your ultimate purpose? What question do you want to answer? I like these techniques as exploratory tools to perhaps lead you to asking more and better questions, but I'm not sure what they explain or tell you by themselves.

Maybe I'm reading too much into your question, but if you want to explore which predictor variables have values for the hybrids sitting between those of the two pure species, you might be better off building a model that directly estimates the values of the predictor variables associated with each species and with the hybrids. If you want to measure how the variables are related to each other, perhaps build a correlation matrix; there are many neat visualizations for this.
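For the correlation-matrix suggestion, here is a minimal sketch, again in plain Python and with made-up data. I am assuming Spearman rank correlation, since it is a common choice for ordinal and meristic variables; it does not make sense for nominal variables, which would need a different association measure. The function and column names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def spearman_matrix(columns):
    """Spearman correlation for every pair of named columns."""
    ranked = {name: average_ranks(vals) for name, vals in columns.items()}
    names = list(columns)
    return {a: {b: pearson(ranked[a], ranked[b]) for b in names} for a in names}

# Hypothetical ordinal/meristic wing characters, one value per individual
characters = {
    "vein_count": [5, 5, 9, 8, 7, 7],
    "spot_score": [0, 1, 3, 3, 2, 1],
    "bristle_rows": [4, 4, 2, 2, 3, 3],
}

for name, row in spearman_matrix(characters).items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

The resulting matrix can then be fed to whatever heat-map or correlogram plot you prefer.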
